While the field of data science is filled with tests, procedures for managing data, and the occasional brain-bending idea, some ideas in data science are simpler than others. For example, we might want to know how centralized or dispersed our data is. Relatively simple summaries like these are often called descriptive statistics.
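To make the difference between centrality and dispersion concrete, here is a minimal sketch. It is written in Python using only the standard library, not in Quorum, and the two samples are made-up values chosen for illustration (they are not taken from any data set used in this tutorial). It also assumes the sample standard deviation as the measure of spread; the equations Quorum uses are described in its own documentation.

```python
import statistics

# Two made-up samples with the same center but different spread
# (illustrative values only, not from a real data set).
tight = [9, 10, 10, 10, 11]
wide = [2, 6, 10, 14, 18]

for name, values in (("tight", tight), ("wide", wide)):
    center = statistics.mean(values)    # a measure of centrality
    spread = statistics.stdev(values)   # sample standard deviation, a measure of dispersion
    print(f"{name}: mean={center}, stdev={spread:.2f}")
```

Both samples have the same mean, but the second is far more dispersed, which is exactly the kind of distinction descriptive statistics are meant to surface.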
Most data science toolkits include a number of built-in operations for computing these summaries. In this section, we provide code samples for each summary. Further, we have documented the specific equations Quorum uses and placed them in an accessible format in the documentation.
Often when learning about data analysis, it can be helpful to look at multiple data sets during the learning process. The reason is that different scientists format their data in different ways and make different assumptions, and we need to adapt to these to do our analysis correctly. In this section, we are going to look at a new data set, specifically one on dry bean classification. This data set describes dry beans of several classes, with features gathered using computer vision. The data comes from a broader collection of data sets, the University of California, Irvine (UCI) Machine Learning Repository.
The citation for all of these data sets is: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
We are going to be working with the dry bean data set, which can be found here: Dry Beans Dataset.
Citation: KOKLU, M. and OZKAN, I.A. (2020). Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques. Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507. https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset.
In the next tutorial, we will discuss loading files, which covers loading the data set into a data frame.