Descriptive statistics are foundational to understanding data science. In fact, the data science concepts most commonly taught in schools relate to central tendency (e.g., mean, median, mode) and dispersion (e.g., variance, standard deviation). In this session, we will review these concepts and practice programming them using a real data set.
By the end of this tutorial, students will be able to:
- Explain the difference between quantitative and categorical variables.
- Determine which descriptive statistics are appropriate for specific types of data.
- Use the online system and Quorum Studio to calculate mean, median, mode, variance, and standard deviation.
Obtain and Examine the Dry Beans Dataset (5 minutes)
The Quorum server contains a file we will use for these examples. In the first few minutes, obtain the Dry Beans CSV file. This dataset has been slightly modified from the original in the UC Irvine Machine Learning Repository and focuses on the classification of seven bean types based on attributes such as area and perimeter. Notice that the data set is large enough that calculating information by hand would be tedious. Instead, we will write software to give us information about the data.
Calculating Mean (10 minutes)
The mean is commonly referred to as the 'average' and is computed by adding all the terms and dividing by the number of terms. It is considered the 'balancing point' of a set of data. We will reference the mean tutorial to help us run these programs.
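As a rough illustration of "adding all the terms and dividing by the number of terms," here is a minimal sketch in Python rather than Quorum. It is not taken from the mean tutorial, and the list `areas` holds made-up values standing in for a column of the Dry Beans data.

```python
from statistics import mean

# Hypothetical values standing in for a numeric column such as bean area.
areas = [2.0, 4.0, 6.0, 8.0]

# Add all the terms, then divide by the number of terms.
total = sum(areas)
count = len(areas)
average = total / count

# The standard library's mean() should agree with the manual calculation.
library_average = mean(areas)

print("Manual mean:", average)
print("Library mean:", library_average)
```

The manual version mirrors the definition step by step; in practice a library call like `mean` is usually the better choice.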
Calculating Median (10 minutes)
The median is the 'middle value in a group of ordered observations.' It is especially useful for skewed distributions or when outliers distort the mean. A distribution is skewed when, if we visualize the data as a curve, one tail stretches farther to one side than the other. Below is an image of different skewed distributions and how skew affects the median.
We will reference the median tutorial to help us run these programs.
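To make the "middle value" idea concrete, the sketch below sorts the values and picks the middle one, averaging the two middle values when the count is even. It is written in Python as an illustration, not copied from the median tutorial; `median_of` is a hypothetical helper name and the sample values are made up to include one outlier.

```python
def median_of(values):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Made-up data with one large outlier.
values = [1, 2, 3, 4, 100]

mean_value = sum(values) / len(values)
median_value = median_of(values)

print("Mean:", mean_value)      # pulled upward by the outlier
print("Median:", median_value)  # stays near the bulk of the data
```

With this sample, the single outlier pulls the mean far above most of the values, while the median stays at the center of the ordered list.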
Calculating Mode (10 minutes)
The mode is the value that appears most frequently within a data set. Because the mode is not guaranteed to be unique, the code for managing it is more complicated than the code for the other two measures of central tendency. We will reference the mode tutorial for this one. It is not critical to understand every aspect of the code for mode, but it is important to know that the code is available and where to find it if we forget. After all, no programmer can remember every permutation of code, so we regularly use references to remind ourselves and to find new information.
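The reason mode needs extra care is that several values can tie for most frequent. The Python sketch below returns every tied value as a list. It is an illustration under stated assumptions, not the code from the mode tutorial: `modes_of` is a hypothetical helper, and the short list of bean class labels is made up for the example.

```python
from collections import Counter


def modes_of(values):
    """Return all values tied for the highest frequency (hypothetical helper)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Keep every value whose count matches the highest count,
    # in the order each value first appeared.
    return [value for value, count in counts.items() if count == highest]


# Made-up sample of class labels; two labels tie for most frequent.
sample = ["SIRA", "DERMASON", "SIRA", "BOMBAY", "DERMASON"]

print(modes_of(sample))
```

Returning a list, even when there is only one mode, keeps the calling code the same for the unique and the tied case. Python's standard library offers `statistics.multimode` with similar behavior.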
Calculating Variance and Standard Deviation (15 minutes)
Variance and standard deviation measure the variability within a dataset and are examples of measures of dispersion. We will reference the variance and standard deviation tutorial for this session.
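As a sketch of how these two measures relate, the Python example below computes variance as the average squared distance from the mean and standard deviation as its square root. It is not taken from the Quorum tutorial. Whether that tutorial divides by n (population) or n - 1 (sample) is not stated here, so the hypothetical helper `variance_of` supports both, and the data values are made up.

```python
from math import sqrt


def variance_of(values, sample=True):
    """Variance of the data (hypothetical helper).

    sample=True divides by n - 1; sample=False divides by n.
    """
    n = len(values)
    minimum = 2 if sample else 1
    if n < minimum:
        raise ValueError("not enough values to compute variance")
    center = sum(values) / n
    # Sum of squared deviations from the mean.
    squared_deviations = sum((x - center) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


# Made-up data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = variance_of(data, sample=False)
population_std = sqrt(population_variance)
sample_variance = variance_of(data, sample=True)
sample_std = sqrt(sample_variance)

print("Population variance:", population_variance)
print("Population standard deviation:", population_std)
print("Sample variance:", sample_variance)
print("Sample standard deviation:", sample_std)
```

Standard deviation is often easier to interpret than variance because it is expressed in the same units as the original data.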
Wrap-up with MathJax (10 minutes)
In this wrap-up, note that each tutorial uses MathJax to represent the mathematical equations. Consider working with a partner and using a screen reader on one of the pages to walk through how the equations are read in this modality.
To use the MathJax capabilities with our equations, all we need to do is right-click the equation. This opens an options menu with accessibility settings, such as activating the screen reading capabilities or increasing the size of the equations.
To learn more about the accessibility features of MathJax, take a look at this accessibility features guide.
Relevant Common Core Standards
We use the following website for the Common Core standards related to histograms and measures of distribution.
CCSS.MATH.CONTENT.6.SP.A.3: Recognize that a measure of center for a numerical data set summarizes all of its values with a single number, while a measure of variation describes how its values vary with a single number.
CCSS.MATH.CONTENT.HSS.ID.A.2: Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.
CCSS.MATH.CONTENT.HSS.ID.A.4: Use the mean and standard deviation of a data set to fit it to a normal distribution and to estimate population percentages. Recognize that there are data sets for which such a procedure is not appropriate. Use calculators, spreadsheets, and tables to estimate areas under the normal curve.
CCSS.MATH.CONTENT.HSS.ID.A.3: Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).
In the next tutorial, we will discuss Histograms, Skew, and Kurtosis, which covers understanding distributions with histograms, kurtosis, and skew using Quorum.