Calculating the Standard Deviations from the Mean
In data science, calculating the number of standard deviations from the mean is a way to determine whether a dataset has outliers. Outliers are data points that lie an abnormal distance from the other values in a dataset. One reason to look for them is that when we observe very extreme outliers, it is rational to ask whether the data was coded incorrectly or whether the values are real. An easy way to check is to calculate how many standard deviations each value is from the mean. In statistics, the technical term for this is a 'z-score,' which is why z is used as the label for these values.
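For reference, the usual textbook definition of a z-score can be written as follows. The symbols x, x-bar, and s are our own notation, and this page does not state whether Quorum uses the sample or the population standard deviation internally, so treat this as a sketch of the idea rather than the library's exact formula.

$$
z = \frac{x - \bar{x}}{s}
$$

Here x is a single value from the column, \bar{x} is the mean of the column, and s is the column's standard deviation.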
We can calculate z-scores, or the number of standard deviations from the mean, for an entire column using the StandardDeviationsFromMean class. This gives us a new column with numbers in it. In addition, this class has an action, CalculateStandardDeviationFromMean, which can return the answer for a specific value. Notably, a score of 0 means that the value for that data point is at the mean. A score of 1 or -1 means that the data point is one standard deviation above or below the mean, respectively. Having these values for a column lets us quickly skim or filter them, looking for points that may not belong.
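As a quick worked example with made-up numbers (they are not taken from the dry bean data), suppose a column has a mean of 50 and a standard deviation of 10. Subtracting the mean from each value and dividing by the standard deviation gives:

$$
\frac{50 - 50}{10} = 0, \qquad \frac{60 - 50}{10} = 1, \qquad \frac{40 - 50}{10} = -1, \qquad \frac{85 - 50}{10} = 3.5
$$

Under these assumed numbers, a value of 85 sits 3.5 standard deviations above the mean, so it is the kind of point we might want to look at more closely.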
The standard deviations from the mean can be calculated in Quorum by creating a StandardDeviationsFromMean object and passing it to a column. To do this, we will use our 'frame' object and get the column by calling the frame:GetColumn() action. In this case, we will calculate the standard deviations from the mean for the area column of the dry bean classification data. Here is a brief description of how StandardDeviationsFromMean works.
|Action||Description||Usage|
|Calculate(zscores)||Calculates the standard deviations from the mean and stores the answer in the variable inside the parentheses||col:Calculate(zscores)|
|GetResultColumn()||Returns a column with all of the calculated standard deviations from the mean||NumberColumn result = zscores:GetResultColumn()|
Here is the full code for finding the standard deviations from the mean of a numerical column:
//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.DataFrameColumn

//This is the calculation for the standard deviations from the mean (z-scores)
use Libraries.Compute.Statistics.Calculations.StandardDeviationsFromMean
use Libraries.Compute.Statistics.Columns.NumberColumn

//Create a DataFrame, which is essentially a table that understands
//more information about the data that is being loaded.
//Using the default loader is enough for our purposes
DataFrame frame
frame:Load("../Data/Miscellaneous/DryBeans.csv")

//Get the column we want, in this case the first column (index 0), the area
DataFrameColumn col = frame:GetColumn(0)

//The calculation automatically takes missing data into account
StandardDeviationsFromMean zscores

//We pass the StandardDeviationsFromMean object to the column, which does the calculation and stores the answer
//We can then get the answer in code or do something else with it, like output it to the screen
col:Calculate(zscores)
NumberColumn result = zscores:GetResultColumn()
output result:ToText()
Example of calculating the standard deviations from the mean (z-scores)
Congrats! We have just learned how to calculate z-scores!
In the next tutorial, we will discuss skew and how to calculate it.