Calculating the Mode
In data science, the mode is the value that occurs most frequently in a dataset. The mode is unusual among summary statistics because a numerical dataset can have no mode, one mode, or multiple modes. The mode lets scientists see which values would be sampled most often from a dataset.
The mode is calculated similarly to the others. We create a Mode object, then pass it to the column. Like the others, undefined values are automatically removed. One difference is that with Mode, if the column has more than one value with an equally high frequency, like 1,1,2,2,3,3, then all of those values (here, three modes) are returned. If there are no duplicates at all, like 1,2,3,4,5,6, then every value in the column is technically a mode, and all of them are returned. We can detect this case by calling the HasDuplicates() action on the Mode object.
This approach finds all modes in a single column. To get the modes of a column, we first need to extract it and store it in a DataFrameColumn named col. For this example, we will grab the 'Perimeter' column by calling GetColumn(int columnNumber) and passing 1 as the column number.
DataFrameColumn col = frame:GetColumn(1)
Then, we need to create a Mode object. Note that this object takes missing data into account, so missing entries are not counted as values. We then call Calculate on col, passing the Mode object as the parameter.
Next, we create a number array called 'modes', store all the modes found by the calculation in it, and sort it using the Sort() action on the array. We then iterate through the 'modes' array with a repeat loop and output each mode, so in this case we may see many modes in the result. Finally, we can verify whether the dataset contains duplicates with the HasDuplicates() action, which is called on the Mode object. As mentioned previously, this tells us whether the modes are values that genuinely repeat or simply every value in the column.
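As a minimal sketch of this step, the following uses only the classes and actions named in this tutorial. The file path is an assumption taken from the tutorial's full listing and should be adjusted to wherever the DryBeans file lives in the project.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.DataFrameColumn
use Libraries.Compute.Statistics.Calculations.Mode

//Load the data and extract column 1 (the perimeter column in this tutorial)
DataFrame frame
frame:Load("../Data/Miscellaneous/DryBeans.csv")
DataFrameColumn col = frame:GetColumn(1)

//Create the Mode object and hand it to the column.
//The column runs the calculation and the result is stored in the Mode object.
Mode mode
col:Calculate(mode)
```

Nothing is printed at this point. The Mode object simply holds the result until we ask for it.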
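The steps above can be sketched as follows. This version branches on HasDuplicates() rather than just printing it, which assumes the action returns a boolean (the tutorial outputs its result directly, so that seems likely). The message text and the file path are illustrative assumptions.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.DataFrameColumn
use Libraries.Compute.Statistics.Calculations.Mode
use Libraries.Containers.Array

DataFrame frame
frame:Load("../Data/Miscellaneous/DryBeans.csv")
DataFrameColumn col = frame:GetColumn(1)

Mode mode
col:Calculate(mode)

//Collect the modes and sort them so the output is in ascending order
Array<number> modes = mode:GetModes()
modes:Sort()

if mode:HasDuplicates()
    //At least one value repeats, so the modes are meaningful
    i = 0
    repeat while i < modes:GetSize()
        output modes:Get(i)
        i = i + 1
    end
else
    //Assumed reading of the no-duplicates case described earlier
    output "No value repeats, so every value in the column is a mode."
end
```

Checking HasDuplicates() first avoids printing the entire column when no value occurs more than once.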
Here are brief descriptions of the objects and functions we have used throughout this tutorial:
|Function / Class||Description||Usage|
|Mode objectName||This is the mode object used to calculate all the possible modes within a dataset.||Mode mode|
|modeObjectName:GetModes()||Returns an array containing all the modes found in a dataset.||Array<number> modes = mode:GetModes()|
|modeObjectName:HasDuplicates()||Reports whether the data contained any duplicate values. If it did not, every value is returned as a mode.||mode:HasDuplicates()|
Here is some code on how to calculate the mode:
//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.DataFrameColumn
use Libraries.Compute.Statistics.Calculations.Mode
use Libraries.Containers.Array

//Create a DataFrame, which is essentially a table that understands
//more information about the data that is being loaded.
DataFrame frame
//This loads data relative to the project, so put the DryBeans file in the Data/Miscellaneous folder
frame:Load("../Data/Miscellaneous/DryBeans.csv")

//Get the column we want, in this case "perimeter"
DataFrameColumn col = frame:GetColumn(1)

//The calculation for the mode automatically takes missing data into account
Mode mode

//We pass the Mode object to the column, which does the calculation and stores the answer.
//We can then get the answer in code or do something else with it, like output it to the screen.
col:Calculate(mode)

//Get every mode that was found and sort the values
Array<number> modes = mode:GetModes()
modes:Sort()

//Output each mode in turn
i = 0
repeat while i < modes:GetSize()
    output modes:Get(i)
    i = i + 1
end

//Output whether the column contained any duplicate values
output mode:HasDuplicates()
Congrats! We have just learned how to calculate the mode! To view the whole file, we can click here.
In the next tutorial, we will discuss how to calculate the variance and standard deviation.