Data Science

Histograms, Skew, and Kurtosis

Understanding distribution using histograms, kurtosis, and skew

Learning Objectives

A histogram is a graphical representation that organizes a group of data points into user-specified ranges. It synthesizes a data series into an easily understandable visual representation by taking many data points and grouping them into logical ranges (or bins).

While a histogram looks similar to a bar chart (since they both use bars to represent the data), they are not technically the same. A histogram represents the frequency distribution of a numeric variable in a data set, while a bar chart typically represents a graphical comparison of categorical variables.

The student will learn:

Students will be able to create a histogram from a dataset.
Students will be able to label information displayed on the histogram.
Students will be able to customize a histogram.
Students will be able to access the information from the histogram using assistive technology.

To create a histogram, we will begin by creating a DataFrame. To do this, we first need to download the Airbnb NYC dataset (as a CSV file). We will then need to add the columns we want to chart before running the program to display the histogram. In this lesson we can follow the tutorial available on our website: Histograms.

Creating a Histogram using Quorum Studio (10 Minutes)

To start us off, we will download the dataset so that we have the data needed to create a histogram. You can follow the link to download the Airbnb NYC dataset. If you are lost on how to download the dataset from GitHub, here is a link to our tutorial, Downloading CSVs for our charts.

The Airbnb dataset describes public data of Airbnbs in New York City such as the name of the Airbnb, the host, the neighborhood, the price, room type, etc. It has 15 columns and 250 entries (rows) of listed Airbnb stays. We will be using one column, 'Price' for this lesson. Here is a snippet of what this dataset looks like:

Airbnb Prices in NYC CSV
ID	Name	Host ID	Host Name	Neighborhood
2539	Clean & quiet apt home by the park	2787	John	Brooklyn
2595	Skylit Midtown Castle	2845	Jennifer	Manhattan
3647	THE VILLAGE OF HARLEM....NEW YORK !	4632	Elisabeth	Manhattan
3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn

To start creating a histogram, we need to import two libraries: DataFrame, which holds the data that our chart will be drawn from, and Histogram, which allows us to create a histogram. Then we will initialize a DataFrame and load the comma-separated values (CSV) file into the frame.
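A minimal sketch of this setup is shown here. The two use statements and the Load call follow the pattern of the examples later in this lesson; the file path is a hypothetical placeholder and should point to wherever the downloaded CSV was saved.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

// create the frame that will hold the data
DataFrame frame
// hypothetical path: replace with the location of the downloaded Airbnb CSV
frame:Load("Data/AirbnbNYC.csv")
```

At this point the program loads the file but does not draw anything yet.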

Reading and Displaying Data (5 minutes)

Now that we have the DataFrame created, let us read the data from the Airbnb NYC dataset and display the histogram.

The steps that we describe in this task can be followed in the 'Loading and Formatting' section that is available in our Histograms tutorial.

So far, we have only loaded the comma-separated values file into the DataFrame; we are not reading anything from it yet. The first thing that we need to do is select the columns to be read by our histogram. Specifically, we will add the one column that we will be using in this tutorial: 'Price.' After we have selected the column, we need to create a histogram object from the frame that we have filled. Finally, we can display the histogram. Next, we will be labeling and customizing our histogram.
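These steps can be sketched as follows, using the same calls that appear in the examples later in this lesson. The file path is a hypothetical placeholder, and the sketch assumes the column header in the CSV is exactly "Price".

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

DataFrame frame
// hypothetical path: replace with the location of the downloaded Airbnb CSV
frame:Load("Data/AirbnbNYC.csv")

// select the single column we want to chart (assumed header: "Price")
frame:AddSelectedColumns("Price")

// build the histogram from the selected column and show it
Histogram chart = frame:Histogram()
chart:Display()
```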

Labeling the Histogram (10 minutes)

The next step is to label the histogram: we will add a title for the whole chart, titles for both axes, and a subtitle, and we will change the font size of the text. Adding these features will allow us to present our data in a clearer way.

The steps that we describe in this task can be followed in the 'Labeling the Histogram' section that is available in our Histograms tutorial.

First, we will add labels for our chart, x-axis, and y-axis. This allows the reader to distinguish between the information and to understand what data they are looking at. For our chart we can label it the following way: title - 'Price per night with AirBnB in 2019 (NYC)'; x-axis - 'Price ($)'; y-axis - 'Number of Stays.' If we feel like those labels are not enough, we can add a subtitle to the histogram. For example, for this histogram the subtitle can be 'How expensive is it to stay in NYC?'
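The labels above might be applied as in this sketch. The setter names are the ones used in the examples later in this lesson; the file path is a hypothetical placeholder, and the font size of 20 is only an illustrative value.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

DataFrame frame
// hypothetical path: replace with the location of the downloaded Airbnb CSV
frame:Load("Data/AirbnbNYC.csv")
frame:AddSelectedColumns("Price")

Histogram chart = frame:Histogram()
// title, subtitle, and axis titles taken from the text above
chart:SetTitle("Price per night with AirBnB in 2019 (NYC)")
chart:SetSubtitle("How expensive is it to stay in NYC?")
chart:SetXAxisTitle("Price ($)")
chart:SetYAxisTitle("Number of Stays")
// illustrative font size for the title
chart:SetTitleFontSize(20)

chart:Display()
```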

Customizing the Histogram (5 minutes)

For this lesson, we will explore how to change the color palette, change the tick interval, and adjust the x-axis minimum value. The lesson uses the warm palette, but there are many other color palettes to try; for those, we can refer to the information on the Color Accessibility page.

The steps that we describe in this task can be followed in the 'Customizing the Histogram' section that is available in our Histograms tutorial. The tick interval is the distance between two consecutive tick marks on an axis of our histogram. We will also set the x-axis minimum to demonstrate further customization, since this controls the value at which our chart starts.
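A sketch of these three customizations follows, under a few assumptions. SetColorPaletteToTrustworthy is the palette call that appears in this lesson's later examples and stands in here for the warm palette, whose setter is not shown in this lesson. The tick interval of 100 and the minimum of 0 are illustrative values, and the file path is a hypothetical placeholder.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

DataFrame frame
// hypothetical path: replace with the location of the downloaded Airbnb CSV
frame:Load("Data/AirbnbNYC.csv")
frame:AddSelectedColumns("Price")

Histogram chart = frame:Histogram()
// palette call used elsewhere in this lesson (stand-in for the warm palette)
chart:SetColorPaletteToTrustworthy()
// illustrative value: place a tick mark every 100 dollars on the x axis
chart:SetXTickInterval(100)
// illustrative value: start the x axis at 0
chart:SetXAxisMinimum(0)

chart:Display()
```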

Accessing the Histogram (5 minutes)

Now let us explore the graphic using the accessibility tools on our devices. When the histogram has been created, we should see our chart pop up in a separate window. From there, we can reference this tutorial on keyboard navigation using the arrow keys for accessibility. One note is that when a chart is saved onto our computer, it is saved as an SVG (scalable vector graphic), which preserves not only resizability but also the ability to read the chart using a screen reader.

Relating our Histogram to Skew and Kurtosis (10 Minutes)

In statistics, two measures are commonly used to describe the shape of a histogram: skew and kurtosis. Skew is a number, positive or negative, that describes whether the data is stretched further in one direction than the other. A skew of 0 means that the data is symmetrical around the mean. A positive skew, or right skew, indicates the tail of the data is longer above the mean; a negative skew, or left skew, indicates the tail is longer below the mean. We can follow this tutorial on how to calculate skew in Quorum.

Skew = \frac{n}{(n - 1)(n - 2)} \sum_{i = 1}^{n} \left( \frac{x_{i} - \bar{x}}{s} \right)^{3}
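As a small worked example of the skew formula, consider the made-up data set 1, 2, 3, 4, 10 (not taken from any of the lesson's files), and assume s is the sample standard deviation, computed with n - 1 in the denominator:

```latex
\begin{aligned}
n &= 5, \qquad \bar{x} = \frac{1 + 2 + 3 + 4 + 10}{5} = 4 \\
s &= \sqrt{\frac{(-3)^{2} + (-2)^{2} + (-1)^{2} + 0^{2} + 6^{2}}{5 - 1}} = \sqrt{12.5} \approx 3.536 \\
\sum_{i = 1}^{n} (x_{i} - \bar{x})^{3} &= -27 - 8 - 1 + 0 + 216 = 180 \\
\text{Skew} &= \frac{5}{(4)(3)} \cdot \frac{180}{s^{3}} \approx 0.4167 \times 4.073 \approx 1.697
\end{aligned}
```

The result is positive, which matches the long right tail created by the value 10.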

Kurtosis describes how heavy the tails of a distribution are. High kurtosis means the dataset has heavy tails, which usually shows up as more extreme outliers, while low kurtosis means light tails and few outliers. There are technical terms we can look up for these properties (leptokurtic for high kurtosis and platykurtic for low kurtosis), but they are rarely used even in the academic literature and are not important to memorize. We can follow this tutorial on how to calculate kurtosis in Quorum.

Kurtosis = \frac{n (n + 1)}{(n - 1)(n - 2)(n - 3) s^{4}} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{4} - \frac{3 (n - 1)^{2}}{(n - 2)(n - 3)}
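As a small worked example of the kurtosis formula, consider the made-up data set 1, 2, 3, 4, 10 (not taken from any of the lesson's files), and assume s is the sample standard deviation, computed with n - 1 in the denominator:

```latex
\begin{aligned}
n &= 5, \qquad \bar{x} = 4, \qquad s^{2} = \frac{9 + 4 + 1 + 0 + 36}{5 - 1} = 12.5, \qquad s^{4} = 156.25 \\
\sum_{i = 1}^{n} (x_{i} - \bar{x})^{4} &= 81 + 16 + 1 + 0 + 1296 = 1394 \\
\text{Kurtosis} &= \frac{5 \cdot 6}{(4)(3)(2)(156.25)} \cdot 1394 - \frac{3 \cdot 4^{2}}{(3)(2)} \\
&= 1.25 \times 8.9216 - 8 = 3.152
\end{aligned}
```

Because of the subtracted term, this formula reports excess kurtosis, so a normal distribution scores about 0. The value here is well above 0, reflecting the single far-out point at 10.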

Note that while we present the equation here in MathJax form (visual and accessible), memorizing this equation is not important. The entire purpose of data science is to abstract away some of this mathematics in a programming language, to obtain the results of the equation without having to understand the nuance. If we did not do that, realistically data science would be too difficult for any one person to understand.

Real World Examples of Skew and Kurtosis (15 Minutes)

In this section, we will be showing graphs of skew and kurtosis to better understand the shape and how we can visually see these measurements using a histogram.

Skew

For this exercise we will be using two datasets which can be downloaded from the following links: Number of Children CSV and Exam Scores CSV. The graph with the number of children will demonstrate a histogram with a right skew while the graph with Exam Scores will demonstrate a histogram with a left skew.

Right Skew

use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

/*
    This is an example of a Histogram built in Quorum.
    The dataset we will be working with records the number of children in a set of families.
*/

// create frame component
DataFrame frame
// read in data from the number of children csv
frame:Load("../Data/Miscellaneous/numberOfChildren.csv")

// pull out the specific column from the csv that we are charting
// note: histograms do not support factors
frame:AddSelectedColumns("Number of Children")

// create histogram object from the selected column
Histogram chart = frame:Histogram()
chart:SetTitle("Number of Children for Arbitrary Families")
// let's adjust the font size so it appears nicely on the screen
chart:SetTitleFontSize(20)

// label the x axis, y axis, and the legend title
chart:SetXAxisTitle("Number of Children")
chart:SetYAxisTitle("Total Individuals")
chart:SetLegendTitle("People who have children")
// add subtitle for more description
chart:SetSubtitle("How many children do families have?")

// customization features
chart:SetColorPaletteToTrustworthy()
// define a clear interval, we place a tick at every 1 child
chart:SetXTickInterval(1)
// let's start our chart at 0 to examine the curve as a whole
chart:SetXAxisMinimum(0)
// let's also extend our y axis to see the skew
chart:SetYAxisMaximum(40)
chart:SetXAxisMaximum(10)

// display the histogram
chart:Display()

Left Skew

use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

/*
    This is an example of a Histogram built in Quorum.
    The dataset we will be working with records student exam scores.
*/

// create frame component
DataFrame frame
// read in data from the exam scores csv
frame:Load("../Data/Miscellaneous/exams.csv")

// pull out the specific column from the csv that we are charting
// note: histograms do not support factors
frame:AddSelectedColumns("math score")

// create histogram object from the selected column
Histogram chart = frame:Histogram()
chart:SetTitle("Math Exam Scores")
// let's adjust the font size so it appears nicely on the screen
chart:SetTitleFontSize(20)

// label the x axis, y axis, and the legend title
chart:SetXAxisTitle("Math Score")
chart:SetYAxisTitle("Number of Students")
chart:SetLegendTitle("Students who took the exam")
// add subtitle for more description
chart:SetSubtitle("How did students score on the math exam?")

// customization features
chart:SetColorPaletteToTrustworthy()
// define a clear interval, we place a tick at every 5 points
chart:SetXTickInterval(5)
// let's start our chart at 0 to examine the curve as a whole
chart:SetXAxisMinimum(0)
// let's also extend our y axis to see the skew
chart:SetYAxisMaximum(20)
chart:SetXAxisMaximum(100)

// display the histogram
chart:Display()

Kurtosis

For this exercise we will be using two datasets which can be downloaded from the following links: Iris classification CSV and Electric Cars (EVs) CSV. The graph with Irises will demonstrate a histogram with a high kurtosis while the graph with EVs will demonstrate a histogram with a low kurtosis.

High Kurtosis

use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

/*
    This is an example of a Histogram built in Quorum
*/

// create frame component
DataFrame frame
// read in data from the iris csv
frame:Load("../Data/Science/Iris.csv")

// pull out the specific column from the csv that we are charting
// note: histograms do not support factors
frame:AddSelectedColumns("SepalWidthCm")

// create histogram object from the selected column
Histogram chart = frame:Histogram()
chart:SetTitle("Iris Classification")
// let's adjust the font size so it appears nicely on the screen
chart:SetTitleFontSize(20)

// label the x axis, y axis, and the legend title
chart:SetXAxisTitle("Sepal Width (CM)")
chart:SetYAxisTitle("Number of Irises")
chart:SetLegendTitle("Measurements of Iris")
// add subtitle for more description
chart:SetSubtitle("How wide are iris sepals?")

// customization features
chart:SetColorPaletteToTrustworthy()
// define a clear interval, we place a tick at every 0.2 cm
chart:SetXTickInterval(0.2)
// let's start our chart at 0 to examine the curve as a whole
chart:SetXAxisMinimum(0)
chart:SetXAxisMaximum(6)

// display the histogram
chart:Display()

Low Kurtosis

In this example, notice the smaller bell curve in the first half of the histogram, which reflects the low kurtosis.

use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.Histogram

/*
    This is an example of a Histogram built in Quorum
*/

// create frame component
DataFrame frame
// read in data from the cars csv
frame:Load("../Data/Miscellaneous/Cars 1.csv")

// pull out the specific column from the csv that we are charting
// note: histograms do not support factors
frame:AddSelectedColumns("Dimensions.Width")

// create histogram object from the selected column
Histogram chart = frame:Histogram()
chart:SetTitle("EV Cars")
// let's adjust the font size so it appears nicely on the screen
chart:SetTitleFontSize(20)

// label the x axis, y axis, and the legend title
chart:SetXAxisTitle("Car Width")
chart:SetYAxisTitle("Number of Cars")
chart:SetLegendTitle("Attributes of EV Cars")
// add subtitle for more description
chart:SetSubtitle("How wide are the cars?")

// customization features
chart:SetColorPaletteToTrustworthy()
// define a clear interval, we place a tick at every 10 units
chart:SetXTickInterval(10)
// let's start our x axis at 1 and end it at 250 to examine the curve as a whole
chart:SetXAxisMinimum(1)
chart:SetXAxisMaximum(250)

// display the histogram
chart:Display()

With these examples, let us take time to discuss the differences between each graph and examine the shape in relation to skew and kurtosis.

Relevant Common Core Standards

We use the following website for common core standards in relation to histograms and measurements of distribution.

CCSS.MATH.CONTENT.HSS.ID.A.1: Represent data with plots on the real number line (dot plots, histograms, and box plots).
CCSS.MATH.CONTENT.HSS.ID.A.2: Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.
CCSS.MATH.CONTENT.HSS.ID.A.3: Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).
CCSS.MATH.CONTENT.HSS.ID.A.4: Use the mean and standard deviation of a data set to fit it to a normal distribution and to estimate population percentages. Recognize that there are data sets for which such a procedure is not appropriate. Use calculators, spreadsheets, and tables to estimate areas under the normal curve.

Next Tutorial

In the next tutorial, we will discuss Pie Charts vs. Stacked Bar Charts, which describes understanding statistics and how to calculate important values using Quorum.
