Box Plots, Violin Plots, and Understanding Dispersion

A box plot (also known as a box-and-whisker plot) is a type of chart that shows the distribution and skewness of numerical data by displaying its quartiles (or percentiles) and a measure of center such as the median. Violin plots also depict percentiles, but they additionally show how dense the data is at each value. Overall, we often use chart types like these to better understand dispersion, or how and where our data varies.

Learning Objectives:

1. Students will be able to create a box plot and a violin plot from a dataset.
2. Students will be able to label information displayed on the box plot, customize it, and use assistive technology with a box plot.
3. Students will be able to label information displayed on the violin plot, customize it, and use assistive technology with a violin plot.
4. Students will be able to understand the dispersion in relation to these two plots.

Understanding Dispersion through Box Plots and Violin Plots (15 Minutes)

Dispersion, also known as variability, scatter, or spread, measures how squeezed or stretched the data is. Common measures of dispersion are variance, standard deviation, and interquartile range. Variance is the average of the squared distances of the data values from the mean, while standard deviation is the square root of the variance, which expresses that spread in the same units as the data.

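Standard Deviation: $s_{x}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}}$

Variance: $\sigma_{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}$

Note that the standard deviation above is written in its sample form (dividing by $n-1$), while the variance is written in its population form (dividing by $n$).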

High dispersion means that the data values are more spread out or scattered, while low dispersion means that the data values are more clustered. We can learn more about how to calculate standard deviation and variance using the following tutorial.
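As a small worked illustration (the four values below are made up for this example and do not come from any dataset in this lesson), consider the data 2, 4, 4, 6. Both measures are built from the squared distances to the mean:

$$\overline{x}=\frac{2+4+4+6}{4}=4,\qquad \sum_{i=1}^{4}\left(x_{i}-\overline{x}\right)^{2}=(-2)^{2}+0^{2}+0^{2}+2^{2}=8$$

$$\sigma_{x}^{2}=\frac{8}{4}=2,\qquad s_{x}=\sqrt{\frac{8}{3}}\approx 1.63$$

If the values were more spread out, for example 0, 4, 4, 8, the sum of squared distances would grow to 32 and both measures would increase, which is what we mean by higher dispersion.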

Probability Density

In the violin plot below we can see both high and low dispersion. The last violin repeated in each group shows noticeably low dispersion, meaning its values are tightly clustered. The violin plot is separated by weather condition, and we examine different factors for each condition. A violin plot works well for seeing the changes in temperature since it displays the density.

Weather Conditions in the US, Violin Plot with 4 groups and 16 plots.
X axis shows Main Condition and has values Clear, Clouds, Rain, and Snow.
Y axis shows Temperature and ranges from -5 to 115.
Legend shows Temperature Factors and has values Low Temperature, High Temperature, Humidity Percentage, and Wind MPH.

Interquartile range (IQR) is another measure of dispersion that is well suited to a box plot. The IQR measures the spread of the middle half of our data: we subtract the lower quartile from the upper quartile to get the range covered by the middle 50%. The IQR therefore tells us how spread out the central portion of our dataset is, without being affected by extreme values. We can learn more about how to calculate the IQR with the following tutorial.
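As a small worked illustration (the eight values are made up for this example), take the sorted data 1, 3, 5, 7, 9, 11, 13, 15. This sketch assumes the quartiles are found as the medians of the lower and upper halves; software may use a slightly different interpolation rule and report slightly different quartiles.

$$\text{median}=\frac{7+9}{2}=8,\qquad Q_{1}=\frac{3+5}{2}=4,\qquad Q_{3}=\frac{11+13}{2}=12$$

$$\text{IQR}=Q_{3}-Q_{1}=12-4=8$$

On a box plot, the box would stretch from 4 to 12 with the median line at 8.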

IQR on a Box Plot

Here we have a box plot to examine the IQR. Note that the IQR is found by subtracting the lower quartile from the upper quartile. A box plot divides the data into four sections of 25% each, so we can see the upper and lower quartiles as well as the median.

Similar to histograms, box plots and violin plots also help describe the shape of our data to determine dispersion within our datasets. Box plots are great tools for basic summary statistics such as finding the IQR, skew, and outliers. However, they do not show how the values are distributed within each quartile. On the other hand, violin plots show this visually through their display of peaks. This is important because a violin plot depicts both the summary statistics and the density of each variable.

Now, let us view some violin plots and box plots with high and low dispersion. We will be using a dataset involving exam scores for both plots, and we can view this dataset by clicking on the following link: exam score CSV.

This is an example of a box plot showing both low and high dispersion in test scores. Notice how the values are closer together (compact) for subjects such as reading, versus being more spread out for others. We can see that math has a higher dispersion than the reading and writing scores. One possible explanation is that math has more variation among lower scores because students find its topics harder than reading and writing.

``````use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.BoxPlot

// create frame component
DataFrame frame
// read in data from height of male and female by country 2022 csv
// NOTE: this file location is an assumed placeholder; use the path where the CSV was saved
frame:Load("Data/Height of Male and Female by Country 2022.csv")

// pull out specified columns from csv that we are comparing
frame:AddSelectedColumns("Male Height in Cm")
frame:AddSelectedColumns("Female Height in Cm")

// create box plot object
// this will create two separate boxes to compare
BoxPlot chart = frame:BoxPlot()
chart:SetTitle("Height of Males and Females in the World")
// let's adjust the font size so it appears nicely on the screen
chart:SetTitleFontSize(20)

// label the x axis and the y axis
chart:SetXAxisTitle("Sex")
chart:SetYAxisTitle("Height (cm)")
// add subtitle for more description
chart:SetSubtitle("What is the average height of the population by sex?")
// leave outlier points off the chart
chart:HideOutliers()

// customization features
chart:SetColorPaletteToPlayful()
// define a clear interval, we separate each interval by 10
chart:SetYTickInterval(10)
chart:SetXAxisMinimum(0)
chart:FlipOrientation()

// display the box plot
chart:Display()
``````

Example of a Box Plot with high and low dispersion

How we can graphically see high and low dispersion

Now, this is the same dataset but in the form of a violin plot. Notice the peaks in density, which a box plot reduces to a few summary values. We can see the density of the dataset and visualize the distribution of test scores. Here we can easily see roughly how many students scored in the range between the upper and lower quartiles. We can still make the same assumption that we made from the box plot, but now we see student performance more holistically.

Since violin plots are fairly similar to box plots, we only have to change the following 2 lines of code:

``use Libraries.Interface.Controls.Charts.BoxPlot``

changes to

``use Libraries.Interface.Controls.Charts.ViolinPlot``

and

``BoxPlot chart = frame:BoxPlot()``

changes to

``ViolinPlot chart = frame:ViolinPlot()``

Example of a Violin Plot with high and low dispersion

How we can graphically see low and high dispersion

Notice that the violin for each test shows both dense and sparse regions. Many of the lower scores accumulate in the wide part of the violin (roughly 320 to 340), while the higher scores appear in the thin tails (roughly 590 to 740), where far fewer students scored.

Creating a Box Plot using Quorum Studio (10 minutes)

To create a box plot, we will begin by creating a DataFrame. In order to do this, we will first need to download the gender height dataset (as a CSV file). We will then need to add the Factors and Columns before running the program to display the box plot. In this lesson we can follow the tutorial available on our website: box plots.

The gender dataset describes the average height of males and females in different countries. It contains six columns and 200 entries (rows) of data. We will be using two columns within this dataset: 'Male Height in Cm' and 'Female Height in Cm.'

Height of Male and Female by Country 2022 CSV
| Country Name | Male Height in Cm | Female Height in Cm |
| --- | --- | --- |
| Netherlands | 183.78 | 170.36 |
| Montenegro | 183.30 | 169.96 |
| Estonia | 182.79 | 168.66 |
| Bosnia and Herzegovina | 182.47 | 167.47 |
| Iceland | 182.10 | 168.91 |

To start creating a box plot, we need to import two libraries: DataFrame, which holds the data that our chart will be drawn from, and BoxPlot, which allows us to create a box plot. Then we will initialize a DataFrame and load a comma separated values (CSV) file into the frame.

In order to read our dataset properly into Quorum, we will be using the action, Load(text location) which is found within our DataFrame class. Notice that inside the action's parameters, we will be inserting the file location of our gender height dataset.

Now that we have read in our data, we can select our desired columns: male and female height in centimeters. For this, we will be using the action, AddSelectedColumns(text heading) which is found within our DataFrame class. Inside the parameters, we will want to type the heading from our dataset which will find and insert the selected heading into our DataFrame. We can reference the 'Loading and Formatting' Section in our box plots tutorial regarding these steps.
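The loading and selecting steps described above can be sketched as follows. This is a minimal sketch, not the tutorial's exact code: the file location passed to Load is an assumed placeholder and should be replaced with wherever the CSV was actually saved, while the two headings are the column names listed for this dataset.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.BoxPlot

// create the frame that will hold our dataset
DataFrame frame

// assumed file location (hypothetical path); point this at the downloaded CSV
frame:Load("Data/Height of Male and Female by Country 2022.csv")

// select the two columns we want to compare, using their exact headings
frame:AddSelectedColumns("Male Height in Cm")
frame:AddSelectedColumns("Female Height in Cm")
```

Each call to AddSelectedColumns adds one heading, so we call it once per column we want to appear in the chart.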

Finally, we can create our box plot object using our frame with the following line of code:

``BoxPlot chart = frame:BoxPlot()``

Right now, we have a simple box plot, but there is not much to it. We will proceed to organize our box plot with labels and further customization in the next section.

Labeling and Customizing the Box Plots (5 minutes)

The steps that we describe in this task can be followed in the 'Labeling the Box Plot' section that is available in our box plots tutorial.

First, we will add labels for our chart, x-axis, and y-axis. This allows the reader to distinguish between the information and to understand what data they are looking at. For our chart we can label it the following way: title - 'Height of Males and Females in the World'; x-axis - 'Sex'; y-axis - 'Height (cm).' If we feel like those labels are not enough, we can add a subtitle to the box plot. For example, for this box plot the subtitle can be 'What is the average height of the population by sex?'

Other ways we can further customize our box plot include modifying the font size, the location of our legend, the color palette, and even the orientation of the box plot. Many of these features are available across all charts in Quorum. One feature unique to box plots is HideOutliers(), which leaves any outlier data points off the chart. We can reference the section 'Customizing the Data Chart' for more information in the box plots tutorial.

Accessing the Box Plot (5 minutes)

To view our actual plot we can type the following line of code:

``chart:Display()``

This displays our chart in a pop-up window inside Quorum Studio or in the side view of the Quorum online editor. Now let us explore the graphic using the accessibility tools on our devices. Once the box plot has been created, we should see our chart pop up in a separate window. From there, we can reference this tutorial on keyboard navigation using the arrow keys for accessibility. Note that when a chart is saved to our computer, it is saved as an SVG (scalable vector graphic), which preserves not only resizability but also accessibility, so the chart can still be read with a screen reader.

Creating a Violin Plot using Quorum Studio (10 minutes)

A violin plot is a hybrid of a box plot and a kernel density plot (which shows peaks in the data). It is used to visualize the distribution of numerical data. Unlike a box plot (that shows summary statistics), violin plots show summary statistics and the density of each variable.
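For readers who want the idea behind the density curve, a kernel density estimate is commonly written as below. This is the general textbook form only; which kernel $K$ and bandwidth $h$ Quorum uses to draw its violin plots is not stated in this lesson, so both symbols are left generic here.

$$\hat{f}_{h}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\!\left(\frac{x-x_{i}}{h}\right)$$

Each data value $x_{i}$ contributes a small bump centered on itself, and the bumps are added together. Where many values sit close together the bumps stack up into a peak, which is why the wide parts of a violin mark where the data is densest.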

To create a violin plot, we will begin by creating a DataFrame. In order to do this, we will first need to download the cats dataset (as a CSV file). We will then need to add the Factors and Columns before running the program to display the violin plot. In this lesson we can follow the tutorial available on our website: violin plots.

The Cats dataset includes information like each cat's name, weight, and temperament. It has 9 columns and 68 entries (rows) of cats. We will be focusing on the 'Minimum Life Span' and 'Maximum Life Span' columns for this lesson, but in general any column with a numerical range would work (e.g., weight, but not temperament). Here's a snippet of the dataset:

Cats CSV
| Name | Minimum Life Span | Maximum Life Span |
| --- | --- | --- |
| Abyssinian | 14.0 | 15.0 |
| Aegean | 9.0 | 12.0 |
| American Bobtail | 11.0 | 15.0 |
| American Curl | 12.0 | 16.0 |
| American Shorthair | 15.0 | 17.0 |

To start creating a violin plot, we need to import two libraries: DataFrame, which holds the data that our chart will be drawn from, and ViolinPlot, which allows us to create a violin plot. Then we will initialize a DataFrame and load a comma separated values (CSV) file into the frame.

Now that we have the DataFrame created, let us load the data from the cats dataset and display it.

The steps that we describe in this task can be followed in the 'Loading and Formatting' section that is available in our violin plots tutorial.

Currently, we have only loaded the comma separated values file into the DataFrame, and we are not reading anything from it. The first thing that we need to do is select the columns to be read by our violin plot to create the x-axis. In this example, we added 'Minimum Life Span' and 'Maximum Life Span.' We don't need to add anything for the y-axis, since it's just the associated values for maximum and minimum life span.
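The loading and column selection just described can be sketched as follows. This is a minimal sketch that reuses the Load and AddSelectedColumns actions introduced for the box plot; the file location is an assumed placeholder and should be replaced with wherever the cats CSV was saved.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.ViolinPlot

// create the frame that will hold the cats dataset
DataFrame frame

// assumed file location (hypothetical path); point this at the downloaded CSV
frame:Load("Data/Cats.csv")

// select the two life span columns, using their exact headings
frame:AddSelectedColumns("Minimum Life Span")
frame:AddSelectedColumns("Maximum Life Span")
```

Both selected columns hold numbers, which is what a violin plot needs in order to draw a density for each one.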

Finally, we can create our violin plot object using our frame with the following line of code:

``ViolinPlot chart = frame:ViolinPlot()``

Right now, we have a simple violin plot, but there is not much to it. We will proceed to organize our violin plot with labels and further customization in the next section.

Labeling and Customizing the Violin Plots (5 minutes)

The next step is to add specific labels to the violin plot as a whole and to both axes, add a subtitle, and change the font size of the text. Adding these features will allow us to present our data in a clearer way.

The steps that we describe in this task can be followed in the 'Labeling the Violin Plot' section that is available in our violin plots tutorial.

First, we will add labels for our chart, x-axis, and y-axis. This allows the reader to distinguish between the information and to understand what data they are looking at. For our chart we can label it the following way: title - 'Minimum and Maximum Life Expectancy for Various Cat Breeds'; x-axis - 'Min/Max Lifespan'; y-axis - 'Type of Extrema.' If we feel like those labels are not enough, we can add a subtitle to the violin plot. For example, for this violin plot the subtitle can be 'How long do cats live?'

Like box plots, violin plots have some features of their own that can be customized. One feature we can explore here is 'flipping' the orientation of the data. Other ways we can further customize our violin plot include modifying the font size, the location of our legend, and the color palette. Many of these features are available across all charts in Quorum. We can reference the section 'Customizing the Data Chart' for more information in the violin plots tutorial.
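Putting the labels and customizations together, a violin plot program might look like the sketch below. It assumes the labeling and customization actions shown earlier on the box plot (SetTitle, SetXAxisTitle, SetYAxisTitle, SetSubtitle, SetTitleFontSize, SetColorPaletteToPlayful, FlipOrientation) behave the same way on a ViolinPlot, which follows from this lesson's note that only two lines change between the two charts. The file location is again an assumed placeholder.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Interface.Controls.Charts.ViolinPlot

// load the data and select the columns (assumed, hypothetical file path)
DataFrame frame
frame:Load("Data/Cats.csv")
frame:AddSelectedColumns("Minimum Life Span")
frame:AddSelectedColumns("Maximum Life Span")

// create the violin plot from the frame
ViolinPlot chart = frame:ViolinPlot()

// labels suggested in this section
chart:SetTitle("Minimum and Maximum Life Expectancy for Various Cat Breeds")
chart:SetXAxisTitle("Min/Max Lifespan")
chart:SetYAxisTitle("Type of Extrema")
chart:SetSubtitle("How long do cats live?")

// customization: larger title, a color palette, and a flipped orientation
chart:SetTitleFontSize(20)
chart:SetColorPaletteToPlayful()
chart:FlipOrientation()
```

The sketch stops before displaying the chart, since that step is covered in the next section.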

Accessing the Violin Plots (5 minutes)

To view our actual plot we can type the following line of code:

``chart:Display()``

This displays our chart in a pop-up window inside Quorum Studio or in the side view of the Quorum online editor. Now let us explore the graphic using the accessibility tools on our devices. Once the violin plot has been created, we should see our chart pop up in a separate window. From there, we can reference this tutorial on keyboard navigation using the arrow keys for accessibility. Note that when a chart is saved to our computer, it is saved as an SVG (scalable vector graphic), which preserves not only resizability but also accessibility, so the chart can still be read with a screen reader.

Break-Out Group Discussion (5 Minutes)

We can use this time to find datasets that would be a good fit for a box plot and a violin plot, or simply talk about real-world uses for each chart. Since violin plots are less common, we can discuss why that is, and also note the benefits of using a violin plot versus a box plot.

Relevant Common Core Standards

We use the following website for Common Core standards in relation to box plots and measures of dispersion.

CCSS.MATH.CONTENT.6.SP.B.4: Display numerical data in plots on a number line, including dot plots, histograms, and box plots.

CCSS.MATH.CONTENT.6.SP.B.5: Summarize numerical data sets in relation to their context, such as by:

CCSS.MATH.CONTENT.6.SP.B.5.A: Reporting the number of observations.

CCSS.MATH.CONTENT.6.SP.B.5.B: Describing the nature of the attribute under investigation, including how it was measured and its units of measurement.

CCSS.MATH.CONTENT.6.SP.B.5.C: Giving quantitative measures of center (median and/or mean) and variability (interquartile range and/or mean absolute deviation), as well as describing any overall pattern and any striking deviations from the overall pattern with reference to the context in which the data were gathered.

CCSS.MATH.CONTENT.6.SP.B.5.D: Relating the choice of measures of center and variability to the shape of the data distribution and the context in which the data were gathered.

Next Tutorial

In the next tutorial, we will discuss Regression, which describes understanding the regression model using Quorum.