Data Science

Scatterplots and Correlations

Understanding scatter plots, correlation and R^2

Learning Objectives

A scatter plot uses dots to represent values for two different numeric variables. The position of each dot on the x-axis and y-axis indicates the values for one data point. Scatter plots are used to observe relationships between variables.

By the end of this lesson:

Students will be able to create and customize a scatter plot from a dataset.
Students will be able to access the information from the scatter plot using assistive technology.
Students will understand how to interpret correlation and R-squared.

Creating a Scatter Plot Using Quorum Studio (15 minutes)

To create a scatter plot, we will begin by creating a DataFrame. To do this, we first need to download the Dogs dataset (as a CSV file). We then add the factors and columns before running the program to display the scatter plot. In this lesson we can follow the Scatter Plot tutorial.

The Dogs dataset describes the various traits of different dog breeds. It has 12 columns and 106 rows, but we will not be using all of the columns in this example. We will be using 3 columns: "Maximum Weight", "Maximum Life Span", and "Breed ". Here is a snippet of what the dataset should look like:

Dog CSV
Breeding Group	Maximum Life Span	Maximum Weight
Toy	12	13
Hound	13	60
Terrier	13	65
Working	12	120
Working	14	115

The steps described in this task can be followed in the 'Loading and Formatting' section of our Scatter Plot tutorial.

To start, we will download the Dogs dataset so that we have the data needed to create a scatter plot. You can follow the link to download the Dogs dataset. If you are unsure how to download the dataset, here is a link to our tutorial, Downloading CSVs for Our Charts.

To start creating a scatter plot, we need to import two libraries: DataFrame, which holds the data our chart will be drawn from, and ScatterPlot, which allows us to create a scatter plot. Then we will initialize a DataFrame and load the comma-separated values (CSV) file into the frame.

So far, we have only loaded the CSV file into the DataFrame, and we are not reading anything from it. The first thing we need to do is select the columns and factors to be read by our scatter plot. Specifically, we will add the three columns that we will be using in this tutorial: "Maximum Weight", "Maximum Life Span", and "Breed ". After we have selected the factors and columns, we need to create a scatter plot object from the frame that we have filled. Finally, we can display the contents of that scatter plot. Next, we will label and customize our scatter plot.

Labeling Scatter Plots (10 minutes)

The next step is to add a title to the scatter plot, label both axes, add a subtitle, and change the font size of the text. Adding these features will allow us to present our data more clearly. The steps described in this task can be followed in the 'Labeling the Scatter Plot' section of our Scatter Plot tutorial.

We will add labels for our chart, x-axis, and y-axis. These labels let the reader tell the pieces of information apart and understand what data they are looking at. We can label our chart the following way: title - "Dog Weight and Life Span"; x-axis - "Maximum Life Span (years)"; y-axis - "Maximum Weight (pounds)". If you feel those labels are not enough, you can add a subtitle to the scatter plot. For example, for this scatter plot the subtitle can be "Does weight correlate to life span for dogs?"

Accessing Scatter Plots (10 minutes)

Now let us explore the graphic using the accessibility tools on our devices. Once the scatter plot has been created, we should see our chart pop up in a separate window. From there, we can reference this tutorial on keyboard navigation using the arrow keys for accessibility. Note that when a scatter plot is saved to our computer, it is saved as an SVG (scalable vector graphic), which keeps the chart both resizable and readable with a screen reader.

Correlation (R) (10 minutes)

With scatter plots, you are often investigating the relationship between variables. Because how strongly the variables are related matters when reading a graph, scientists use a couple of terms to describe it.

Firstly: correlation. Correlation is when variables are related to each other. It's important to note that being related to each other doesn't necessarily mean one causes the other. For example, two people getting a storm alert warning may be correlated (since they might live in the same area impacted by the storm), but one of them getting that warning does not cause the other to get it. Correlation is important because, even if factors don't cause each other, they may both be caused by a third factor (in this case, the storm). If we understand which factors influence each other, then we can better understand how everything interacts.

Causation is when one factor does cause another, such as watering a plant causing it to grow. To prove causation, we have to run further experiments, since we can't tell that relationship from a scatter plot alone.

Correlation can also be strong or weak. A strong correlation is when the data points lie close to the regression line, while a weak correlation means they are more spread out. A positive correlation is when an increase in the independent variable goes with an increase in the dependent variable; for example, an increase in age is related to an increase in height. The opposite is called a negative correlation, such as an increase in heat being related to a decrease in ice.
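To make the ideas of positive, negative, strong, and weak correlation concrete, here is a minimal sketch that computes the correlation coefficient R by hand. It is written in plain Python rather than Quorum, it assumes the usual Pearson definition of R, and the function name pearson_r and all data values are made up for illustration (they are not taken from the Dogs dataset).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient R for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative data (hypothetical values).
age = [2, 4, 6, 8, 10]
height = [34, 40, 45, 50, 54]   # rises as age rises
heat = [0, 10, 20, 30, 40]
ice = [100, 70, 55, 20, 5]      # falls as heat rises

r_positive = pearson_r(age, height)
r_negative = pearson_r(heat, ice)
print(r_positive, r_negative)
```

The sign of R gives the direction (positive or negative), and how close R is to 1 or -1 gives the strength. Both made-up examples here are strong correlations, one in each direction.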

Scatter Plot with a Positive Correlation

In this example, we show a scatter plot of a grocery store's profits against how much it sold in each category. We can see a positive trend: the more items are sold, the higher the grocery store's profit.

Scatter Plot with a Negative Correlation

In this example, we show a scatter plot of the time trips take against overall car speed. We can see a negative trend: the faster the car goes, the shorter the trip. Please note we do advise drivers to go the speed limit!

R-Squared (15 minutes)

Related to correlation is the term R-Squared. R-Squared is how much of the change in a dependent variable the independent variable can explain. So, if we looked at something like age and height, it would be how much the changes in age can explain changes in height. While age and height are correlated, your height is also impacted by other factors, like how tall your parents are, and so not all height variation can be explained by age. However, if we had a more theoretical model where y = 2x, then y could be fully explained by what x is.

R-Squared is between 0 and 1, with 0 meaning the independent variable explains none of the dependent variable's change, and 1 meaning it explains all of it. Since real-world data usually has multiple influences on each point, it's unlikely R-Squared would actually be 1.
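A small worked example can make the 0-to-1 range concrete. The sketch below is plain Python rather than Quorum, and it assumes a straight-line relationship with a single independent variable, in which case R-Squared equals the square of the correlation coefficient R. The helper name r_squared and the noisy data values are made up for illustration.

```python
def r_squared(xs, ys):
    """R-Squared for one independent variable, computed as R * R."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Squaring R removes its sign, so the result is never negative.
    return (cov * cov) / (var_x * var_y)

x = [1, 2, 3, 4, 5]

# The theoretical model y = 2x: y is fully explained by x.
perfect = r_squared(x, [2 * value for value in x])

# Made-up noisy data (hypothetical): x explains only part of the change.
noisy = r_squared(x, [4, 1, 7, 3, 9])

print(perfect)  # 1.0
print(noisy)    # roughly 0.35
```

In the y = 2x case every point falls exactly on the line, so R-Squared is 1. In the noisy case other influences move the points away from the line, so R-Squared drops toward 0.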

It is possible to use regression lines, like the one we drew earlier, to calculate R-Squared, although we won't be getting into that today. Instead, we will look at roughly how strong R-Squared is in a few different charts.

If we look at our chart from before, we can tell there's a medium negative correlation between Maximum Weight and Maximum Life Span. This means R-Squared will be in the middling range. If we add image as a column instead of Maximum Weight, we can tell that there's not really any correlation there, and so R-Squared will be closer to 0. If we try Maximum Height vs Maximum Weight, we can see a stronger correlation, and so R-Squared here would be slightly higher than for Weight vs Life Span.

Below are two other charts: one scatter plot with a higher R-Squared value and one with a lower R-Squared value.

Scatter Plot with a Higher R-Squared

Notice these R-Squared values. R-Squared cannot be greater than 1, but these values are close to 1. The plot also shows a positive correlation.

Scatter Plot with a Lower R-Squared

In this plot, we see that the R-Squared value is very low, close to 0. The trend for this plot is a slight negative correlation. We can discuss the differences between the charts and try to analyze why they are important in Data Science.

Relevant Common Core Standards

We use the following website for common core standards in relation to scatter plots and correlation.
CCSS.MATH.CONTENT.HSS.ID.B.6: Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.
CCSS.MATH.CONTENT.HSS.ID.B.6.C: Fit a linear function for a scatter plot that suggests a linear association.
CCSS.MATH.CONTENT.HSS.ID.C.7: Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.

Next Tutorial

In the next tutorial, we will discuss Multi-Charts, Colors, and Options, which describes understanding customization of individual charts using Quorum.
