Data Science

Regression

Understanding regression to determine fitted data in Quorum

Learning Objectives

Regression is basically creating a well-fitting line for data. Usually when people talk about regression, they're talking about linear regression. Linear regression is finding the best fitting straight line, so a line in the format y = mx + c. In 2D cases, we could use this line to predict y using x. However, regression can be used for as many dimensions as we want, drawing that line to allow us to predict a single factor based on multiple others.

The student will learn:

Students will be able to understand the concept of regression and what it is used for.
Students can load data and calculate a regression with variables of their choosing.
Students can interpret the output of a regression.

While no documentation is yet available online, Quorum 10 has a new experimental system for calculating regression, which we will outline in this session. All of us should note that this system is quite new and is still in testing, meaning that the way we use this library will likely change in the future. We highlight it here because regression is a common topic in high-school and this shows how to write code to accomplish it.

Load a DataFrame (10 minutes)

First, we want to make sure we load in all the libraries we're going to need for this session. We're going to be using both Dataframe and Regression, so we should load both of those in.

use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Tests.Regression

To start us off we will download the dataset, so that we can have the means to create a bar chart. You can follow the link to download ice cream sharks dataset. If you are lost on how to download the dataset from GitHub, here is a link to our tutorial, downloading CSVs for our charts.

Now, we can load it in as our DataFrame:

DataFrame frame
frame:Load('../Data/Science/GoogleTrendsIceCreamSharks.csv')

output frame:ToText()

Setup Columns for the Regression (5 minutes)

Just like with charts, we use a regression by telling it which columns are the variables we are interested in. For example, if we want the dependent variable to be the variable named 'y' and the others to be independent variables, the code would look as so:

frame:AddSelectedColumns('shark attacks')
frame:AddSelectedFactors('ice cream')
frame:AddSelectedFactors('pools')

In this case, we're seeing how much pools and ice cream relate to shark attacks.

Calculate the Regression (15 minutes)

Once we have set up the columns, we now calculate the regression information. Regression is itself a bit esoteric and is not particularly easy to understand. We start by feeding in the data to calculate a regression line (a straight line that minimizes the distance to each point):

//now calculate a Regression
Regression result = frame:RegressionOnSelected()

There's also lots of relevant data that we can get from a regression line, which you can load in like this:

//provide information in a format often used in academia
output result:GetFormalSummary()

output 'Beta: ' + result:GetCoefficients():ToText()
output 'Residuals: ' + result:GetResiduals():ToText()
output 'Residual Sum of Squared: ' + result:GetResidualSumOfSquares()
output 'Total Sum of Squared: ' + result:GetTotalSumOfSquares()
output 'F ' + result:GetCriticalValue()
output 'p = ' + result:GetProbabilityValue()
output 'R^2: ' + result:GetEffectSize()

There's obviously a lot going on here, so we're going to go through this code in the context of what it means for this data.

Interpreting the Output (20 minutes)

Beta: |0.0000158, 0.0948, -0.0457|

The first is 'Beta.' Each factor is going to have a different amount of impact on the dependent variable. For this case, we can see that the amount of searches for ice cream (second value) has a larger impact on shark searches than the amount of searches for pools (third value). The first value is an added constant to try to take into account any factors we missed, and since it's fairly low we can tell that when people aren't searching for ice cream or pools, it's unlikely they'll search for sharks.

Residuals: |2.43, 2.11, ...

Points do not usually fall exactly on our regression line. The residual is the difference between the predicted value and the actual value measured. So in this case, if we predicted that 5 ice cream searches and 10 pool searches meant there would be 15 shark searches, but there were only 7, the residual would be 8. We want residuals to be as low as possible, since lower residuals mean we predicted more accurately.

Residual Sum of Squared: 1128
Total Sum of Squared: 1546

The two lines about squaring refer to the squared distance between the points and the regression line, and the points and a straight line drawn at their average. These mainly show us how much the total residual amount (squared) has decreased when we added an actual regression line, and later we'll use this information to calculate a correlation.

F 47.7

It's important for us to know if any of the factors in our model are even related to the variable we want to predict. For this dataset, ice cream and sharks seem like they would be unrelated, so we should check the general relationship between everything, and that's where F comes in. If F is below our chosen significance level, then the factors are related.

p = .001

The overall p value tells us the probability that our factors are related beyond chance. It is a statistical measurement that is used to validate or reject a hypothesis according to the collected data. In simpler terms, p-value measures the probability of obtaining the numbers that we have collected by research. Since p-value is a probability we need to understand that it can only be in the range from 0 up to 1, or in other words from 0% to 100%. In our case, a low p-value (less than or equal 0.05) indicates that the data is statistically significant. And the greater p-value (greater 0.05) means that the data is not statistically significant. Statistical significance means that we can connect the data to a specific cause that we are testing.

R^2: 0.270

R-Squared is often termed a 'variance accounted for' measure. In a sense, it provides us some evidence of how much variance, the wobble in our data, is contributing to the predictiveness of our model. So for the shark vs ice cream dataset, it would be how much the changes in searches for ice cream and pools can explain changes in shark searches. While these are correlated, shark searches are also impacted by other factors, like the time of year.

This data also shows how correlation is not the same as causation. Although the various measurements show that Google searches for pools and ice cream are correlated to searches for sharks, why is not explained by these numbers. So ice cream and pool searches are correlated to shark searches, but we can imagine more likely explanations, like the time of year the search was made (e.g., in the summer).

Another Example (10 minutes)

Next we're going to take some time to try doing this with another dataset. First, download the Christmas Spending CSV. Next, we're going to do regression using year and Christmas spending in the US to predict Christmas spending in the UK.

If we run the same regression calculations on this dataset, we should notice some changes. Here are some questions to keep in mind as we go through this example:

Are the variables actually related? How can we tell?
How does this model do compared to the previous example?

//provide information in a format often used in academia
output result:GetFormalSummary()

output 'Beta: ' + result:GetCoefficients():ToText()
output 'Residuals: ' + result:GetResiduals():ToText()
output 'Residual Sum of Squared: ' + result:GetResidualSumOfSquares()
output 'Total Sum of Squared: ' + result:GetTotalSumOfSquares()
output 'F ' + result:GetCriticalValue()
output 'p = ' + result:GetProbabilityValue()
output 'R^2: ' + result:GetEffectSize()

We use the following website for common core standards in relation to histograms and measurements of distribution.CCSS.MATH.CONTENT.8.SP.A.1: Construct and interpret scatter plots for bivariate measurement data to investigate patterns of association between two quantities. Describe patterns such as clustering, outliers, positive or negative association, linear association, and nonlinear association.CCSS.MATH.CONTENT.8.SP.A.2: Know that straight lines are widely used to model relationships between two quantitative variables. For scatter plots that suggest a linear association, informally fit a straight line, and informally assess the model fit by judging the closeness of the data points to the line.CCSS.MATH.CONTENT.8.SP.A.3: Use the equation of a linear model to solve problems in the context of bivariate measurement data, interpreting the slope and intercept. For example, in a linear model for a biology experiment, interpret a slope of 1.5 cm/hr as meaning that an additional hour of sunlight each day is associated with an additional 1.5 cm in mature plant height.

Relevant Common Core Standards

Next Tutorial

In the next tutorial, we will discuss Exporting SVGs, which describes Learn how to export and navigate through SVGs..

Go Back Next Tutorial