Data - Lesson 3: Filtering and Cleaning Data
Overview
In this lesson, student explore the challenges of working with a messy dataset. First students learn how to identify issues using Quorum and then manually clean the data. Following this, students learn about the filtering tools and use a guided activity to answer data questions that require filtering a dataset.
Goals
Students will be able to:
- Explain why data needs to be cleaned
- Use Quorum to filter data
- Create filtered charts that answer specific questions
Purpose
The goal of this lesson is to introduce two crucial concepts when working with data: cleaning and filtering. The datasets provided by code.org are generally clean, so a 'messy dataset' is included for students to explore. After discovering why datsets need to be cleaned, students manually clean the dataset. Because we have included the dataset in Comma Separated Value format (CSV), it can be loaded in any software that reads these files, including Excel.
When working with a dataset to answer a question, the user may want to focus on a subset of the data. In this lesson, students are introduced to the filtering tools in Quorum so they can accurately filter for the information they are looking for. This teaches filtering in a real-world context while programming.
Resources
Preparation
Try filtering a DataFrame using the sample in Lesson3_App1.
Getting Started (2 minutes)
Remarks
- We've started to explore how to use charts to process data stored in a table, but there are challenges with doing this. The ability to process that data depends on the users and available tools. Today, we are going to explore ways to refine the data the we can answer even more questions. We are going to do this through programming, which will give us a taste of how this kind of data analysis is done in the real-world.
Display: Set the scene for the first activity.
- Imagine you have used a survey to collect information from students. This is aligned with the first step in the Data Analysis Process. All of that data is now stored in a table. You are excited to dig into the data and see what you can learn. Let's go!
Activity (33 mins)
Remarks
When processing data, every tool we use throughout our careers is different in how we load, filter, and chart data. In Quorum, we create charts by creating a DataFrame and then loading it. After loading, we then tell the frame which columns we are interested in charting. The app Lesson3_App1 provides sample code for loading a bar chart.
In this first example (Lesson3_App1), notice that if a value is invalid in the CSV that Quorum automatically removes the value. We should always inspect data sets in areas like data science to ensure we know exactly what each tool we are using does. Every tool makes decisions and these can vary.
Do This: Create charts in Quorum Studio
- Open Lesson3_App1 in Quorum Studio
- Open the file Unit9/Data/Other/Student Info.csv in any program that reads Comma Separated Values files. We recommend Excel.
- Change the code in Lesson3_App1 to generate charts or histograms related to the columns Average Hours of Sleep or Favorite Subject
- Explore the charts
Discuss: With partners, students discuss the following prompts:
- What problems came up when trying to create these charts?
- What problems do you see in the data?
- In code.org's original materials, they used a visualizer instead of writing code for the lesson. What are the pros and cons of this approach versus writing code for filtering and displaying data?
Datasets can bring about challenges, no matter what their size. There can be incomplete data and invalid data. You might want to combine two tables, with inconsistent data. All of this requires data to be cleaned.
Cleaning: When does data need to be cleaned?
- Data is incomplete
- Data is invalid
- Multiple tables combined into one
Messy: What leads to 'messy data?'
- Users enter in different types of data ('two', 2)
- Users use different abbreviations to represent the same information ('February', 'Feb', 'Febr')
- Data may have different spellings ('color', 'colour') or inconsistent capitalization ('spring', 'Spring')
When we clean data, the goal is to make it uniform without changing meaning. For example, the text phrase 'two' can be changed to the number 2.
Do This: With a partner:
- Look for:
- Different types of data ('two', 2)
- Different abbreviations to represent the same information ('February', 'Feb', 'Febr')
- Different spellings ('color', 'colour')
- Inconsistent capitalization ('spring', 'Spring')
- Manually update cells with messy data so they are consistent with other cells, while not changing the meaning of the data.
Do This: Once students have finished cleaning their data, they should remake the original charts and compare with others.
Your Cleaned Data Charts
Your charts should look similar to others, but it depends how you cleaned your data. Now you are able to accurately create charts for the data in your table. For this activity, we cleaned a dataset that is similar to one you might create yourself or have uploaded from another location. However, the datasets in the dataset library have already been cleaned for you.
Discuss:What if I only wanted to look at a subset of my data? How could I do this? For example: I only want to investigate dogs with low shedding.
Discussion Goal
We are building towards the idea of filtering data. Students filtered programmatically in previous units, but don't yet know how to program this easily and without manually iterating.
Remarks
The best way to look at a subset of data is to use a filter. In Unit 5, we filtered data programmatically using traversals in order to gain insight into knowledge from that data. When we program on data sets, an area sometimes called Data Science, our programmatic tools often include methods for automatically filtering without requiring us to iterate. Quorum is no exception and can filter data from DataFrame objects using the Filter action. These tools help us find specific information in the data and look for patterns.
Do This: Demonstrate how to filter data for the class.
- Open up Lesson3_App2 in Quorum Studio
- Discuss that you want to find out about the peak level of female representation in your states legislature. What will you filter by? Percentage of Females in Legislature or State?
- We will filter by state. This means the data for female state legislators who are from your home state are the only data that will be shown.
- Lesson3_App2 has the default of Texas and Percentage of Females in Legislature in the source code. Experiment with other filters and values.
- Discuss what the chart displays: the bar farthest to the right represents the year(s) when female representation in the legislature was at its highest in your states history.
When filtering, the most challenging part is deciding what value you will filter by. Think to yourself: what's the limiting factor? What do I want to make sure all cats, dogs, etc. have to be included in this subset? It's time for you to play around in Quorum Studio yourself!
Distribute: Share the Activity Guide with students. Alternatively, they can click on the link in Level 1. This Activity Guide is designed to be used digitally - do not print.
Do This: For the rest of the activity time, students work through the Activity Guide filtering two datasets and answering questions. Students need to copy/paste their charts into the Activity Guide.
Discuss: Have students share their answers from the Activity Guide and reactions to those findings.
Discussion Goal
Discussion Goal
Use this discussion to help students practice the skill covered in lesson 1, distinguishing between: 1. What the data shows, and 2. Why that might be the case.
Wrap up (10 Minutes)
Discuss: Why is 'Clean and/or Filter' an important part of the Data Analysis Process? What are situations when you would filter vs. clean your data?
Discussion Goal
Goal: Messy data can lead to confusing charts that could be misinterpreted. Data should be filtered when you want to focus on a subset of the data. Filtering allows you to return to the full dataset if you have unanswered questions.
Remarks
Great job filtering and cleaning data today! This is a useful skill that will help you create more powerful computer programs with data. There are other things that programs can to do to better prepare data for, including combining datasets, clustering data, and classifying data. Data analysis is part of a broader area of study called Data Science and we suggest investigating that more to learn how best to understand, use, and understand data.
Assessment: Check for Understanding
For Students
Open a word doc or google doc and copy/paste the following question.
Question
What makes manually cleaning data challenging?
Standards Alignment
- CSTA K-12 Computer Science Standards (2017) 3A-DA-11, 3B-DA-05, 3B-DA-06
- CSP2021: DAT-2.C.1, DAT-2.C.2, DAT-2.C.3, DAT-2.C.4, DAT-2.D.4, DAT-2.E.2, DAT-2.E.3
Next Tutorial
In the next tutorial, we will discuss Code.org Unit 9, which describes explore innovations in everyday life.