In the real world, the datasets we find often need to be transformed or cleaned before we can use them. Datasets may contain values that are incorrectly formatted or missing. Columns may need to be added, deleted, combined, or split up. We will be referencing this overview on why we need to filter data. Let us look specifically at the dataset and observe the key sections that make it a messy dataset.
Finally, let us remind ourselves what it means to have Tidy data. The biggest takeaway is that Quorum uses the Tidy data format as a standard for how datasets should be structured. Keep in mind that there are good reasons in computer science not to keep data in a Tidy format, especially when large databases are involved, but this format is still very useful for many real-world datasets.
- Students will be able to learn why clean data is important.
- Students will be able to 'filter' data by both rows and columns.
- Students will be able to remove and replace undefined values in a dataset.
Filtering by Rows and Columns (25 Minutes)
Filtering rows and columns helps data scientists remove unnecessary information from a dataset. As an example, let us say we have a dataset with over 10,000 entries of household income data. However, we only want to focus on entries with incomes between $50,000 and $70,000, which makes many of the other cells unrelated to our analysis. By filtering, we can remove all the rows for entries outside our desired range.
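To make the income example concrete, here is a minimal sketch of row filtering written in plain Python rather than Quorum. It is only an illustration of the idea: the `households` records, the column name `income`, and the helper `filter_rows` are all hypothetical, and the Quorum tutorial referenced in this lesson uses its own DataFrame tools instead.

```python
# Hypothetical household-income records, made up for illustration only.
households = [
    {"household": "A", "income": 42000},
    {"household": "B", "income": 55000},
    {"household": "C", "income": 70000},
    {"household": "D", "income": 98000},
]


def filter_rows(rows, column, low, high):
    """Keep only the rows whose value in `column` is between low and high, inclusive."""
    return [row for row in rows if low <= row[column] <= high]


# Keep the entries with incomes from $50,000 to $70,000.
in_range = filter_rows(households, "income", 50000, 70000)

for row in in_range:
    print(row["household"], row["income"])
```

Notice that the filter returns a new list and leaves `households` untouched, so the original data is still available if we want to try a different range.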
We will start by referencing this tutorial on how to filter by rows and discuss the changes made to our DataFrame. Then, we will immediately follow another tutorial on how to filter by columns. Notice the different methods used to filter rows and columns; while filtering columns may seem more complicated than filtering rows, both accomplish the same goal of transforming our DataFrame.
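Filtering by columns can be sketched in the same spirit. The example below is plain Python, not Quorum, and is meant only to show the idea of keeping some columns and dropping the rest. The sample rows, the column names, and the helper `filter_columns` are assumptions made for this illustration.

```python
# Hypothetical survey rows, made up for illustration only.
survey = [
    {"age": "25-34", "industry": "Computing or Tech", "salary": 54600},
    {"age": "25-34", "industry": "Accounting, Banking & Finance", "salary": 50000},
]


def filter_columns(rows, keep):
    """Return new rows that contain only the columns named in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


# Keep only the two columns we need for our analysis.
trimmed = filter_columns(survey, ["industry", "salary"])

for row in trimmed:
    print(row)
```

Row filtering decides which entries stay, while column filtering decides which pieces of information about each entry stay. In both cases the result is a smaller dataset focused on our question.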
Replacing and Removing Undefined Values (25 Minutes)
Another way to clean up our datasets is to remove or replace undefined values. Undefined values are cells that are blank. Below is a portion of our previously used AskAManager dataset, which contains undefined values. We can take time to go over more of the missing values in this dataset, but overall, the idea here is to demonstrate that we can adjust these values in our data automatically if we want to.
| How old are you? | What industry do you work in? | Job title | If your job title needs additional context, please clarify here: | What is your annual salary? | How much additional monetary compensation do you get, if any? |
|---|---|---|---|---|---|
| 25-34 | Education (Higher Education) | Research and Instruction Librarian | | 55,000 | 0 |
| 25-34 | Computing or Tech | Change & Internal Communications Manager | | 54,600 | 4000 |
| 25-34 | Accounting, Banking & Finance | Marketing Specialist | | 34,000 | |
| 25-34 | Accounting, Banking & Finance | Accounting Manager | | 50,000 | 7000 |
To modify our undefined values, we will be referencing this tutorial on how to replace undefined values and remove undefined values. There are many other techniques available for cleaning data.
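As a rough illustration of the two options, the sketch below replaces and removes undefined values using plain Python rather than Quorum. The rows come from the AskAManager sample in this lesson, but the short column names, the use of `None` to stand for a blank cell, and the helpers `replace_undefined` and `remove_undefined` are assumptions made for this example.

```python
# Rows from the AskAManager sample; None marks a blank (undefined) cell.
rows = [
    {"job_title": "Research and Instruction Librarian", "salary": 55000, "additional": 0},
    {"job_title": "Change & Internal Communications Manager", "salary": 54600, "additional": 4000},
    {"job_title": "Marketing Specialist", "salary": 34000, "additional": None},
    {"job_title": "Accounting Manager", "salary": 50000, "additional": 7000},
]


def replace_undefined(rows, column, replacement):
    """Return new rows where blank cells in `column` are filled with `replacement`."""
    return [
        {**row, column: replacement if row[column] is None else row[column]}
        for row in rows
    ]


def remove_undefined(rows, column):
    """Return only the rows that have a value in `column`."""
    return [row for row in rows if row[column] is not None]


# Option 1: treat a blank additional compensation as 0.
replaced = replace_undefined(rows, "additional", 0)

# Option 2: drop any row with a blank additional compensation.
removed = remove_undefined(rows, "additional")

print(len(replaced), len(removed))
```

Which option is better depends on the question being asked. Replacing a blank with 0 keeps every row but assumes the person receives no additional compensation, while removing the row avoids that assumption at the cost of losing data.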
In the next tutorial, we will discuss the Column Calculations Activity, which covers how to filter, split, and create columns.