What are transformations?
In this tutorial, we are going to examine some messy real-world data, clean it up, and then shuffle it around. We often make adjustments like this in data science because the data we receive from nature, from partners, or even from our own collection efforts is not always in a format we later find convenient. For this tutorial, we are going to use the following data set: Salary Survey
We can click this link to download the CSV file.
This dataset comes from a website called AskAManager.org and collects live responses from users about their industry, salary, workplace location, and so on. We want to thank Alison, the owner of the AskAManager survey, for allowing us to use this dataset for these upcoming tutorials.
We are looking at this dataset because, like any response form converted into a dataset, not every part is filled out, which creates messy data. Many individuals do not fill out all sections, leaving missing data in columns such as 'Additional context of job title' and 'Other monetary comp.' To follow along and work on transforming this dataset, we can download the dataset here.
Here is a snippet of the AskAManager.csv file that we will be using:
|How old are you?
|Education (Higher Education)
|Research and Instruction Librarian
|Computing or Tech
|Change & Internal Communications Manager
|Accounting, Banking & Finance
Let's go in depth into why there are problems in this dataset. Note that some of the rows contain little information. The 'Other monetary comp' field is optional, so some users entered an amount, some entered 0, and others left it blank. Another problem is inconsistent naming within the 'Country' column. For example, for the United States, users have entered US, USA, United States, United States of America, etc. If this data were graphed by country, each of these spellings would appear as a separate point even though they all refer to the same location. Overall, many of the columns that ask for additional details are left blank, leaving large gaps. With these problems in mind, we might want to transform the dataset. First, we need to load it:
//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame
//Create a DataFrame (named frame here), which is essentially a table that understands
//more information about the data that is being loaded.
DataFrame frame
//This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder
frame:Load("Data/Miscellaneous/AskAManager.csv")
//The system loaded the file, but can also output it as a text value, or to the console, if we want that.
output frame:ToText()
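To make the problems described above concrete, here is a minimal sketch of the two clean-up ideas: collapsing the different spellings of a country into one name, and telling a blank 'Other monetary comp' entry apart from an explicit 0. It is written in Python using only the standard library, not in the language of this tutorial's code editor, and it is not the method the later tutorials use. The sample rows, the alias table, and the helper names (`normalize_country`, `parse_compensation`, `clean_rows`) are all hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical sample rows invented for illustration; not taken from the survey.
SAMPLE_CSV = """Country,Other monetary comp
US,0
USA,
United States,4000
United States of America,
Canada,1500
"""

# Assumed alias table: spellings that should map to one canonical name.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalize_country(raw):
    """Map known spellings of a country to a single canonical name."""
    trimmed = raw.strip()
    return COUNTRY_ALIASES.get(trimmed.lower(), trimmed)


def parse_compensation(raw):
    """Return the amount as a float, or None when the field was left blank."""
    trimmed = raw.strip()
    if trimmed == "":
        return None
    return float(trimmed)


def clean_rows(csv_text):
    """Read CSV text and return a list of (country, compensation) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        country = normalize_country(row["Country"])
        compensation = parse_compensation(row["Other monetary comp"])
        cleaned.append((country, compensation))
    return cleaned


rows = clean_rows(SAMPLE_CSV)
for country, compensation in rows:
    print(country, compensation)
```

In this sketch a blank entry becomes `None` while an entered 0 stays `0.0`, so "no answer" and "no extra compensation" remain distinguishable. Whether the two should be treated as the same thing depends on the analysis we want to run.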
Try it Yourself!
Press the blue run button to execute the code in the code editor. Press the red stop button to end the program. Your program will work when the console outputs "Build Successful!"
In the next tutorial, we will discuss removing undefined values.