Overview of transformations

In this tutorial, we are going to examine some messy real-world data, clean it up, and then shuffle it around. We often make adjustments like this in data science because the data we receive from nature, from partners, or even from our own collection efforts is not always in a format we later find convenient. For this tutorial, we are going to use the following dataset:

Salary Survey

We can click this link to download the CSV file.

This dataset comes from a website called AskAManager.org, which collects live responses from users about the industry they work in, their salary, their workplace location, and so on. We want to thank Alison, the owner of the AskAManager survey, for allowing us to use this dataset in these upcoming tutorials.

We are looking at this dataset because, like any response form converted into a dataset, not every part of it was filled out, which makes the data messy. Many individuals skipped sections, leaving missing data in columns such as 'Additional context of job title' and 'Other monetary comp.' To follow along and work on transforming this dataset, we can download the dataset here.

Here is a snippet of the AskAManager.csv file that we will be using:

Timestamp       | How old are you? | Industry                     | Job title                                | Annual salary
4/27/2021 11:02 | 25-34            | Education (Higher Education) | Research and Instruction Librarian       | 55,000
4/27/2021 11:02 | 25-34            | Computing or Tech            | Change & Internal Communications Manager | 54,600
4/27/2021 11:02 | 25-34            | Accounting, Banking & Finance | Marketing Specialist                    | 34,000
4/27/2021 11:02 | 25-34            | Nonprofits                   | Program Manager                          | 62,000

Let's go in depth into why there are problems in this dataset. Note that some of the rows contain little information. The 'Other monetary comp' column is optional, so some users entered a value, some entered 0, and some left it blank. Another problem is inconsistent naming within the 'Country' column. For example, for the United States, users have entered US, USA, United States, United States of America, etc. If this data were graphed by country, each of these spellings would appear as a separate point even though they all refer to the same location. Overall, many of the columns that ask for additional details are left blank, leaving large gaps. With these problems in mind, and some sense of why we might want to transform the data, we can start by loading the file into a DataFrame and outputting it:

//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame

//Create a DataFrame, which is essentially a table that understands
//more information about the data that is being loaded.
DataFrame frame

//This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder
frame:Load("Data/Miscellaneous/AskAManager.csv")

//The system loaded the file, but can also output it as a text value, or to the console, if we want that.
output frame:ToText()
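
As a preview of the kind of clean-up the 'Country' column calls for, here is a minimal sketch of mapping several spellings of the United States onto one label. It is written as plain Quorum and does not use the DataFrame transformations covered in the upcoming tutorials. The action name NormalizeCountry and the label "United States" are hypothetical choices made for illustration, and the sketch assumes that text values provide Trim and ToLowerCase.

```quorum
class Main
    action Main
        //Try the helper on a few of the spellings mentioned above.
        output NormalizeCountry("USA")
        output NormalizeCountry(" united states of america ")
        output NormalizeCountry("Canada")
    end

    //A hypothetical helper, not part of the Quorum standard library, that
    //maps several spellings of the United States onto a single label.
    action NormalizeCountry(text raw) returns text
        //Ignore surrounding spaces and letter case when comparing.
        text cleaned = raw:Trim()
        cleaned = cleaned:ToLowerCase()

        if cleaned = "us" or cleaned = "usa" or cleaned = "united states" or cleaned = "united states of america"
            return "United States"
        end

        //Leave any spelling we do not recognize unchanged.
        return raw
    end
end
```

Comparing a trimmed, lowercased copy keeps the list of known spellings short, while returning the original value for anything unrecognized avoids silently changing entries for other countries.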


To view the program that loads and outputs the file, we can download the program file. We will use other datasets to highlight other features of transformations, and those datasets will be available in the upcoming sections.

Next Tutorial

In the next tutorial, we will discuss removing undefined values.