Removing Undefined Values
When cleaning a data set, it is tempting to open Excel, or some other spreadsheet program, and adjust the file by hand. For many kinds of data, this is both normal and acceptable. For scientific data, however, we often want a record of exactly what we did. Otherwise, other scientists cannot reproduce our steps.
As such, let us first write a computer program that loads our data, then removes any rows that have undefined values in them. This is not always the operation we want. Removing rows is acceptable only if it makes sense given the research questions we are trying to answer. For example, if we are trying to answer questions about a particular country in our data set, but we remove that row, then this would not make sense.
In other cases, removing rows is fine. Consider census data, where millions of people enter their information. Removing the rows where a person left certain information blank could distort conclusions, but with millions of people in the data, the removal might be acceptable. Again, it depends on what questions we are trying to answer and whether our assumptions are fair.
In our case, many individuals have not filled out headings such as 'Other monetary comp' or 'Currency - other', so these headings may be less meaningful to our data set as a whole, and it may be okay to remove them. By removing such headings, our data is less cluttered with columns that have empty entries.
To accomplish this, we use the RemoveUndefinedRowsTransform class, which does the work for us. This class first makes a copy of our DataFrame, so it does not destroy anything, and then returns the copy. We use it by passing the transform to the DataFrame's Transform action. The table below briefly describes how to use these for our data set.
|Class / Action|Description|Usage|
|---|---|---|
|RemoveUndefinedRowsTransform transform|This declares the transform object, which finds any rows that contain undefined (blank) values and removes them. Declaring it only creates the object; nothing is changed until it is passed to a DataFrame.|RemoveUndefinedRowsTransform transform|
|DataFrame:Transform(RemoveUndefinedRowsTransform transform)|This Transform action belongs to the DataFrame class. It takes a RemoveUndefinedRowsTransform object and returns a new DataFrame with the rows that contain undefined values removed.|DataFrame clean = frame:Transform(transform)|
First, we load our data set using a DataFrame object. In this case, we create a DataFrame and call it 'frame'. Then we load our AskAManager.csv file using the Load(text fileLocation) action of the DataFrame class.
//Create a DataFrame, which is essentially a table that understands
//more information about the data that is being loaded.
DataFrame frame

//This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder
frame:Load("../Data/Miscellaneous/AskAManager.csv")
Next, we create a transformation object of type RemoveUndefinedRowsTransform and call it 'transform'. We also create another DataFrame to hold the modified version of 'frame'. This new DataFrame is called 'clean', and we fill it by calling the Transform action on 'frame' with our RemoveUndefinedRowsTransform object as the parameter.
RemoveUndefinedRowsTransform transform
DataFrame clean = frame:Transform(transform)
Below is an example of the full code:
//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Transforms.RemoveUndefinedRowsTransform

//Create a DataFrame, which is essentially a table that understands
//more information about the data that is being loaded.
DataFrame frame

//This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder
frame:Load("../Data/Miscellaneous/AskAManager.csv")

//This class transforms data by removing any rows that contain undefined
//values. It is not always what we want, but can be useful.
RemoveUndefinedRowsTransform transform
DataFrame clean = frame:Transform(transform)

//We can save the frame or output it to the console, like we are doing here.
output clean:ToText()
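The description earlier says the transform works on a copy and leaves the original DataFrame alone. One way to check this is sketched below. It uses only the actions already shown in this tutorial and assumes ToText() behaves on 'frame' as it does on 'clean', since both are DataFrame objects. The "Original:" and "Cleaned:" labels are illustrative only.

```quorum
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Transforms.RemoveUndefinedRowsTransform

//Load the same file used in the full example.
DataFrame frame
frame:Load("../Data/Miscellaneous/AskAManager.csv")

//Transform returns a new DataFrame; 'frame' is expected to keep all of its rows.
RemoveUndefinedRowsTransform transform
DataFrame clean = frame:Transform(transform)

//Output both frames so the original and the cleaned copy can be compared.
output "Original:"
output frame:ToText()
output "Cleaned:"
output clean:ToText()
```

If the copy behaves as described, the first listing should still include the rows with blank entries, and the second should not.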
Run the Example
To view the program we have made, we can download the program file.
In the next tutorial, we will discuss how to replace undefined values.