Replacing Undefined Values
In addition to removing undefined values, sometimes we want to replace them with an alternative. For example, if we had a DataFrame with integers, and a few were undefined, then we might want to replace them with zero. While we should always use caution when replacing data in a data set, as a practical matter, sometimes replacing undefined values is desirable in the field of data science.
To do this, there are two actions we can use, ReplaceUndefined(text) and ReplaceUndefinedFromSelectedColumns(text). They both do essentially the same operation, which is to ask the column to replace any undefined values with a value represented via the text.
|Class / Action||Description||Usage|
|dataFrameObject:ReplaceUndefined(text replacedElement)||This action will replace any empty element within the dataset with the text value passed across the whole DataFrame.||frame:ReplaceUndefined("0")|
|dataFrameObject:ReplaceUndefinedSelectedColumns(text replacedElement)||This action will replace any empty element within the dataset with the text value passed given a specified numerical or non-numerical column.||frame:ReplaceUndefinedSelectedColumns("0")|
For example, suppose we have an integer column and we send it the text value of 'Dog.' In such a case, this is not a valid integer. With invalid values, any column that the text value is not related to automatically ignores the request. If, however, the value is valid for a particular column, like it would be in a text column, then the undefined values are all replaced. For the actions, the difference between them is that one does the replacement across the entire DataFrame, while the other does it for selected columns only.
The reason why we want to replace bad data values is to increase the usability of a dataset. Typically, in data science, columns with values that are all of the same type and filed in are typically more reliable. Why this is important is because when we need to summarize our data (whether it be through a chart or using some form of descriptive statistics), we would not want the risk of any corrupted data (data that is unable to be calculated by any actions we use to manipulate our DataFrame) or in a case where major outliers are present.
In this example, we are going to take a look at the 'Other monetary comp' column within the AskAManager.csv. Here is a snippet of this column from our dataset.
|Other monetary comp|
As we can see here within the first few rows, some data entries are missing, and we are going to clean that up. We will be calling the action ReplaceUndefinedSelectedColumns(text) and replace any empty boxes with the number '0' so we have a more complete dataset.
We would first want to read in our file, AskAManager.csv using the Load(text file) action and pass in the file location of our CSV file. In this case, the file is located within the main 'Data' folder in a subfolder called 'Miscellaneous.'
DataFrame frame frame:Load("../Data/Miscellaneous/AskAManager.csv")
Next, we want to select the column number of the column we want to grab; note that columns start off from 0 which means the 'Other monetary comp' would be column 6. Finally we would want to call our action, ReplaceUndefinedFromSelectedColumns(text replacedElement) and pass the value we want to replace any undefined values as. This means any blanks in our dataset would be replaced with the number 0 in its place for that column.
Below is the full code example:
use Libraries.Compute.Statistics.DataFrame DataFrame frame frame:Load("../Data/Miscellaneous/AskAManager.csv") //Add the first column in the set to the list of selected columns frame:AddSelectedColumn(6) //perhaps this replaces an undefined in an IntegerColumn or a TextColumn frame:ReplaceUndefinedFromSelectedColumns("0") output frame:ToText()
Run the example
Replace undefined values for data transformations
To view the program we have made, we can download the program file
In the next tutorial, we will discuss filtering rows, which describes filtering rows.