In addition to removing or replacing undefined values, we also sometimes want to remove rows that follow a particular pattern. Consider, for example, the following DataFrame:
If we had such a frame, we might want to remove certain rows inside it. For example, we might want to remove any rows in the y column that are greater than 6. In real-world examples, while we could do this by hand, for large data sets this can get clunky and error prone. Again, if this data was in a spreadsheet, we could select the two offending columns and delete them. In this section, in order to make the operations on our data more reproducible, we are going to automate this.
Filtering is a tricky problem and this system uses an action in Quorum's DataFrame system called Filter(text expression). By default, filter takes in a special form of text, an expression related to our DataFrame, and then processes the entire file row by row. Here is a brief description of what Filter(text expression) does:
|Action / Function||Description||Usage|
|dataFrameObject:Filter(text expression)||This action takes in an expression labeled as text and uses the expression to remove rows based on that expression. This is such as only wanting to keep values pertaining to a certain range within a numerical column.||frame:Filter("x < 5")|
Let's take an example. We will be using a simple dataset to show this process of how to filter rows. Here is a snippet of this dataset (unrelated to our AskAManager.csv dataset):
We can download this sample dataset, random.csv by clicking this link.
Here we would like to take a look at column 'c' and want to filter out any rows that are less than the value 35. In order to accomplish this, we would call our Filter action within the DataFrame's class and insert 'c < 35' inside the parameter. This will go through the entire column and remove the entire row for any rows that contain a value less than 35 within the 'c' column. Here is an example of this:
//We need the DataFrame class to load in files for Data Science operations. use Libraries.Compute.Statistics.DataFrame use Libraries.Compute.Statistics.Transforms.RemoveUndefinedRowsTransform //Create a DataFrame, which is essentially a table that understands //more information about the data that is being loaded. DataFrame frame DataFrame copy //This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder frame:Load("../Data/Miscellaneous/random.csv") // save the original frame copy = frame:Copy() //This action filters any rows less than 35 in specifically the c column. frame:Filter("c < 35") output "original frame" output copy:ToText() //We can save the frame or output it to the console, like we are doing here. output "modified frame" output frame:ToText()
What this code is doing is complicated. On the surface, we are telling the DataFrame to process our expression, c < 35, and convert it into what is in reality a separate small computer program. To highlight our process we first loaded our file, random.csv and read it into our DataFrame, 'frame.' Because we are eventually going to manipulate our 'frame' we want to save the original DataFrame so we create a new DataFrame called 'copy.' We simply call 'frame:Copy()' and save it into the 'copy' object which will give us the exact same frame.
//Create a DataFrame, which is essentially a table that understands //more information about the data that is being loaded. DataFrame frame DataFrame copy //This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder frame:Load("../Data/Miscellaneous/random.csv") // save the original frame copy = frame:Copy()
Next, we call Filter() on frame and insert the equation we have (y < 35) as the parameter for Filter(). Finally, what we are doing is outputting both the original frame (copy) and the newly modified frame (frame). As we can see, our 'frame' object has been filtered by removing any row in column y with a value less than 35.
frame:Filter("c < 35")
Just like in normal Quorum programs, the syntax and semantics of these filters is the same. If we wanted to write c < 35 + 1, or many other kinds of expressions, we could, although we point out that the variety of programs we can write in the filter is intentionally limited. Thus, while the code we write matches normal code in the language, not all possible expressions are allowed.
One last point is that in this expression, the value c represents a column, yet the program is filtering the entire DataFrame. This is because this expression is tested on the c column, re-evaluated for every row. If a row evaluates to true, then the row is retained. If the row calculation results in false, then the row is removed.
Run the Example
Filtering rows for data transformations
To view the program we have made, we can download the program file
In the next tutorial, we will discuss calculating columns (mutate), which describes calculating columns (mutate).