Calculating Columns (Mutating)

This tutorial demonstrates how to calculate columns (mutate) in our datasets

Calculate Columns

Another topic of interest in regard to rows is automatically calculating columns, sometimes called mutation. In many programming languages, the idea is that similarly to Filter, we can write an expression and calculate a new column based on that expression. The rules in mutation are identical to those in Filter, except that the result of each row's result becomes the value in the new column. There are two actions we can use for this: AddColumn(text, text) and CalculateColumn(text, text). Here are brief descriptions of both of these actions:

Calculate Columns Helper Functions
Class / ActionDescriptionUsage
dataFrameObject:AddColumn(text columnName, text expression)This action takes in two parameters: the name of the newly added column and an expression in which we would take values from other columns, manipulate them in some way (with our expression) and translate this into the newly added column.frame:AddColumn("new column", "x + y + z")
dataFrameObject:CalculateColumn(text columnName, text expression)This action takes in two parameters: the name of the newly added column and an expression in which we would take values from other columns, manipulate them in some way (with our expression) and translate this into the newly added column. However, this would not be added into the DataFrame.frame:CalculateColumn("new column", "a < 45")

The AddColumn action has two parameters, one for the name of the column to be added and the second for our Filter-like expression. Imagine we had a DataFrame like before:

Example DataFrames
xy
15
26
37
48

If we had an expression like AddColumn("NewColumn", "x + y") then we would then add a new column to our existing DataFrame that is the sum of these two columns. We again will be using the simple numerical dataset, random.csv which can be downloaded here. This is a snippet of our dataset being used.

Example of random.csv file
idabc
14539
257675
31837018
43222048
5101590
6291125

To highlight our process we first loaded our file, random.csv and read it into our DataFrame, 'frame.' Because we are eventually going to manipulate our 'frame' we want to save the original DataFrame so we create a new DataFrame called 'copy.' We simply call 'frame:Copy()' and save it into the 'copy' object which will give us the exact same frame.

``````//Create a DataFrame, which is essentially a table that understands
DataFrame frame
DataFrame copy

//This loads data relative to the project, so put the AskAManager.csv file in the Data/Miscellaneous folder

// save the original frame
copy = frame:Copy()
``````

Next, we call AddColumn() on the frame, and the heading title, 'new' and insert the equation we have (a + b + c) as the second parameter. Finally, what we are doing is outputting both the original frame (copy) and the newly modified frame (frame). As we can see, our 'frame' object has a new column called 'new' and is the sum of the a + b + c columns.

``frame:AddColumn("new", "a + b + c")``

We have the full code example below:

``````frame:AddColumn("new", "a + b + c")

//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Transforms.RemoveUndefinedRowsTransform

//Create a DataFrame, which is essentially a table that understands
DataFrame frame
DataFrame copy

//This loads data relative to the project, so put the random.csv file in the Data/Miscellaneous folder

// save the original frame
copy = frame:Copy()

//This creates a new column based on adding the a + b + c columns
frame:AddColumn("new", "a + b + c")

output "original frame"
output copy:ToText()

output "modified frame"
//We can save the frame or output it to the console, like we are doing here.
output frame:ToText()
``````

Like before, this calculation is actually compiling our expression in real-time, once per call to AddColumn, and as such it is slower than other areas of Quorum. Conducting the operation manually through other calls can be faster. However, for many applications, the extra speed increase is nominal and as such, with the exception of very large data sets, this operation has many possible uses. Finally, while AddColumn creates a new column, CalculateColumn runs the script and returns the column, but does not add it to the frame. Either can be used depending on the need.

Try it Yourself!

Press the blue run button to execute the code in the code editor. Press the red stop button to end the program. Your program will work when the console outputs "Build Successful!"

Next Tutorial

In the next tutorial, we will discuss filtering columns, which describes filtering columns.