Loading and Formatting our Dataset
This tutorial shows how to properly load our dataset using DataFrames.
Formatting and Loading the Dry Bean Data
Before we run an analysis on this data, let us examine it. The dry bean dataset contains measurements of seven different classes of beans: Seker, Barbunya, Bombay, Cali, Horoz, Sira, and Dermason. From these beans, researchers collected data on 12 dimensions and 4 shape forms so that systems can distinguish between the seven registered bean varieties. The attributes recorded for each bean are: Area, Perimeter, Major axis length, Minor axis length, Aspect ratio, Eccentricity, Convex area, Equivalent diameter, Extent, Solidity, Roundness, Compactness, ShapeFactor1, ShapeFactor2, ShapeFactor3, ShapeFactor4, and Class. For classification, these attributes matter because they are the measurements a system can use to tell one class of bean from another, while Class records which bean each row describes. Here is a sample of our dataset, showing five of its columns:
| Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | 
|---|---|---|---|---|
| 28395 | 610.291 | 208.1781 | 173.8887 | 1.197191 | 
| 28734 | 638.018 | 200.5248 | 182.7344 | 1.097356 | 
| 29380 | 624.11 | 212.8261 | 175.9311 | 1.209713 | 
| 30008 | 645.884 | 210.558 | 182.5165 | 1.153638 | 
Oftentimes, we need to transform our data before we can use it, and this dataset is no exception. From the UCI repository, the dry bean dataset can be downloaded as a zip file. Within the zip file, the dataset itself is saved as an XLSX file; however, Quorum only accepts datasets in CSV format. Recall that a CSV is a plain text file that contains data separated by commas. To fix this issue, we open the file in any spreadsheet program (e.g., Excel, Google Sheets) and save it again as a CSV. When saving, we keep a file name that relates to the dry beans and choose the 'CSV (comma delimited) *.csv' option. Now that we have the correct file format, we can use this dataset for analysis.
For easier access, we have the converted dataset readily available here
Now that our dataset is prepared, we are going to load it into a DataFrame. A DataFrame is the object that will hold our data. To do this, we create the DataFrame and call it 'frame.' From there, we use 'frame' like a regular variable and have access to many functions for loading, formatting, and so on. For this purpose, we will be using the Load(text filepath) and AddSelectedColumns(text heading) functions. Here is a brief explanation of what Load(text filepath), AddSelectedColumns(text heading), and the related AddSelectedColumn(integer colNum) do.
| Class / Action | Description | Usage | 
|---|---|---|
| frame:Load(text filepath) | This action takes in text giving the file path (the location in the file explorer) of the dataset we want to read in. | frame:Load("../Data/Food/FastFoodRestaurants.CSV") | 
| frame:AddSelectedColumns(text heading) | This action takes in text that exactly matches a column heading within the dataset. | frame:AddSelectedColumns("Heading 1") | 
| frame:AddSelectedColumn(integer colNum) | Alternatively, we can select a column by its position in the dataset. This action takes in an integer, which is the column number starting from 0. | frame:AddSelectedColumn(0) frame:AddSelectedColumn(1) | 
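As a minimal sketch of how these actions fit together, the following loads a file and then selects two columns, once by heading and once by position. It uses only the actions listed in the table. The file path is the one used later in this tutorial, and the choice of "Area" and of position 1 for Perimeter is an assumption based on the column order in the sample table above.

```quorum
use Libraries.Compute.Statistics.DataFrame

DataFrame frame
//Assumes DryBeans.csv has been placed in the Data/Miscellaneous folder.
frame:Load("../Data/Miscellaneous/DryBeans.csv")

//Select the Area column by its exact heading text.
frame:AddSelectedColumns("Area")

//Select the Perimeter column by position. Columns are numbered from 0,
//so 1 is the second column (assuming the order shown in the sample table).
frame:AddSelectedColumn(1)
```

Selecting by heading is easier to read, while selecting by position can be convenient when a heading is long or hard to type exactly.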
Now that we have this 'frame' object, we can call the Load function, which will locate the file and read in all data entries. To confirm that our data has been read, we can output the contents of the frame using the function call ToText(). If we want to output only a specified row (because datasets can become quite large), we can instead call ToText(integer), which prints the row given by the integer in the parameter. This prints only one row at a time rather than the entire dataset.
Luckily, this dataset has no missing data. However, as we have seen in the data science world, some datasets are messy and require preprocessing. Typically, after loading a dataset, we would want to clean it, meaning getting rid of empty rows or columns. In Quorum, this is how we would properly read in our data:
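To illustrate the single-row output, here is a small sketch that relies on ToText(integer) behaving as described above. Whether rows are numbered from 0 or from 1 is not stated in this tutorial, so the index used here is an assumption.

```quorum
use Libraries.Compute.Statistics.DataFrame

DataFrame frame
//Assumes DryBeans.csv has been placed in the Data/Miscellaneous folder.
frame:Load("../Data/Miscellaneous/DryBeans.csv")

//Print a single row instead of the whole dataset.
//Assumption: rows are numbered from 0, so this asks for the first row.
output frame:ToText(0)
```

Printing one row is a quick way to check that the file loaded correctly without filling the console with every entry.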
//We need the DataFrame class to load in files for Data Science operations.
use Libraries.Compute.Statistics.DataFrame
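//This transform is meant for removing rows with missing values.
//It is not used below because this dataset has no missing data.
use Libraries.Compute.Statistics.Transforms.RemoveUndefinedRowsTransform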
//Create a DataFrame, which is essentially a table that understands 
//more information about the data that is being loaded.
DataFrame frame
//This loads data relative to the project, so put the DryBeans.csv file in the Data/Miscellaneous folder.
frame:Load("../Data/Miscellaneous/DryBeans.csv")
//We can save the frame or output it to the console, like we are doing here.
output frame:ToText()
Try it Yourself!
Press the blue run button to execute the code in the code editor. Press the red stop button to end the program. Your program will work when the console outputs "Build Successful!"
Next Tutorial
In the next tutorial, we will discuss descriptive statistics and how to make them accessible.