Libraries.Compute.Statistics.Clustering.ClusterByMeans Documentation

This class represents an approach to clustering data, similar to the KMeans++ algorithm. The original code was adapted from Apache Commons Math: https://commons.apache.org/proper/commons-math/download_math.cgi.

As a TODO, some optimizations and features are either not carried over from the original or probably should be added. First, there are optimizations for KMeans++ that reduce the initialization effort: http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf. These have not been included. Second, this implementation needs to be made more flexible, so that it supports competing strategies for empty clusters and alternative distance computations. Currently, only Euclidean distance is included.

This implementation requires that every value in the DataFrame be an integer or a number, or be convertible to one, and that no undefined values exist. If either requirement is not met, an error is thrown when the algorithm processes the values. It also requires the cluster count to be greater than 0 and strictly less than the number of rows.
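As a rough illustration of the seeding idea behind KMeans++ with Euclidean distance, the following Python sketch picks starting centers so that points far from the already chosen centers are more likely to be picked next. It is not the Quorum or Apache Commons Math implementation; the function names `euclidean_distance` and `choose_initial_centers` are hypothetical, and the data is the sample used in the example code on this page.

```python
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def choose_initial_centers(points, k, rng):
    """Pick k starting centers with k-means++ style seeding.

    Assumes 0 < k < len(points) and that the points are distinct,
    so at least one point always has a non-zero weight.
    """
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest
        # center chosen so far; already chosen points get weight 0.
        weights = [
            min(euclidean_distance(p, c) for c in centers) ** 2
            for p in points
        ]
        # Sample the next center with probability proportional to its weight.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10),
          (9, 18), (10, 20), (11, 22), (12, 24), (13, 26)]
centers = choose_initial_centers(points, 3, random.Random(42))
print(centers)
```

Because a point that is already a center has weight 0, it cannot be drawn again, so the three centers are always distinct rows of the data.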

Example Code

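use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Columns.IntegerColumn
use Libraries.Compute.Statistics.Clustering.ClusterByMeans
use Libraries.Compute.Statistics.Clustering.ClusterResult
use Libraries.Compute.Statistics.Clustering.Cluster
use Libraries.Containers.Array
use Libraries.Interface.Controls.Charts.ScatterPlot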

//make a DataFrame and toss some data in it
DataFrame frame
frame:LoadFromCommaSeparatedValue(
    "X,Y
    1,2
    2,4
    3,6
    4,8
    5,10
    9,18
    10,20
    11,22
    12,24
    13,26"
)
//set the range of points on which we will calculate distance
frame:AddSelectedColumnRange(0,1)

output "Calculating K-means Clustering"
ClusterByMeans means
means:SetClustersSize(3)

//The result includes a column of cluster labels, one per point, so the
//labels can be used in a chart or other analysis.
//If we want this column in the DataFrame, we need to add it manually
ClusterResult result = means:Cluster(frame)
Array<Cluster> value = result:GetClusters()
IntegerColumn assignments = result:GetClusterIndices()
assignments:SetHeader("Clusters")
frame:AddColumn(assignments) 

//we can also chart the clusters if
//we specify that the new clusters are a factor
frame:AddSelectedFactor(2)
ScatterPlot chart = frame:ScatterPlot()
chart:SetTitle("K-Means Clustering Demo")
chart:Display()

Inherits from: Libraries.Language.Object

Actions Documentation

Cluster(Libraries.Compute.Statistics.DataFrame frame)

This action clusters the DataFrame using the selected columns, without taking any factors into account. By default, 3 clusters are produced; use SetClustersSize to change this value if a different number is desired.

Parameters

  • Libraries.Compute.Statistics.DataFrame frame

Return

Libraries.Compute.Statistics.Clustering.ClusterResult: The result of the clustering, which provides the clusters (GetClusters) and the cluster index of each row (GetClusterIndices).

Example

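use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Columns.IntegerColumn
use Libraries.Compute.Statistics.Clustering.ClusterByMeans
use Libraries.Compute.Statistics.Clustering.ClusterResult
use Libraries.Compute.Statistics.Clustering.Cluster
use Libraries.Containers.Array
use Libraries.Interface.Controls.Charts.ScatterPlot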
    
//make a DataFrame and toss some data in it
DataFrame frame
frame:LoadFromCommaSeparatedValue(
    "X,Y
    1,2
    2,4
    3,6
    4,8
    5,10
    9,18
    10,20
    11,22
    12,24
    13,26"
)
//set the range of points on which we will calculate distance
frame:AddSelectedColumnRange(0,1)
    
output "Calculating K-means Clustering"
ClusterByMeans means
means:SetClustersSize(3)
    
//The result includes a column of cluster labels, one per point, so the
//labels can be used in a chart or other analysis.
//If we want this column in the DataFrame, we need to add it manually
ClusterResult result = means:Cluster(frame)
Array<Cluster> value = result:GetClusters()
IntegerColumn assignments = result:GetClusterIndices()
assignments:SetHeader("Clusters")
frame:AddColumn(assignments) 

//we can also chart the clusters if
//we specify that the new clusters are a factor
frame:AddSelectedFactor(2)
ScatterPlot chart = frame:ScatterPlot()
chart:SetTitle("K-Means Clustering Demo")
chart:Display()
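To show how a per-row cluster index like the one returned by GetClusterIndices can arise, here is a minimal Python sketch of one k-means iteration: each point is assigned to its nearest center by Euclidean distance, then each center moves to the mean of its points. This is an illustration under assumptions, not the library's code. The helper names and the starting centers are hypothetical, and keeping the old center for an empty cluster is just one simple strategy.

```python
import math


def nearest_center_index(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


def assign_and_update(points, centers):
    """Run one assignment step and one mean-update step."""
    assignments = [nearest_center_index(p, centers) for p in points]
    new_centers = []
    for i, old_center in enumerate(centers):
        members = [p for p, a in zip(points, assignments) if a == i]
        if not members:
            # Empty cluster: keep the old center (an assumed strategy).
            new_centers.append(old_center)
            continue
        dims = len(members[0])
        # The new center is the coordinate-wise mean of the members.
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
        )
    return assignments, new_centers


points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10),
          (9, 18), (10, 20), (11, 22), (12, 24), (13, 26)]
start = [(2, 4), (5, 10), (11, 22)]  # hypothetical starting centers
assignments, new_centers = assign_and_update(points, start)
print(assignments)  # [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
print(new_centers)  # [(2.0, 4.0), (4.5, 9.0), (11.0, 22.0)]
```

A full run would repeat this step until the assignments stop changing or an iteration limit is reached.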

Cluster(Libraries.Compute.Statistics.DataFrame frame, integer seed)

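This action clusters the DataFrame using the selected columns, without taking any factors into account. This overload also takes an integer seed. By default, 3 clusters are produced; use SetClustersSize to change this value if a different number is desired.

Parameters

  • Libraries.Compute.Statistics.DataFrame frame
  • integer seed

Return

Libraries.Compute.Statistics.Clustering.ClusterResult: The result of the clustering, which provides the clusters (GetClusters) and the cluster index of each row (GetClusterIndices).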

Example

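use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Columns.IntegerColumn
use Libraries.Compute.Statistics.Clustering.ClusterByMeans
use Libraries.Compute.Statistics.Clustering.ClusterResult
use Libraries.Compute.Statistics.Clustering.Cluster
use Libraries.Containers.Array
use Libraries.Interface.Controls.Charts.ScatterPlot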
    
//make a DataFrame and toss some data in it
DataFrame frame
frame:LoadFromCommaSeparatedValue(
    "X,Y
    1,2
    2,4
    3,6
    4,8
    5,10
    9,18
    10,20
    11,22
    12,24
    13,26"
)
//set the range of points on which we will calculate distance
frame:AddSelectedColumnRange(0,1)
    
output "Calculating K-means Clustering"
ClusterByMeans means
means:SetClustersSize(3)
    
//The result includes a column of cluster labels, one per point, so the
//labels can be used in a chart or other analysis.
//If we want this column in the DataFrame, we need to add it manually
ClusterResult result = means:Cluster(frame, 42)
Array<Cluster> value = result:GetClusters()
IntegerColumn assignments = result:GetClusterIndices()
assignments:SetHeader("Clusters")
frame:AddColumn(assignments) 

//we can also chart the clusters if
//we specify that the new clusters are a factor
frame:AddSelectedFactor(2)
ScatterPlot chart = frame:ScatterPlot()
chart:SetTitle("K-Means Clustering Demo")
chart:Display()
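The documentation does not describe the seed parameter. A reasonable assumption is that it initializes the random number generator used to pick starting centers, so that repeated runs with the same seed give the same clusters. The Python sketch below illustrates that idea only; `pick_starting_rows` is a hypothetical helper, not part of the Quorum library.

```python
import random


def pick_starting_rows(row_count, cluster_count, seed):
    """Hypothetical helper: choose distinct starting rows with a seeded generator."""
    rng = random.Random(seed)  # same seed gives the same sequence of choices
    return rng.sample(range(row_count), cluster_count)


first = pick_starting_rows(10, 3, 42)
second = pick_starting_rows(10, 3, 42)
print(first == second)  # True: identical seeds reproduce the same starting rows
```

If this assumption holds, passing a fixed seed such as 42 is useful for tests and tutorials, where the same cluster labels are expected on every run.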

Compare(Libraries.Language.Object object)

This action compares the hash codes of two objects and returns an integer. The result indicates whether this object's hash code is smaller than, equal to, or larger than the hash code of the object passed as a parameter: -1 means smaller, 0 means equal, and 1 means larger. This action was changed in Quorum 7 to return an integer, instead of a CompareResult object, because the previous implementation was causing efficiency issues.

Parameters

  • Libraries.Language.Object object

Return

integer: The comparison result: -1 (smaller), 0 (equal), or 1 (larger).

Example

use Libraries.Language.Object
Object o
Object t
integer result = o:Compare(t) //1 (larger), 0 (equal), or -1 (smaller)
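The three-way result described above can be sketched in a few lines of Python. This only illustrates the -1, 0, 1 convention applied to two integer hash codes; `compare_hash_codes` is a hypothetical name and not how Quorum implements Compare internally.

```python
def compare_hash_codes(this_hash, other_hash):
    """Return -1, 0, or 1 depending on how this_hash relates to other_hash."""
    if this_hash < other_hash:
        return -1  # smaller
    if this_hash > other_hash:
        return 1   # larger
    return 0       # equal


print(compare_hash_codes(5, 9))  # -1
print(compare_hash_codes(7, 7))  # 0
print(compare_hash_codes(9, 5))  # 1
```

Under this convention, an equality check based on hash codes corresponds to the case where the comparison returns 0.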

Equals(Libraries.Language.Object object)

This action determines if two objects are equal based on their hash code values.

Parameters

  • Libraries.Language.Object object

Return

boolean: True if the hash codes are equal and false if they are not equal.

Example

use Libraries.Language.Object
use Libraries.Language.Types.Text
Object o
Text t
boolean result = o:Equals(t)

GetClustersSize()

This returns the number of clusters expected when the algorithm has finished. The default is 3.

Return

integer: The number of clusters.

GetHashCode()

This action gets the hash code for an object.

Return

integer: The integer hash code of the object.

Example

use Libraries.Language.Object
Object o
integer hash = o:GetHashCode()

SetClustersSize(integer amount)

This sets the number of clusters expected when the algorithm has finished. The default is 3.

Parameters

  • integer amount
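The class description states that the cluster count must be greater than 0 and strictly less than the number of rows. A minimal Python sketch of that check follows. The function name and error messages are hypothetical, and the sketch does not show where or how the Quorum implementation reports the error.

```python
def validate_cluster_count(cluster_count, row_count):
    """Return cluster_count if 0 < cluster_count < row_count, else raise ValueError."""
    if cluster_count <= 0:
        raise ValueError("The cluster count must be greater than 0.")
    if cluster_count >= row_count:
        raise ValueError("The cluster count must be less than the number of rows.")
    return cluster_count


print(validate_cluster_count(3, 10))  # 3: valid for the 10-row sample data
```

For the 10-row sample data used on this page, any count from 1 through 9 passes this check, while 0 or 10 would be rejected.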