Libraries.Compute.Statistics.Clustering.ClusterByMeans Documentation
This class clusters data using an approach similar to the KMeans++ algorithm. The original code was adapted from Apache Commons Math: https://commons.apache.org/proper/commons-math/download_math.cgi.
TODO: Several optimizations and features are either not carried over from the original or should probably be added. First, optimizations exist for KMeans++ that reduce the initialization effort (http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf); these have not been included. Second, this implementation needs to be made more flexible to support competing strategies for handling empty clusters and alternative distance computations. Currently, only Euclidean distance is supported.
This implementation requires that every value in the DataFrame be an integer or a number, or convertible to one, and that no undefined values exist. If either requirement is violated, an error is thrown when the algorithm processes the values. The cluster count must also be greater than 0 and strictly less than the number of rows.
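As a rough illustration of the Euclidean distance mentioned above, the following sketch computes the distance between the first two rows of the example data, (1, 2) and (2, 4). It is not taken from the ClusterByMeans implementation. It assumes the SquareRoot action of Libraries.Compute.Math, and the variable names are chosen only for this example.

```quorum
use Libraries.Compute.Math

//assumption: Math provides SquareRoot(number); names below are illustrative
Math math
number x1 = 1.0
number y1 = 2.0
number x2 = 2.0
number y2 = 4.0

//difference along each selected column
number dx = x2 - x1
number dy = y2 - y1

//Euclidean distance: square root of the sum of squared differences
number distance = math:SquareRoot(dx * dx + dy * dy)
output distance
```

For these two points the squared differences are 1 and 4, so the distance is the square root of 5.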
Example Code
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Clustering.ClusterByMeans
//make a DataFrame and toss some data in it
DataFrame frame
frame:LoadFromCommaSeparatedValue(
"X,Y
1,2
2,4
3,6
4,8
5,10
9,18
10,20
11,22
12,24
13,26"
)
//set the range of points on which we will calculate distance
frame:AddSelectedColumnRange(0,1)
output "Calculating K-means Clustering"
ClusterByMeans means
means:SetClustersSize(3)
//Clusters return an additional column with labels, so they can
//be included in a chart or other approach, per point
//If we want this in the DataFrame, we need to add it manually
ClusterResult result = means:Cluster(frame)
Array<Cluster> value = result:GetClusters()
IntegerColumn assignments = result:GetClusterIndices()
assignments:SetHeader("Clusters")
frame:AddColumn(assignments)
//we can also chart the clusters if
//we specify that the new clusters are a factor
frame:AddSelectedFactor(2)
ScatterPlot chart = frame:ScatterPlot()
chart:SetTitle("K-Means Clustering Demo")
chart:Display()
Inherits from: Libraries.Language.Object
Actions Documentation
Cluster(Libraries.Compute.Statistics.DataFrame frame)
This action clusters the DataFrame using the currently selected columns, without taking any factors into account. By default, 3 clusters are created. Call SetClustersSize before clustering if a different number is desired.
Parameters
- Libraries.Compute.Statistics.DataFrame: The DataFrame we want to do our calculations on.
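Return
ClusterResult: The result of the clustering. As the example shows, the clusters can be read with GetClusters and the per-row cluster indices with GetClusterIndices.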
Example
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Clustering.ClusterByMeans
//make a DataFrame and toss some data in it
DataFrame frame
frame:LoadFromCommaSeparatedValue(
"X,Y
1,2
2,4
3,6
4,8
5,10
9,18
10,20
11,22
12,24
13,26"
)
//set the range of points on which we will calculate distance
frame:AddSelectedColumnRange(0,1)
output "Calculating K-means Clustering"
ClusterByMeans means
means:SetClustersSize(3)
//Clusters return an additional column with labels, so they can
//be included in a chart or other approach, per point
//If we want this in the DataFrame, we need to add it manually
ClusterResult result = means:Cluster(frame)
Array<Cluster> value = result:GetClusters()
IntegerColumn assignments = result:GetClusterIndices()
assignments:SetHeader("Clusters")
frame:AddColumn(assignments)
//we can also chart the clusters if
//we specify that the new clusters are a factor
frame:AddSelectedFactor(2)
ScatterPlot chart = frame:ScatterPlot()
chart:SetTitle("K-Means Clustering Demo")
chart:Display()
Cluster(Libraries.Compute.Statistics.DataFrame frame, integer seed)
This action clusters the DataFrame using the currently selected columns and the given seed, without taking any factors into account. By default, 3 clusters are created. Call SetClustersSize before clustering if a different number is desired.
Parameters
- Libraries.Compute.Statistics.DataFrame: The DataFrame we want to do our calculations on.
- integer seed: The seed to use for the clustering.
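Return
ClusterResult: The result of the clustering. As the example shows, the clusters can be read with GetClusters and the per-row cluster indices with GetClusterIndices.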
Example
use Libraries.Compute.Statistics.DataFrame
use Libraries.Compute.Statistics.Clustering.ClusterByMeans
//make a DataFrame and toss some data in it
DataFrame frame
frame:LoadFromCommaSeparatedValue(
"X,Y
1,2
2,4
3,6
4,8
5,10
9,18
10,20
11,22
12,24
13,26"
)
//set the range of points on which we will calculate distance
frame:AddSelectedColumnRange(0,1)
output "Calculating K-means Clustering"
ClusterByMeans means
means:SetClustersSize(3)
//Clusters return an additional column with labels, so they can
//be included in a chart or other approach, per point
//If we want this in the DataFrame, we need to add it manually
ClusterResult result = means:Cluster(frame, 42)
Array<Cluster> value = result:GetClusters()
IntegerColumn assignments = result:GetClusterIndices()
assignments:SetHeader("Clusters")
frame:AddColumn(assignments)
//we can also chart the clusters if
//we specify that the new clusters are a factor
frame:AddSelectedFactor(2)
ScatterPlot chart = frame:ScatterPlot()
chart:SetTitle("K-Means Clustering Demo")
chart:Display()
Compare(Libraries.Language.Object object)
This action compares the hash codes of two objects and returns an integer: -1 if this object's hash code is smaller than that of the object passed as a parameter, 0 if the two are equal, and 1 if it is larger. This action was changed in Quorum 7 to return an integer, instead of a CompareResult object, because the previous implementation was causing efficiency issues.
Parameters
- Libraries.Language.Object: The object to compare to.
Return
integer: The comparison result: -1 (smaller), 0 (equal), or 1 (larger).
Example
Object o
Object t
integer result = o:Compare(t) //1 (larger), 0 (equal), or -1 (smaller)
Equals(Libraries.Language.Object object)
This action determines if two objects are equal based on their hash code values.
Parameters
- Libraries.Language.Object: The object to be compared.
Return
boolean: True if the hash codes are equal and false if they are not equal.
Example
use Libraries.Language.Object
use Libraries.Language.Types.Text
Object o
Text t
boolean result = o:Equals(t)
GetClustersSize()
This returns the number of clusters expected when the algorithm has finished. The default is 3.
Return
integer: The number of clusters the algorithm will produce.
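Example
A minimal sketch of reading the cluster count, using only actions documented on this page. The variable names are illustrative. No value has been set here, so the documented default should be reported.

```quorum
use Libraries.Compute.Statistics.Clustering.ClusterByMeans

ClusterByMeans means
//no call to SetClustersSize yet, so this reads the default
integer count = means:GetClustersSize()
output count
```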
GetHashCode()
This action gets the hash code for an object.
Return
integer: The integer hash code of the object.
Example
Object o
integer hash = o:GetHashCode()
SetClustersSize(integer amount)
This sets the number of clusters expected when the algorithm has finished. The default is 3.
Parameters
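- integer amount: The number of clusters to produce. It must be greater than 0 and strictly less than the number of rows in the DataFrame being clustered.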
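Example
A minimal sketch of changing the cluster count before clustering, using only actions documented on this page. The value 2 is chosen for illustration. It would be valid for the 10-row DataFrame used in the examples above, since it is greater than 0 and less than the number of rows.

```quorum
use Libraries.Compute.Statistics.Clustering.ClusterByMeans

ClusterByMeans means
//request 2 clusters instead of the default of 3
means:SetClustersSize(2)
//read the value back to confirm the setting
integer count = means:GetClustersSize()
output count
```

A later call to Cluster on the same object would then use this count.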