
The following charts demonstrate how clustering speed (inchlib_clust) and rendering speed (InCHlib) depend on the data size. The test data are randomly generated integers between 0 and 1000. We used the Euclidean distance and Ward's linkage for clustering.
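A minimal sketch of this benchmark setup, assuming NumPy and fastcluster are installed. inchlib_clust uses fastcluster internally, but the helper below is our own illustration, not part of the inchlib_clust API.

```python
import numpy as np
import fastcluster

def random_dataset(rows, cols, low=0, high=1000, seed=0):
    """Random integer matrix mimicking the test data (values 0 to 1000)."""
    rng = np.random.default_rng(seed)
    return rng.integers(low, high + 1, size=(rows, cols)).astype(float)

data = random_dataset(1000, 50)
# Euclidean distance and Ward's linkage, as used in all tests below.
tree = fastcluster.linkage(data, method="ward", metric="euclidean")
```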

Testing PC configuration: Kubuntu 13.10 with the Chrome 33.0.1750.146 browser, Intel Core i5-2400 CPU @ 3.10 GHz, 8 GB RAM, and a 120 GB solid-state drive (SSD).
Clustering time vs. the number of data points

Clustering time increases quadratically with the number of data points. This corresponds to the O(N²) complexity of the Ward's linkage implementation in the fastcluster library. Memory requirements also grow significantly with the number of data points: clustering 20,000 data points requires around 2 GB of RAM.
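The quadratic growth can be verified with a simple timing loop; the sketch below is our illustration, and the exact timings will vary with the hardware and fastcluster version.

```python
import time
import numpy as np
import fastcluster

rng = np.random.default_rng(0)
for n in (1000, 2000, 4000, 8000):
    data = rng.integers(0, 1001, size=(n, 20)).astype(float)
    start = time.perf_counter()
    fastcluster.linkage(data, method="ward", metric="euclidean")
    print(f"{n:5d} points: {time.perf_counter() - start:6.2f} s")

# With O(N^2) scaling, each doubling of n should roughly
# quadruple the measured time.
```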

Clustering time vs. the number of features

Clustering time increases only modestly with the number of features; the dependence is linear. If the total number of data values is constant and the data are clustered by rows, sets with fewer rows (e.g., 50 × 1000) are processed much faster than sets with more rows (e.g., 1000 × 50). The time-limiting step of the clustering is the formation of the distance matrix: its dimension is determined by the number of rows alone, so its size does not change as the number of features grows (see the sketch below).
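A short illustration of why the row count rather than the feature count dominates; we use SciPy's pdist here purely for demonstration. The condensed distance matrix holds N*(N-1)/2 entries for N rows, regardless of the number of features per row.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
wide = rng.integers(0, 1001, size=(50, 1000)).astype(float)  # 50 rows, 1000 features
tall = rng.integers(0, 1001, size=(1000, 50)).astype(float)  # 1000 rows, 50 features

print(pdist(wide).shape)  # (1225,)   = 50*49/2 pairwise distances
print(pdist(tall).shape)  # (499500,) = 1000*999/2 pairwise distances
```

Both matrices hold 50,000 values, yet the 1000-row set produces a distance matrix roughly 400 times larger, which is why it clusters much more slowly.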

Rendering time vs. the size of the data

Rendering time depends linearly on the number of data values. It is also influenced by the PC configuration and the web browser used. For smooth display we recommend Google Chrome. Mozilla Firefox on Linux is unsuitable for large data sets due to the lower speed of its JavaScript engine.