pyemma.coordinates.cluster_kmeans

pyemma.coordinates.cluster_kmeans(data=None, k=100, max_iter=10, stride=1, metric='euclidean', init_strategy='kmeans++')

k-means clustering

If data is given, this function performs k-means clustering and then assigns each data point to its nearest cluster center (a Voronoi discretization). It returns a KmeansClustering object that can be used to extract the discretized data sequences or to assign other data points to the same partition. If data is not given, an empty KmeansClustering object is created that still needs to be parametrized, e.g. in a pipeline().
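The Voronoi discretization mentioned above can be illustrated with a minimal sketch: each frame receives the index of its nearest cluster center. The sketch is written in plain Python so that it runs without PyEMMA installed. The helper name `assign_to_centers` and the toy data are hypothetical and are not part of the PyEMMA API.

```python
import math

def assign_to_centers(frames, centers):
    """Return, for each frame, the index of the nearest center (Euclidean)."""
    labels = []
    for frame in frames:
        distances = [math.dist(frame, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels

# Two hypothetical cluster centers in 2D and four frames to discretize.
centers = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3), (1.2, 1.1)]

dtraj = assign_to_centers(frames, centers)
print(dtraj)  # [0, 1, 0, 1]
```

The same function can be applied to frames that were not used to determine the centers, which corresponds to assigning other data points to the same partition.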

See also

Theoretical background: Wiki page

Parameters:
	data (ndarray (T, d) or list of ndarray (T_i, d) or a reader created by the source function) – input data, if available in memory
	k (int) – the number of cluster centers
	max_iter (int, optional, default = 10) – the maximum number of k-means iterations
	stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this can make the calculation very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. The stride option of the get_output() function of the returned object is independent of this setting, so you can parametrize at a long stride and still map all frames through the transformer.
	metric (str) – metric to use during clustering ('euclidean', 'minRMSD')
	init_strategy (str, default = 'kmeans++') – determines whether the initial cluster centers are chosen according to the kmeans++ algorithm or uniformly distributed
Returns:	kmeans – Object for kmeans clustering. It holds discrete trajectories and cluster center information.
Return type:	a `KmeansClustering` clustering object
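
The difference between the two initialization strategies named under init_strategy can be sketched as follows. This is a plain-Python illustration of the general k-means++ seeding idea (each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far), compared with a uniform draw from the data. It is an assumption that PyEMMA's internal implementation follows this textbook form; the function names `kmeanspp_seeds` and `uniform_seeds` are hypothetical.

```python
import math
import random

def kmeanspp_seeds(frames, k, seed=0):
    """Pick k initial centers from frames using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(frames)]  # first center: uniform draw from the data
    while len(centers) < k:
        # Squared distance of every frame to its nearest already-chosen center.
        weights = [min(math.dist(f, c) ** 2 for c in centers) for f in frames]
        # Draw the next center with probability proportional to that weight.
        # (Raises ValueError if all weights are zero, e.g. k exceeds the
        # number of distinct frames.)
        centers.append(rng.choices(frames, weights=weights, k=1)[0])
    return centers

def uniform_seeds(frames, k, seed=0):
    """Pick k distinct initial centers uniformly at random from frames."""
    return random.Random(seed).sample(frames, k)

# Hypothetical toy data: a dense group near the origin plus a few outliers.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]

pp_centers = kmeanspp_seeds(frames, k=3)
uni_centers = uniform_seeds(frames, k=3)
print(pp_centers)
print(uni_centers)
```

Because frames far from the existing centers receive larger weights, k-means++ seeding tends to spread the initial centers across the data, whereas a uniform draw favors densely sampled regions.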

Examples

>>> import numpy as np
>>> import pyemma.coordinates as coor
>>> traj_data = [np.random.random((100, 3)), np.random.random((100, 3))]
>>> cluster_obj = coor.cluster_kmeans(traj_data, k=20, stride=1)
>>> cluster_obj.get_output()
[array([0, 0, 1, ... ])]
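
The idea behind the stride parameter (estimate the cluster centers on a subsampled trajectory, then map every frame) can be sketched without PyEMMA. The following plain-Python illustration uses a small Lloyd-style k-means with a simple deterministic initialization. It is not PyEMMA's implementation, and the names `nearest` and `lloyd_kmeans` as well as the synthetic trajectory are hypothetical.

```python
import math
import random

def nearest(frame, centers):
    """Index of the center closest to frame (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(frame, centers[i]))

def lloyd_kmeans(frames, k, max_iter=10):
    """Tiny k-means: evenly spaced frames as initial centers, then Lloyd updates."""
    centers = [frames[(i * len(frames)) // k] for i in range(k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for frame in frames:
            groups[nearest(frame, centers)].append(frame)
        # New center = mean of its group; an empty group keeps its old center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

# Hypothetical trajectory: 500 frames around (0, 0), then 500 around (5, 5).
rng = random.Random(42)
traj = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(500)]
traj += [(rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)) for _ in range(500)]

centers = lloyd_kmeans(traj[::10], k=2)         # estimate on every 10th frame
dtraj = [nearest(frame, centers) for frame in traj]  # assign all frames

print(len(traj[::10]), len(dtraj))  # 100 1000
```

Only 100 frames enter the estimation, yet all 1000 frames receive a cluster index. This mirrors parametrizing at a long stride and then calling get_output() with its own, independent stride.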