pyemma.coordinates.cluster_kmeans
pyemma.coordinates.cluster_kmeans(data=None, k=100, max_iter=10, stride=1, metric='euclidean', init_strategy='kmeans++')
k-means clustering
If data is given, this function performs a k-means clustering and then assigns the data using a Voronoi discretization. It returns a KmeansClustering object that can be used to extract the discretized data sequences or to assign other data points to the same partition. If data is not given, an empty KmeansClustering object is created that still needs to be parametrized, e.g. in a pipeline().
See also
Theoretical background: Wiki page
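The Voronoi discretization mentioned above can be illustrated with a minimal sketch: every frame is mapped to the index of its nearest cluster center. This is not PyEMMA's implementation; the function name assign_voronoi and the toy data are hypothetical, and only the Euclidean metric is assumed.

```python
import math

def assign_voronoi(frames, centers):
    """Map each frame to the index of its nearest center (Euclidean distance).

    Illustrative helper only; not part of the PyEMMA API.
    """
    dtraj = []
    for x in frames:
        dists = [math.dist(x, c) for c in centers]
        dtraj.append(dists.index(min(dists)))
    return dtraj

# Toy data: two centers and four 2-D frames.
centers = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3), (1.2, 1.1)]
dtraj = assign_voronoi(frames, centers)
print(dtraj)  # [0, 1, 0, 1]
```

The resulting list of integers plays the role of a discretized data sequence: one cluster index per input frame.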
Parameters:
- data (ndarray (T, d) or list of ndarray (T_i, d) or a reader created by the source function) – input data, if available in memory
- k (int) – the number of cluster centers
- stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this can make the calculation very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. The stride option of the get_output() function of the returned object is independent, so you can parametrize at a long stride and still map all frames through the transformer.
- metric (str) – metric to use during clustering (‘euclidean’, ‘minRMSD’)
- init_strategy (str) – determines whether the initial cluster centers are chosen according to the kmeans++ algorithm or drawn uniformly at random
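How k, max_iter, stride and a kmeans++-style initialization interact can be sketched in plain Python. The code below is an assumption-laden illustration, not PyEMMA's implementation: the names kmeanspp_seeds, nearest and kmeans_sketch are hypothetical, only the Euclidean metric is handled, and the seed argument exists solely to make the sketch reproducible. Centers are estimated on every stride-th frame, while the final assignment covers all frames.

```python
import math
import random

def kmeanspp_seeds(points, k, rng):
    """kmeans++-style seeding: first center uniformly at random, each later
    center with probability proportional to the squared distance to the
    nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

def kmeans_sketch(frames, k, max_iter=10, stride=1, seed=0):
    """Hypothetical Lloyd-style k-means estimated on strided data."""
    rng = random.Random(seed)
    sample = frames[::stride]  # estimation uses every stride-th frame only
    centers = kmeanspp_seeds(sample, k, rng)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[nearest(p, centers)].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged before max_iter
            break
        centers = new_centers
    # Assignment is independent of the estimation stride: all frames are mapped.
    dtraj = [nearest(p, centers) for p in frames]
    return centers, dtraj

# Toy data: two well-separated 2-D blobs of 200 frames each.
data_rng = random.Random(42)
frames = [(data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)) for _ in range(200)]
frames += [(data_rng.gauss(5, 0.1), data_rng.gauss(5, 0.1)) for _ in range(200)]

centers, dtraj = kmeans_sketch(frames, k=2, stride=10)
print(len(centers), len(dtraj))
```

In this sketch only 40 of the 400 frames take part in the estimation, yet every frame receives a cluster index, which mirrors the separation between the estimation stride and the output stride described above.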
Returns: kmeans – Object for kmeans clustering. It holds discrete trajectories and cluster center information.
Return type: a KmeansClustering clustering object
Examples
>>> import numpy as np
>>> import pyemma.coordinates as coor
>>> traj_data = [np.random.random((100, 3)), np.random.random((100, 3))]
>>> cluster_obj = coor.cluster_kmeans(traj_data, k=20, stride=1)
>>> cluster_obj.get_output()
[array([0, 0, 1, ... ])]