pyemma.coordinates.cluster_kmeans

pyemma.coordinates.cluster_kmeans(data=None, k=None, max_iter=10, tolerance=1e-05, stride=1, metric='euclidean', init_strategy='kmeans++', fixed_seed=False, n_jobs=None, chunksize=None, skip=0, keep_data=False, clustercenters=None, **kwargs)

k-means clustering

If data is given, it performs a k-means clustering and then assigns the data using a Voronoi discretization. It returns a KmeansClustering object that can be used to extract the discretized data sequences, or to assign other data points to the same partition. If data is not given, an empty KmeansClustering will be created that still needs to be parametrized, e.g. in a pipeline().
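
The two stages described above (k-means estimation, then Voronoi assignment) can be pictured with a small NumPy sketch. This is not PyEMMA's implementation; the helper names `voronoi_assign` and `lloyd_kmeans` are hypothetical, and the sketch assumes the Euclidean metric, in-memory data, and a fixed number of plain Lloyd iterations.

```python
import numpy as np

def voronoi_assign(data, centers):
    """Return, for every row of data, the index of its nearest center."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(data, centers, max_iter=10):
    """Hypothetical helper: plain Lloyd iterations, returns (centers, labels)."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iter):
        labels = voronoi_assign(data, centers)
        for i in range(len(centers)):
            members = data[labels == i]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[i] = members.mean(axis=0)
    # final discretization against the last set of centers
    return centers, voronoi_assign(data, centers)

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                  rng.normal(5.0, 0.1, (50, 2))])
centers, labels = lloyd_kmeans(data, centers=data[[0, 50]])
```

Here `labels` plays the role of a discretized trajectory, and `voronoi_assign` can be reused to map other data points onto the same partition.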

Parameters
  • data (ndarray (T, d) or list of ndarray (T_i, d) or a reader created by source()) – input data, if available in memory

  • k (int) – the number of cluster centers. When not specified (None), min(sqrt(N), 5000) is chosen as default value, where N denotes the number of data points

  • max_iter (int) – maximum number of iterations before stopping (default 10).

  • tolerance (float) –

    stop iteration when the relative change in the cost function

    \(C(S) = \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2\)

    is smaller than tolerance.

  • stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this could cause this calculation to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. Note that the stride option in the get_output() function of the returned object is independent, so you can parametrize at a long stride, and still map all frames through the transformer.

  • metric (str) – metric to use during clustering (‘euclidean’, ‘minRMSD’)

  • init_strategy (str) – determines whether the initial cluster centers are chosen according to the k-means++ algorithm (‘kmeans++’, the default) or drawn uniformly from the provided data set (‘uniform’)

  • fixed_seed (bool or (positive) integer) – if True, the random seed is fixed, resulting in deterministic behavior; the default is False. If an integer >= 0 is given, it is used to initialize the random generator.

  • n_jobs (int or None, default None) – Number of threads to use during assignment of the data. If None, all available CPUs will be used.

  • chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.

  • skip (int, default=0) – number of initial frames to skip in each trajectory.

  • keep_data (boolean, default=False) – if you intend to quickly resume a non-converged k-means iteration, set this to True. Otherwise the linear memory array will have to be re-created. Note that the data will also be deleted if, and only if, the estimation converged within the given tolerance.

  • clustercenters (ndarray (k, dim), default=None) – if passed, the init_strategy is ignored and these centers will be iterated.
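
The default for k and the tolerance criterion in the list above can be written out as a short sketch. The function names are hypothetical and the details are assumptions: the text does not say how sqrt(N) is rounded (truncation is used here), nor which cost the relative change is normalized by (the previous cost is used here).

```python
import math
import numpy as np

def default_k(n_points):
    # Heuristic quoted for k=None: min(sqrt(N), 5000); truncation is an assumption
    return int(min(math.sqrt(n_points), 5000))

def cost(data, centers, labels):
    # C(S): sum over all points of the squared distance to the assigned center
    return float(np.sum((data - centers[labels]) ** 2))

def has_converged(prev_cost, new_cost, tolerance=1e-5):
    # Assumed definition of "relative change": |C_prev - C_new| / C_prev
    if prev_cost == 0.0:
        return True
    return abs(prev_cost - new_cost) / prev_cost < tolerance

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
c = cost(data, centers, labels)  # each point is at distance 1 from its center
```

With these four points the cost is 4.0, and `default_k(10000)` gives 100, while very large data sets are capped at 5000 centers.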

Returns

kmeans – Object for kmeans clustering. It holds discrete trajectories and cluster center information.

Return type

a KmeansClustering clustering object

Examples

>>> import numpy as np
>>> from pyemma.util.contexts import settings
>>> import pyemma.coordinates as coor
>>> traj_data = [np.random.random((100, 3)), np.random.random((100,3))]
>>> with settings(show_progress_bars=False):
...     cluster_obj = coor.cluster_kmeans(traj_data, k=20, stride=1)
...     cluster_obj.get_output() 
[array([...
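
The example above uses stride=1, so every frame enters the estimation. As a rough picture of what the stride and skip parameters select, the following sketch treats them as plain slicing on each trajectory. The helper `subsample` is hypothetical, and the assumption that skip is applied before stride is an illustration rather than a statement about PyEMMA's internals.

```python
import numpy as np

def subsample(trajs, stride=1, skip=0):
    # Drop the first `skip` frames, then keep every `stride`-th frame
    return [traj[skip::stride] for traj in trajs]

# Two toy trajectories whose single feature equals the frame index
trajs = [np.arange(10).reshape(10, 1), np.arange(7).reshape(7, 1)]
sub = subsample(trajs, stride=3, skip=2)
```

Under this reading, frames 2, 5 and 8 of the first trajectory and frames 2 and 5 of the second would be used for estimation.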

See also

Theoretical background: Wiki page

class pyemma.coordinates.clustering.kmeans.KmeansClustering(*args, **kwargs)

k-means clustering

Methods

assign([X, stride])

Assigns the given trajectory or list of trajectories to cluster centers by using the discretization defined by this clustering method (usually a Voronoi tessellation).

describe()

Get a descriptive string representation of this class.

dimension()

output dimension of clustering algorithm (always 1).

estimate(X, **kwargs)

Estimates the model given the data X

fit(X[, y])

Estimates parameters - for compatibility with sklearn.

fit_predict(X[, y])

Performs clustering on X and returns cluster labels.

fit_transform(X[, y])

Fit to data, then transform it.

get_model_params([deep])

Get parameters for this model.

get_output([dimensions, stride, skip, chunk])

Maps all input data of this transformer and returns it as an array or list of arrays

get_params([deep])

Get parameters for this estimator.

iterator([stride, lag, chunk, …])

creates an iterator to stream over the (transformed) data.

load(file_name[, model_name])

Loads a previously saved PyEMMA object from disk.

n_chunks(chunksize[, stride, skip])

how many chunks an iterator of this source will output, starting (eg.

n_frames_total([stride, skip])

Returns total number of frames.

number_of_trajectories([stride])

Returns the number of trajectories.

output_type()

By default transformers return single precision floats.

sample_indexes_by_cluster(clusters, nsample)

Samples trajectory/time indexes according to the given sequence of states.

save(file_name[, model_name, overwrite, …])

saves the current state of this object to given file and name.

save_dtrajs([trajfiles, prefix, output_dir, …])

saves calculated discrete trajectories.

set_model_params(clustercenters)

set_params(**params)

Set the parameters of this estimator.

trajectory_length(itraj[, stride, skip])

Returns the length of trajectory of the requested index.

trajectory_lengths([stride, skip])

Returns the length of each trajectory.

transform(X)

Maps the input data through the transformer to correspondingly shaped output data array/list.

update_model_params(**params)

Updates the given model parameters if they are set to specific values

write_to_csv([filename, extension, …])

write all data to csv with numpy.savetxt

write_to_hdf5(filename[, group, …])

writes all data of this Iterable to a given HDF5 file.
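
To make the idea behind sample_indexes_by_cluster in the method list above more concrete, here is a NumPy sketch of sampling (trajectory index, frame index) pairs for a set of clusters. It is an illustration only: the function `sample_indexes`, its `seed` argument, sampling with replacement, and the (nsample, 2) return layout are assumptions, not the documented behavior of the PyEMMA method.

```python
import numpy as np

def sample_indexes(dtrajs, clusters, nsample, seed=0):
    """Hypothetical helper: one (nsample, 2) index array per requested cluster."""
    rng = np.random.default_rng(seed)
    samples = []
    for c in clusters:
        # every (trajectory index, frame index) pair assigned to cluster c
        pairs = np.array(
            [(i, t) for i, dtraj in enumerate(dtrajs)
             for t in np.flatnonzero(dtraj == c)],
            dtype=int,
        ).reshape(-1, 2)
        if len(pairs) == 0:
            samples.append(pairs)  # empty cluster: nothing to sample
            continue
        picks = rng.integers(0, len(pairs), size=nsample)  # with replacement
        samples.append(pairs[picks])
    return samples

dtrajs = [np.array([0, 1, 1, 0]), np.array([1, 2, 2])]
samples = sample_indexes(dtrajs, clusters=[1, 2], nsample=5)
```

Each row of `samples[0]` points at a frame that was assigned to cluster 1, which is the kind of index list one would use to pull representative frames back out of the original trajectories.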

Attributes

property converged

property fixed_seed

seed for random choice of initial cluster centers. Fix this to get reproducible results.

property init_strategy

Strategy to get an initial guess for the centers.
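
The interplay of init_strategy and fixed_seed can be illustrated with a generic k-means++ seeding routine. This is a textbook sketch, not PyEMMA's code; `kmeans_pp_init`, `uniform_init`, and the `seed` argument are hypothetical names, and the data points are assumed to be distinct.

```python
import numpy as np

def kmeans_pp_init(data, k, seed=None):
    """Textbook k-means++ seeding; a fixed seed makes the choice reproducible."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]  # first center: uniform draw
    for _ in range(k - 1):
        chosen = np.array(centers)
        # squared distance of every point to its nearest already-chosen center
        d2 = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # draw the next center with probability proportional to d2
        # (assumes d2.sum() > 0, i.e. the data are not all identical)
        centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centers)

def uniform_init(data, k, seed=None):
    """Uniform alternative: k distinct data points drawn at random."""
    rng = np.random.default_rng(seed)
    return data[rng.choice(len(data), size=k, replace=False)]

data = np.random.default_rng(1).random((200, 3))
a = kmeans_pp_init(data, 5, seed=42)
b = kmeans_pp_init(data, 5, seed=42)
```

Calling the routine twice with the same seed yields identical initial centers, which is the kind of reproducibility a fixed seed is meant to provide.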

References

The k-means algorithm was introduced in [1]. The term k-means was first used in [2].

[1] Steinhaus, H. (1957). Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. (in French) 4, 801-804.

[2] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 1. University of California Press. pp. 281-297.