pyemma.coordinates.discretizer

pyemma.coordinates.discretizer(reader, transform=None, cluster=None, run=True, stride=1, chunksize=100)

Specialized pipeline: From trajectories to clustering.

Constructs a pipeline that consists of three stages:

  1. an input stage (mandatory)
  2. a transformer stage (optional)
  3. a clustering stage (mandatory)

This function is identical to calling pipeline() with the three stages. It is meant only as guidance for what is probably the most common use of a pipeline.
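The three-stage flow can be pictured with a small pure-Python sketch. This is a toy illustration only: toy_reader and toy_discretize are hypothetical helpers, not PyEMMA functions, and the data are one-dimensional numbers rather than MD frames. It shows an input stage feeding chunks to an optional transform, whose output is assigned to the nearest cluster center.

```python
# Toy illustration only: these helpers are hypothetical, not part of PyEMMA.

def toy_reader(frames, chunksize):
    """Input stage: yield the frames in consecutive chunks."""
    for start in range(0, len(frames), chunksize):
        yield frames[start:start + chunksize]


def toy_discretize(frames, centers, transform=None, chunksize=100):
    """Run input -> (optional) transform -> clustering; return a discrete trajectory."""
    dtraj = []
    for chunk in toy_reader(frames, chunksize):
        for x in chunk:
            # Optional transformer stage: skipped when transform is None.
            y = transform(x) if transform is not None else x
            # Clustering stage: index of the nearest center (1-D data here).
            nearest = min(range(len(centers)), key=lambda i: abs(y - centers[i]))
            dtraj.append(nearest)
    return dtraj


frames = [0.1, 0.2, 0.9, 1.1, 2.05, 1.9]
centers = [0.0, 1.0, 2.0]

plain = toy_discretize(frames, centers, chunksize=4)
mirrored = toy_discretize(frames, centers, transform=lambda x: 2.0 - x, chunksize=4)
print(plain)
print(mirrored)
```

Leaving transform as None mirrors the optional second stage: the clustering stage then works directly on the reader output.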

Parameters:
  • reader (instance of pyemma.coordinates.data.reader.ChunkedReader) – The reader instance provides access to the data. If you are working with MD data, you most likely want to use a FeatureReader.
  • transform (instance of pyemma.coordinates.Transformer, optional) – a transformation such as PCA or TICA.
  • cluster (instance of pyemma.coordinates.AbstractClustering, optional) – a clustering algorithm that assigns the (transformed) data to discrete states.
  • stride (int, optional, default = 1) – If set to 1, all input data will be used throughout the pipeline to parametrize its stages. Note that this could cause the parametrization step to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to parametrize the pipeline at a longer stride. See also stride option in the output functions of the pipeline.
  • chunksize (int, optional, default = 100) – the number of data points to process as one batch.
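As a rough picture of how stride and chunksize interact, the sketch below keeps every stride-th frame and then hands the result out in batches of at most chunksize frames. The function strided_chunks is a hypothetical helper, and the order of the two operations is an assumption made for illustration; PyEMMA's readers may apply them differently internally.

```python
def strided_chunks(frames, stride=1, chunksize=100):
    """Yield batches of at most `chunksize` frames, keeping every `stride`-th frame.

    Assumption for this sketch: striding is applied before batching.
    """
    selected = frames[::stride]
    for start in range(0, len(selected), chunksize):
        yield selected[start:start + chunksize]


frames = list(range(10))

# stride=1 uses every frame; stride=3 uses only frames 0, 3, 6, 9.
all_batches = list(strided_chunks(frames, stride=1, chunksize=4))
strided_batches = list(strided_chunks(frames, stride=3, chunksize=2))
print(all_batches)
print(strided_batches)
```

A larger stride reduces the number of frames seen during parametrization, while chunksize only bounds how many frames are held in memory at once.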
Returns:

pipe – A pipeline object that is able to streamline data analysis of large amounts of input data with limited memory in streaming mode.

Return type:

a Pipeline object

Examples

Construct a discretizer pipeline that processes all data with a PCA transformation and clusters the principal components with regular space clustering:

>>> import numpy as np
>>> from pyemma.coordinates import source, pca, cluster_regspace, discretizer
>>> from pyemma.datasets import get_bpti_test_data
>>> reader = source(get_bpti_test_data()['trajs'], top=get_bpti_test_data()['top'])
>>> transform = pca(dim=2)
>>> cluster = cluster_regspace(dmin=0.1)
>>> disc = discretizer(reader, transform, cluster)

Finally, run the pipeline:

>>> disc.parametrize()

Access the discrete trajectories:

>>> disc.dtrajs 
[array([...

This saves the discrete trajectories to files, one per input trajectory, each named after its input with the extension ".dtraj":

>>> from pyemma.util.files import TemporaryDirectory
>>> import os
>>> with TemporaryDirectory('dtrajs') as tmpdir:
...     disc.save_dtrajs(output_dir=tmpdir)
...     sorted(os.listdir(tmpdir))
['bpti_001-033.dtraj', 'bpti_034-066.dtraj', 'bpti_067-100.dtraj']

class pyemma.coordinates.pipelines.Pipeline(chain, chunksize=100, param_stride=1)

Data processing pipeline.

Methods

add_element(e) Appends a pipeline stage.
parametrize() Reads all data and discretizes it into discrete trajectories.
set_element(index, e) Replaces a pipeline stage.

Attributes

chunksize

add_element(e)

Appends a pipeline stage.

Appends the given element to the end of the current chain.

chunksize

parametrize()

Reads all data and discretizes it into discrete trajectories.

set_element(index, e)

Replaces a pipeline stage.

Replace an element in chain and return replaced element.
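
The documented behavior of add_element and set_element can be mimicked with a plain list. ToyChain below is a hypothetical stand-in written for illustration, not PyEMMA's Pipeline class; a real pipeline presumably also has to reconnect each stage to its data source, which this sketch ignores.

```python
class ToyChain:
    """Hypothetical stand-in for a pipeline's list of stages (not PyEMMA's Pipeline)."""

    def __init__(self, chain=None):
        self.chain = list(chain) if chain is not None else []

    def add_element(self, e):
        """Append a stage to the end of the chain."""
        self.chain.append(e)

    def set_element(self, index, e):
        """Replace the stage at `index` and return the stage that was replaced."""
        replaced = self.chain[index]
        self.chain[index] = e
        return replaced


# Strings stand in for real stage objects.
toy = ToyChain(["reader", "pca"])
toy.add_element("regspace")
old = toy.set_element(1, "tica")
print(old)
print(toy.chain)
```

Returning the replaced element lets the caller keep the old stage, for example to swap it back in later.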