pyemma.coordinates.discretizer

pyemma.coordinates.discretizer(reader, transform=None, cluster=None, run=True, stride=1, chunksize=None)

Specialized pipeline: From trajectories to clustering.

Constructs a pipeline that consists of three stages:

  1. an input stage (mandatory)

  2. a transformer stage (optional)

  3. a clustering stage (mandatory)

This function is identical to calling pipeline() with the three stages; it is only meant as guidance for the (probably) most common use case of a pipeline.
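The three stages can be pictured as a chain in which each stage consumes the output of the previous one. The following is a minimal plain-Python sketch of that idea only. It does not use PyEMMA, and the names read_frames, center_frames, assign_to_centers and toy_discretize are hypothetical stand-ins for a reader, a transform, a clustering stage and the assembled pipeline.

```python
def read_frames(trajectory):
    """Input stage: yield frames one at a time (stand-in for a reader)."""
    for frame in trajectory:
        yield frame


def center_frames(frames, mean):
    """Optional transform stage: subtract a fixed mean from every frame."""
    for frame in frames:
        yield tuple(x - m for x, m in zip(frame, mean))


def assign_to_centers(frames, centers):
    """Clustering stage: map each frame to the index of its nearest center."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centers)), key=lambda i: sq_dist(frame, centers[i]))
        for frame in frames
    ]


def toy_discretize(trajectory, centers, mean=None):
    """Chain the stages; the transform is skipped when no mean is given."""
    stage = read_frames(trajectory)
    if mean is not None:
        stage = center_frames(stage, mean)
    return assign_to_centers(stage, centers)


# Hypothetical toy data: four 2D frames and two fixed cluster centers.
trajectory = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.1), (1.9, 2.0)]
centers = [(-1.0, -1.0), (1.0, 1.0)]
dtraj = toy_discretize(trajectory, centers, mean=(1.0, 1.0))
```

Because the stages are generators, frames flow through the chain one at a time and no intermediate trajectory is held in memory. That is the property the streaming pipeline is built around.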

Parameters
  • reader (instance of pyemma.coordinates.data.reader.ChunkedReader) – The reader instance provides access to the data. If you are working with MD data, you most likely want to use a FeatureReader.

  • transform (instance of pyemma.coordinates.Transformer) – An optional transform such as PCA or TICA.

  • cluster (instance of pyemma.coordinates.AbstractClustering, optional) – A clustering algorithm that assigns the transformed data to discrete states.

  • stride (int, optional, default = 1) – If set to 1, all input data will be used throughout the pipeline to parametrize its stages. Note that this could cause the parametrization step to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to parametrize the pipeline at a longer stride. See also the stride option in the output functions of the pipeline.

  • chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.
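To make the interplay of stride and chunksize concrete, here is a small plain-Python sketch of strided, chunked iteration over frame indices. It is an illustration under stated assumptions, not PyEMMA's implementation: the helper iter_chunks is hypothetical, and it assumes that the chunk size counts frames after striding.

```python
def iter_chunks(n_frames, chunksize, stride=1):
    """Yield lists of frame indices, keeping every `stride`-th frame.

    A chunksize of 0 is treated here as "no chunking": all selected
    frames are returned in a single chunk.
    """
    selected = list(range(0, n_frames, stride))
    if chunksize == 0:
        yield selected
        return
    for start in range(0, len(selected), chunksize):
        yield selected[start:start + chunksize]


# 10 frames, every 3rd frame kept, processed 2 frames at a time.
chunks = list(iter_chunks(n_frames=10, chunksize=2, stride=3))
# The same selection with chunking disabled.
unchunked = list(iter_chunks(n_frames=10, chunksize=0, stride=3))
```

A larger stride reduces the number of frames that reach the stages at all, while chunksize only changes how many of the selected frames are handed over at once.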

Returns

pipe – A pipeline object that is able to streamline data analysis of large amounts of input data with limited memory in streaming mode.

Return type

a Pipeline object

Examples

Construct a discretizer pipeline that processes all data with a PCA transformation and clusters the principal components with regular space clustering:

>>> from pyemma.coordinates import source, pca, cluster_regspace, discretizer
>>> from pyemma.datasets import get_bpti_test_data
>>> from pyemma.util.contexts import settings
>>> reader = source(get_bpti_test_data()['trajs'], top=get_bpti_test_data()['top'])
>>> transform = pca(dim=2)
>>> cluster = cluster_regspace(dmin=0.1)

Create the discretizer and access the discrete trajectories:

>>> with settings(show_progress_bars=False):
...     disc = discretizer(reader, transform, cluster)
...     disc.dtrajs 
[array([...

This stores each discrete trajectory in its own file, named after the corresponding input trajectory with a “.dtraj” extension:

>>> from pyemma.util.files import TemporaryDirectory
>>> import os
>>> with TemporaryDirectory('dtrajs') as tmpdir:
...     disc.save_dtrajs(output_dir=tmpdir)
...     sorted(os.listdir(tmpdir))
['bpti_001-033.dtraj', 'bpti_034-066.dtraj', 'bpti_067-100.dtraj']

class pyemma.coordinates.pipelines.Pipeline(chain, chunksize=None, param_stride=1)

Data processing pipeline.

Methods

add_element(e)

Appends a pipeline stage.

parametrize()

Reads all data and discretizes it into discrete trajectories.

set_element(index, e)

Replaces a pipeline stage.

Attributes

add_element(e)

Appends a pipeline stage.

Appends the given element to the end of the current chain.

property chunksize

parametrize()

Reads all data and discretizes it into discrete trajectories.

set_element(index, e)

Replaces a pipeline stage.

Replace an element in chain and return replaced element.
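The semantics of add_element and set_element described above can be sketched with a plain list. The class ToyChain below is a hypothetical illustration of the documented behaviour (append to the end, replace and return the old element). It is not the Pipeline class itself and omits everything else a real pipeline does.

```python
class ToyChain:
    """Hypothetical stand-in that mimics only the chain bookkeeping."""

    def __init__(self, chain=None):
        self._chain = list(chain or [])

    def add_element(self, e):
        """Append the given element to the end of the current chain."""
        self._chain.append(e)

    def set_element(self, index, e):
        """Replace the element at `index` and return the replaced element."""
        replaced = self._chain[index]
        self._chain[index] = e
        return replaced


# Stage names are placeholder strings, not real PyEMMA objects.
toy = ToyChain(["reader", "pca"])
toy.add_element("regspace")
old = toy.set_element(1, "tica")
```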