pyemma.coordinates.pipeline

pyemma.coordinates.pipeline(stages, run=True, stride=1, chunksize=None)

Data analysis pipeline.

Constructs a data analysis Pipeline and parametrizes it (unless prevented via the run flag). If this function takes too long, consider loading the data into memory. Alternatively, if the data is too large to be loaded into memory, make use of the stride parameter.

Parameters
  • stages (data input or list of pipeline stages) – If a single pipeline stage is given, it must be a data input constructed by source(). If a list of pipeline stages is given, the first stage must be a data input constructed by source().

  • run (bool, optional, default = True) – If True, the pipeline will be parametrized immediately with the given stages. If only an input stage is given, the run flag has no effect at this time. True also means that the pipeline will be re-parametrized immediately whenever further stages are added to it. Attention: with True, this function may take a long time to compute. If False, the pipeline will be passive, i.e. it will not do any computations before you call parametrize() (see the sketch after this parameter list).

  • stride (int, optional, default = 1) – If set to 1, all input data will be used throughout the pipeline to parametrize its stages. Note that this could cause the parametrization step to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to parametrize the pipeline at a longer stride. See also stride option in the output functions of the pipeline.

  • chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value here to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.
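A minimal sketch of the passive mode described under run, assuming a random NumPy array as the data source and an illustrative k-means stage built with cluster_kmeans (not covered in this section); the stride and chunksize values are arbitrary:

>>> import numpy as np
>>> from pyemma.coordinates import source, tica, cluster_kmeans, pipeline
>>> X = np.random.random((10000, 3))
>>> reader = source(X)
>>> stages = [reader, tica(lag=10), cluster_kmeans(k=50)]
>>> pipe = pipeline(stages, run=False, stride=10, chunksize=1000)  # passive, no computation yet
>>> pipe.parametrize()  # parametrizes on every 10th frame, streaming 1000 frames at a time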

Returns

pipe – A pipeline object that is able to conduct big data analysis with limited memory in streaming mode.

Return type

Pipeline

Examples

>>> import numpy as np
>>> from pyemma.coordinates import source, tica, assign_to_centers, pipeline

Create some random data and cluster centers:

>>> data = np.random.random((1000, 3))
>>> centers = data[np.random.choice(1000, 10)]
>>> reader = source(data)

Define a TICA transformation with lag time 10:

>>> tica_obj = tica(lag=10)

Assign any input to given centers:

>>> assign = assign_to_centers(centers=centers)
>>> pipe = pipeline([reader, tica_obj, assign])
>>> pipe.parametrize()
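
Once the pipeline is parametrized, the discretization result can be read back from the assignment stage. A minimal sketch, assuming the stage exposes its result through the dtrajs attribute as pyemma clustering stages typically do:

>>> dtrajs = assign.dtrajs          # list with one discrete trajectory per input trajectory
>>> n_frames = dtrajs[0].shape[0]   # same number of frames as the input data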

class pyemma.coordinates.pipelines.Pipeline(chain, chunksize=None, param_stride=1)

Data processing pipeline.

Methods

add_element(e)

Appends a pipeline stage.

parametrize()

Reads all data and discretizes it into discrete trajectories.

set_element(index, e)

Replaces a pipeline stage.

Attributes

chunksize

add_element(e)

Appends a pipeline stage.

Appends the given element to the end of the current chain.
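
An illustrative sketch of building a pipeline incrementally, reusing the reader and tica_obj names from the example above; constructing with run=False to keep the pipeline passive is an assumption for illustration:

>>> pipe = pipeline([reader], run=False)
>>> pipe.add_element(tica_obj)
>>> pipe.parametrize()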

property chunksize

parametrize()

Reads all data and discretizes it into discrete trajectories.

set_element(index, e)

Replaces a pipeline stage.

Replaces an element in the chain and returns the replaced element.
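
An illustrative sketch of swapping a stage, continuing the example above; treating the data input as index 0 (so the assignment stage sits at index 2) is an assumption about the chain numbering:

>>> new_assign = assign_to_centers(centers=data[np.random.choice(1000, 20)])
>>> replaced = pipe.set_element(2, new_assign)  # returns the replaced stage
>>> pipe.parametrize()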