pyemma.coordinates.pipeline

pyemma.coordinates.pipeline(stages, run=True, stride=1, chunksize=100)

Data analysis pipeline.

Constructs a data analysis Pipeline and parametrizes it (unless prevented by run=False). If this function takes too long, consider loading the data into memory. Alternatively, if the data is too large to be loaded into memory, make use of the stride parameter.

Parameters:
  • stages (data input or list of pipeline stages) – If a single pipeline stage is given, it must be a data input constructed by source(). If a list of pipeline stages is given, the first stage must be a data input constructed by source().
  • run (bool, optional, default = True) – If True, the pipeline will be parametrized immediately with the given stages. If only an input stage is given, the run flag has no effect at this time. True also means that the pipeline will be immediately re-parametrized when further stages are added to it. Attention: True means this function may take a long time to compute. If False, the pipeline will be passive, i.e. it will not do any computations before you call parametrize().
  • stride (int, optional, default = 1) – If set to 1, all input data will be used throughout the pipeline to parametrize its stages. Note that this could cause the parametrization step to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to parametrize the pipeline at a longer stride. See also the stride option in the output functions of the pipeline.
  • chunksize (int, optional, default = 100) – Number of data points to process as a batch at each step.
Returns:

pipe – A pipeline object that is able to conduct big data analysis with limited memory in streaming mode.

Return type:

Pipeline
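
The effect of the stride and chunksize parameters can be illustrated with a small NumPy-only sketch. It is not PyEMMA's implementation: the helper iter_strided_chunks and the toy trajectory are hypothetical, and the sketch assumes that chunksize counts frames after striding, which may differ from the library's behavior. It only shows the general idea of streaming over subsampled data in fixed-size batches so that a statistic (here, the mean) can be accumulated without processing all frames at once.

```python
import numpy as np


def iter_strided_chunks(traj, stride=1, chunksize=100):
    """Yield batches of at most `chunksize` frames, keeping every `stride`-th frame.

    Hypothetical helper for illustration only; not part of PyEMMA.
    """
    strided = traj[::stride]  # basic slicing returns a view, not a copy
    for start in range(0, len(strided), chunksize):
        yield strided[start:start + chunksize]


# Toy "trajectory": 1000 frames with 2 features each.
traj = np.arange(2000, dtype=float).reshape(1000, 2)

# Accumulate a running sum chunk by chunk, as a streaming estimator would.
total = np.zeros(traj.shape[1])
n_frames = 0
n_chunks = 0
for chunk in iter_strided_chunks(traj, stride=10, chunksize=30):
    total += chunk.sum(axis=0)
    n_frames += len(chunk)
    n_chunks += 1

mean = total / n_frames
```

With stride=10, only 100 of the 1000 frames are visited, and with chunksize=30 they arrive in four batches (30, 30, 30 and 10 frames). A larger stride reduces the work done during parametrization, while chunksize bounds how many frames are held in a batch at any one step.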