pyemma.coordinates.tica

pyemma.coordinates.tica(data=None, lag=10, dim=-1, var_cutoff=0.95, kinetic_map=True, commute_map=False, weights='empirical', stride=1, remove_mean=True, skip=0, reversible=True, ncov_max=inf, chunksize=None, **kwargs)

Time-lagged independent component analysis (TICA).

TICA is a linear transformation method. In contrast to PCA, which finds coordinates of maximal variance, TICA finds coordinates of maximal autocorrelation at the given lag time. TICA is therefore useful for finding the slow components in a dataset and is thus an excellent choice for transforming molecular dynamics data before clustering it for the construction of a Markov model. When the input data is the result of a Markov process (such as thermostatted molecular dynamics), TICA in fact finds an approximation to the eigenfunctions and eigenvalues of the underlying Markov operator [1].

This function estimates a TICA transformation from data. When input data is given as an argument, the estimation is carried out immediately, and the resulting object can be used to obtain eigenvalues and eigenvectors or to project input data onto the slowest TICA components. If no data is given, an empty estimator is returned, which can be put into a pipeline() in order to use TICA in streaming mode.

Parameters
  • data (ndarray (T, d), list of ndarray (T_i, d), or a reader created by the source function) – array with the data, if available. When given, the TICA transformation is immediately computed and can be used to transform data.

  • lag (int, optional, default = 10) – the lag time, in multiples of the input time step

  • dim (int, optional, default -1) – the number of dimensions (independent components) to project onto. A call to the map function reduces the d-dimensional input to only dim dimensions such that the data preserves the maximum possible autocorrelation amongst dim-dimensional linear projections. -1 means all numerically available dimensions will be used unless reduced by var_cutoff. Setting dim to a positive value is exclusive with var_cutoff.

  • var_cutoff (float in the range [0,1], optional, default 0.95) – Determines the number of output dimensions by including dimensions until their cumulative kinetic variance exceeds the fraction var_cutoff. var_cutoff=1.0 means all numerically available dimensions (see epsilon) will be used, unless restricted by dim. Setting var_cutoff smaller than 1.0 is exclusive with dim.

  • kinetic_map (bool, optional, default True) – Eigenvectors will be scaled by eigenvalues. As a result, Euclidean distances in the transformed data approximate kinetic distances [4]. This is a good choice when the data is further processed by clustering.

  • commute_map (bool, optional, default False) – Eigenvector_i will be scaled by sqrt(timescale_i / 2). As a result, Euclidean distances in the transformed data will approximate commute distances [5].

  • stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this can make the calculation very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. Note that the stride option in the get_output() function of the returned object is independent, so you can parametrize at a long stride and still map all frames through the transformer.

  • weights (optional, default="empirical") –

    Re-weighting strategy to be used in order to compute equilibrium covariances from non-equilibrium data.
    • "empirical": no re-weighting

    • "koopman": use the re-weighting procedure from [6]

    • weights: an object that can compute re-weighting factors. It must possess a method weights(X) that accepts a trajectory X (np.ndarray(T, n)) and returns a vector of re-weighting factors (np.ndarray(T,)); see the sketch after this parameter list.

  • remove_mean (bool, optional, default True) – remove mean during covariance estimation. Should not be turned off.

  • skip (int, default=0) – skip the first n frames of each trajectory.

  • reversible (bool, default=True) – symmetrize correlation matrices C_0, C_{tau}.

  • ncov_max (int, default=infinity) – limit the memory usage of the algorithm from [7] to an amount that corresponds to ncov_max additional copies of each correlation matrix.

  • chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value here to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.
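
As a sketch (the class name here is hypothetical, not part of the pyemma API), a custom re-weighting object only needs the weights(X) method described above:

>>> import numpy as np
>>> class UniformWeights:
...     """Hypothetical user-supplied re-weighting object."""
...     def weights(self, X):
...         # one re-weighting factor per frame; all ones reproduces the
...         # default "empirical" (no re-weighting) behaviour
...         return np.ones(X.shape[0])
>>> # usage: pyemma.coordinates.tica(data, lag=10, weights=UniformWeights())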

Returns

tica – Object for time-lagged independent component (TICA) analysis. It contains TICA eigenvalues and eigenvectors, and the projection of the input data onto the dominant TICA components.

Return type

a TICA transformation object

Notes

Given a sequence of multivariate data \(X_t\), it computes the mean-free covariance and time-lagged covariance matrices:

\[\begin{split}C_0 &= (X_t - \mu)^T \mathrm{diag}(w) (X_t - \mu) \\ C_{\tau} &= (X_t - \mu)^T \mathrm{diag}(w) (X_{t+\tau} - \mu)\end{split}\]

where w is a vector of weights for each time step. By default, these weights are all equal to one, but different weights are possible, like the re-weighting to equilibrium described in [6]. Subsequently, the eigenvalue problem

\[C_{\tau} r_i = C_0 \lambda_i r_i,\]

is solved, where \(r_i\) are the independent components and \(\lambda_i\) are their respective normalized time-autocorrelations. The eigenvalues are related to the relaxation timescale by

\[t_i = -\frac{\tau}{\ln |\lambda_i|}.\]

When used as a dimension reduction method, the input data is projected onto the dominant independent components.
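
The following is a minimal numerical sketch of these equations (unit weights, a single global mean, and a symmetrized time-lagged covariance); it illustrates the computation, not pyemma's actual estimator:

>>> import numpy as np
>>> import scipy.linalg
>>> def tica_sketch(X, lag, dim):
...     X = X - X.mean(axis=0)                       # remove the mean
...     X0, Xlag = X[:-lag], X[lag:]                 # instantaneous / time-lagged frames
...     C0 = X0.T @ X0 / (len(X) - lag)              # covariance C_0
...     Ct = 0.5 * (X0.T @ Xlag + Xlag.T @ X0) / (len(X) - lag)  # symmetrized C_tau
...     vals, vecs = scipy.linalg.eigh(Ct, C0)       # solve C_tau r = lambda C_0 r
...     order = np.argsort(vals)[::-1]               # slowest components first
...     vals, vecs = vals[order], vecs[:, order]
...     return X @ vecs[:, :dim], vals[:dim], -lag / np.log(np.abs(vals[:dim]))
>>> Y, lambdas, timescales = tica_sketch(np.random.random((500, 3)), lag=2, dim=1)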

TICA was originally introduced for signal processing in [2]. It was introduced to molecular dynamics and as a method for the construction of Markov models in [1] and [3]. It was shown in [1] that when applied to molecular dynamics data, TICA is an approximation to the eigenvalues and eigenvectors of the true underlying dynamics.

Examples

Invoke TICA transformation with a given lag time and output dimension:

>>> import numpy as np
>>> from pyemma.coordinates import tica
>>> data = np.random.random((100,3))
>>> projected_data = tica(data, lag=2, dim=1).get_output()[0]

For a brief explanation of why TICA outperforms PCA for extracting a good reaction coordinate, have a look here.
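
Keeping the estimator object itself gives access to the estimated quantities and allows mapping new data (attribute names assumed from the TICA transformer documented below; outputs omitted):

>>> tica_obj = tica(data, lag=2, dim=1)
>>> eigenvalues = tica_obj.eigenvalues               # autocorrelations of the components
>>> timescales = tica_obj.timescales                 # implied relaxation timescales
>>> mapped = tica_obj.transform(np.random.random((50, 3)))   # project new data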

See also

TICA : tica object

pca : for principal component analysis

class pyemma.coordinates.transform.tica.TICA(*args, **kwargs)

Time-lagged independent component analysis (TICA)

Methods

describe()

Get a descriptive string representation of this class.

dimension()

output dimension

estimate(X, **kwargs)

Chunk-based parameterization of TICA.

fit(X[, y])

Estimates parameters - for compatibility with sklearn.

fit_transform(X[, y])

Fit to data, then transform it.

get_output([dimensions, stride, skip, chunk])

Maps all input data of this transformer and returns it as an array or list of arrays

get_params([deep])

Get parameters for this estimator.

iterator([stride, lag, chunk, …])

creates an iterator to stream over the (transformed) data.

load(file_name[, model_name])

Loads a previously saved PyEMMA object from disk.

n_chunks(chunksize[, stride, skip])

how many chunks an iterator of this source will output for the given chunksize, stride and skip.

n_frames_total([stride, skip])

Returns total number of frames.

number_of_trajectories([stride])

Returns the number of trajectories.

output_type()

By default transformers return single precision floats.

partial_fit(X)

incrementally update the covariances and mean.

save(file_name[, model_name, overwrite, …])

saves the current state of this object to given file and name.

set_params(**params)

Set the parameters of this estimator.

trajectory_length(itraj[, stride, skip])

Returns the length of trajectory of the requested index.

trajectory_lengths([stride, skip])

Returns the length of each trajectory.

transform(X)

Maps the input data through the transformer to correspondingly shaped output data array/list.

write_to_csv([filename, extension, …])

write all data to csv with numpy.savetxt

write_to_hdf5(filename[, group, …])

writes all data of this Iterable to a given HDF5 file.

Attributes

property data_producer

The data producer for this data source object (can be another data source object).

describe()

Get a descriptive string representation of this class.

estimate(X, **kwargs)

Chunk-based parameterization of TICA. Iterates over all data and estimates the mean, covariance and time lagged covariance. Finally, the generalized eigenvalue problem is solved to determine the independent components.

property model

The model estimated by this Estimator

partial_fit(X)

incrementally update the covariances and mean.

Parameters

X (array, list of arrays, PyEMMA reader) – input data.

Notes

The projection matrix is only calculated upon its first access.
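
A sketch of incremental estimation with partial_fit, constructing the transformer directly (constructor arguments assumed to mirror the tica() function above; outputs omitted):

>>> import numpy as np
>>> from pyemma.coordinates.transform.tica import TICA
>>> tica_inc = TICA(lag=2)
>>> for _ in range(5):
...     tica_inc.partial_fit(np.random.random((100, 3)))    # feed data chunk by chunk
>>> mapped = tica_inc.transform(np.random.random((10, 3)))  # model built on first access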

References

[1] Perez-Hernandez G, F Paul, T Giorgino, G De Fabritiis and F Noe. 2013. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102. doi:10.1063/1.4811489

[2] L. Molgedey and H. G. Schuster. 1994. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 72, 3634.

[3] Schwantes C, V S Pande. 2013. Improvements in Markov State Model Construction Reveal Many Non-Native Interactions in the Folding of NTL9. J. Chem. Theory Comput. 9, 2000-2009. doi:10.1021/ct300878a

[4] Noe, F. and Clementi, C. 2015. Kinetic distance and kinetic maps from molecular dynamics simulation. J. Chem. Theory Comput. doi:10.1021/acs.jctc.5b00553

[5] Noe, F., Banisch, R., Clementi, C. 2016. Commute maps: separating slowly-mixing molecular configurations for kinetic modeling. J. Chem. Theory Comput. doi:10.1021/acs.jctc.6b00762

[6] Wu, H., Nueske, F., Paul, F., Klus, S., Koltai, P., and Noe, F. 2016. Bias reduced variational approximation of molecular kinetics from short off-equilibrium simulations. J. Chem. Phys. (submitted), https://arxiv.org/abs/1610.06773

[7] Chan, T. F., Golub G. H., LeVeque R. J. 1979. Updating formulae and pairwise algorithms for computing sample variances. Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University.