pyemma.msm.timescales_hmsm

pyemma.msm.timescales_hmsm(dtrajs, nstates, lags=None, nits=None, reversible=True, stationary=False, connectivity=None, mincount_connectivity='1/n', separate=None, errors=None, nsamples=100, stride=None, n_jobs=None, show_progress=True)

Calculate implied timescales from Hidden Markov state models estimated at a series of lag times.

Warning: this can be slow!

Parameters
  • dtrajs (array-like or list of array-likes) – discrete trajectories

  • nstates (int) – number of hidden states

  • lags (int, array-like with integers or None, optional) – integer lag times at which the implied timescales will be calculated. If set to None (default) as list of lag times will be automatically generated. For a single int, generate a set of lag times starting from 1 to lags, using a multiplier of 1.5 between successive lags.

  • nits (int (optional)) – number of implied timescales to be computed. Will compute less if the number of states are smaller. None means the number of timescales will be determined automatically.

  • connectivity (str, optional, default = None) –

    Defines if the resulting HMM will be defined on all hidden states or on a connected subset. Connectivity is defined by counting only transitions with at least mincount_connectivity counts. If a subset of states is used, all estimated quantities (transition matrix, stationary distribution, etc) are only defined on this subset and are correspondingly smaller than nstates. Following modes are available: * None or ‘all’ : The active set is the full set of states.

    Estimation is done on all weakly connected subsets separately. The resulting transition matrix may be disconnected.

    • ’largest’ : The active set is the largest reversibly connected set.

    • ’populous’The active set is the reversibly connected set with

      most counts.

  • mincount_connectivity (float or '1/n') – minimum number of counts to consider a connection between two states. Counts lower than that will count zero in the connectivity check and may thus separate the resulting transition matrix. The default evaluates to 1/nstates.

  • separate (None or iterable of int) – Force the given set of observed states to stay in a separate hidden state. The remaining nstates-1 states will be assigned by a metastable decomposition.

  • reversible (boolean (optional)) – Estimate transition matrix reversibly (True) or nonreversibly (False)

  • stationary (bool, optional, default=False) – If True, the initial distribution of hidden states is self-consistently computed as the stationary distribution of the transition matrix. If False, it will be estimated from the starting states. Only set this to true if you’re sure that the observation trajectories are initiated from a global equilibrium distribution.

  • errors (None | 'bayes') – Specifies whether to compute statistical uncertainties (by default not), an which algorithm to use if yes. The only option is currently ‘bayes’. This algorithm is much faster than MSM-based error calculation because the involved matrices are much smaller.

  • nsamples (int) – Number of approximately independent HMSM samples generated for each lag time for uncertainty quantification. Only used if errors is not None.

  • n_jobs (int) – how many subprocesses to start to estimate the models for each lag time.

  • show_progress (bool, default=True) – Show progressbars for calculation?

Returns

itsobj

Return type

ImpliedTimescales object

See also

ImpliedTimescales

The object returned by this function.

pyemma.plots.plot_implied_timescales

Plotting function for the ImpliedTimescales object

Example

>>> from pyemma import msm
>>> import numpy as np
>>> np.set_printoptions(precision=3)
>>> dtraj = [0,1,1,0,0,0,1,1,0,0,0,1,2,2,2,2,2,2,2,2,2,1,1,0,0,0,1,1,0,1,0]   # mini-trajectory
>>> ts = msm.timescales_hmsm(dtraj, 2, [1,2,3,4], show_progress=False)
>>> print(ts.timescales) 
[[ 5.786]
 [ 5.143]
 [ 4.44 ]
 [ 3.677]]
class pyemma.msm.estimators.implied_timescales.ImpliedTimescales(*args, **kwargs)

Methods

estimate(X, **params)

param X

discrete trajectories

fit(X[, y])

Estimates parameters - for compatibility with sklearn.

get_params([deep])

Get parameters for this estimator.

get_sample_conf([conf, process])

Returns the confidence interval that contains alpha % of the sample data

get_sample_mean([process])

Returns the sample means of implied timescales.

get_sample_std([process])

Returns the standard error of implied timescales.

get_timescales([process])

Returns the implied timescale estimates

load(file_name[, model_name])

Loads a previously saved PyEMMA object from disk.

save(file_name[, model_name, overwrite, …])

saves the current state of this object to given file and name.

set_params(**params)

Set the parameters of this estimator.

Attributes

estimate(X, **params)
Parameters
  • X (lists of integer arrays) – discrete trajectories

  • estimator (Estimator) – Estimator to be used for estimating timescales at each lag time.

  • lags (array-like with integers or None, optional) – integer lag times at which the implied timescales will be calculated. If set to None (default) as list of lagtimes will be automatically generated.

  • nits (int, optional) – maximum number of implied timescales to be computed and stored. If less timescales are available, nits will be set to a smaller value during estimation. None means the number of timescales will be automatically determined.

  • n_jobs (int, optional) – how many subprocesses to start to estimate the models for each lag time.

property estimators

Returns the estimators for all lagtimes.

property fraction_of_frames

Returns the fraction of frames used to compute the count matrix at each lag time. .. rubric:: Notes

In a list of discrete trajectories with varying lengths, the estimation at longer lag times will mean discarding some trajectories for which not even one count can be computed. This function returns the fraction of frames that was actually used in computing the count matrix.

Be aware: this fraction refers to the full count matrix, and not that of the largest connected set. Hence, the output is not necessarily the active fraction. For that, use the activte_count_fraction function of the pyemma.msm.MaximumLikelihoodMSM class object or for HMM respectively.

get_sample_conf(conf=0.95, process=None)

Returns the confidence interval that contains alpha % of the sample data

etc.

Parameters

conf (float, default = 0.95) –

the confidence interval. Use:

  • conf = 0.6827 for 1-sigma confidence interval

  • conf = 0.9545 for 2-sigma confidence interval

  • conf = 0.9973 for 3-sigma confidence interval

Returns

(L,R) – lower and upper timescales bounding the confidence interval

  • if process is None, will return two (l x k) arrays, where l is the number of lag times and k is the number of computed timescales.

  • if process is an integer, will return two (l)-arrays with the selected process time scale for every lag time

Return type

(float[],float[]) or (float[][],float[][])

get_sample_mean(process=None)

Returns the sample means of implied timescales. Only available if underlying estimator produces samples.

Parameters

process (int or None, default = None) – index in [0:n-1] referring to the process whose timescale will be returned. By default, process = None and all computed process timescales will be returned.

Returns

  • if process is None, will return a (l x k) array, where l is the number of lag times

  • and k is the number of computed timescales.

  • if process is an integer, will return a (l) array with the selected process time scale

  • for every lag time

get_sample_std(process=None)

Returns the standard error of implied timescales. Only available if underlying estimator produces samples.

Parameters

process (int or None, default = None) – index in [0:n-1] referring to the process whose timescale will be returned. By default, process = None and all computed process timescales will be returned.

Returns

  • if process is None, will return a (l x k) array, where l is the number of lag times

  • and k is the number of computed timescales.

  • if process is an integer, will return a (l) array with the selected process time scale

  • for every lag time

get_timescales(process=None)

Returns the implied timescale estimates

Parameters

process (int or None, default = None) – index in [0:n-1] referring to the process whose timescale will be returned. By default, process = None and all computed process timescales will be returned.

Returns

  • if process is None, will return a (l x k) array, where l is the number of lag times

  • and k is the number of computed timescales.

  • if process is an integer, will return a (l) array with the selected process time scale

  • for every lag time

property lags

Return the list of lag times for which timescales were computed.

property lagtimes

Return the list of lag times for which timescales were computed.

property models

Returns the models for all lagtimes.

property nits

Return the number of timescales.

property number_of_timescales

Return the number of timescales.

property sample_mean

Returns the sample means of implied timescales. Need to generate the samples first, e.g. by calling bootstrap

Returns

timescales – mean timescales for all processes and lag times. l is the number of lag times and k is the number of computed timescales.

Return type

ndarray((l x k), dtype=float)

property sample_std

Returns the standard error of implied timescales. Only available if underlying estimator produces samples.

Returns

timescales – standard deviations of timescales for all processes and lag times. l is the number of lag times and k is the number of computed timescales.

Return type

ndarray((l x k), dtype=float)

property samples_available

Returns True if samples are available and thus sample means, standard errors and confidence intervals can be obtained

property timescales

Returns the implied timescale estimates

Returns

timescales – timescales for all processes and lag times. l is the number of lag times and k is the number of computed timescales.

Return type

ndarray((l x k), dtype=float)

References

Implied timescales as a lagtime-selection and MSM-validation approach were suggested in 1. Hidden Markov state model estimation is done here as described in 2. For uncertainty quantification we employ the Bayesian sampling algorithm described in 3.

1

Swope, W. C. and J. W. Pitera and F. Suits: Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B 108: 6571-6581 (2004)

2

F. Noe, H. Wu, J.-H. Prinz and N. Plattner: Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys. 139, 184114 (2013)

3

J. D. Chodera et al: Bayesian hidden Markov model analysis of single-molecule force spectroscopy: Characterizing kinetics under measurement uncertainty arXiv:1108.1430 (2011)