pyemma.msm.its¶

pyemma.msm.its(dtrajs, lags=None, nits=None, reversible=True, connected=True, weights='empirical', errors=None, nsamples=50, n_jobs=None, show_progress=True, mincount_connectivity='1/n', only_timescales=False, core_set=None, milestoning_method='last_core')¶

Implied timescales from Markov state models estimated at a series of lag times.

Parameters

dtrajs (array-like or list of array-likes) – discrete trajectories
lags (int, array-like with integers or None, optional) – integer lag times at which the implied timescales will be calculated. If set to None (default), a list of lag times will be generated automatically. For a single int, generate a set of lag times from 1 to lags, using a multiplier of 1.5 between successive lags.
nits (int, optional) – number of implied timescales to be computed. Fewer will be computed if the number of states is smaller. If None, the number of timescales will be automatically determined.
reversible (boolean, optional) – Estimate transition matrix reversibly (True) or nonreversibly (False)
connected (boolean, optional) – If True, compute the connected set before transition matrix estimation, separately at each lag
weights (str, optional) – can be used to re-weight non-equilibrium data to equilibrium. Must be one of the following:
- 'empirical': Each trajectory frame counts as one. (default)
- 'oom': Each transition is re-weighted using OOM theory, see 5.
errors (None | 'bayes', optional) – Specifies whether to compute statistical uncertainties (by default not), and which algorithm to use if yes. Currently the only option is:
- 'bayes' for Bayesian sampling of the posterior
Attention:
- The Bayes mode will use an estimate for the effective count matrix that may produce somewhat different estimates than the 'sliding window' estimate used with errors=None by default.
- Computing errors can be slow if the MSM has many states.
- There are still unsolved theoretical problems in the computation of effective count matrices, and therefore the uncertainty interval and the maximum likelihood estimator can be inconsistent. Use this as a rough guess for statistical uncertainties.
nsamples (int, optional) – The number of approximately independent transition matrix samples generated for each lag time for uncertainty quantification. Only used if errors is not None.
n_jobs (int, optional) – how many subprocesses to start to estimate the models for each lag time.
show_progress (bool, default=True) – whether to show progress of estimation.
mincount_connectivity (float or '1/n') – minimum number of counts to consider a connection between two states. Counts lower than that are treated as zero in the connectivity check and may thus separate the resulting transition matrix. The default evaluates to 1/nstates.
only_timescales (bool, default=False) – If you are only interested in the timescales and their samples, consider turning this on to save memory. This can be useful to avoid blowing up memory with BayesianMSM and lots of samples.
core_set (None (default) or array like, dtype=int) – Definition of core set for milestoning MSMs. If set to None, replaces state -1 (if found in discrete trajectories) and performs milestone counting. No effect for Voronoi-discretized trajectories (default). If a list or np.ndarray is supplied, discrete trajectories will be assigned accordingly.
milestoning_method (str) – Method to use for counting transitions in trajectories with unassigned frames. Currently available: 'last_core', which assigns unassigned frames to the last visited core.
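
The automatic lag series described for a single integer lags can be illustrated with a small sketch. It is not PyEMMA's implementation: the function name generate_lags, the round-half-up rule and the choice to always append the maximal lag are assumptions made for illustration.

```python
import math


def generate_lags(maxlag, multiplier=1.5):
    """Hypothetical helper: lag times from 1 to maxlag, growing by `multiplier`."""
    lags = [1]
    lag = 1
    while True:
        # Assumption: multiply, then round half up to the next integer lag.
        lag = math.floor(lag * multiplier + 0.5)
        if lag > maxlag:
            break
        lags.append(lag)
    # Assumption: the requested maximal lag is always part of the series.
    if maxlag not in lags:
        lags.append(maxlag)
    return lags


lags = generate_lags(20)
```

With a multiplier of 1.5 the spacing between lags grows roughly geometrically, so short lag times are sampled densely and long ones sparsely.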

Returns

itsobj

Return type

ImpliedTimescales object

Example

>>> from pyemma import msm
>>> dtraj = [0,1,1,2,2,2,1,2,2,2,1,0,0,1,1,1,2,2,1,1,2,1,1,0,0,0,1,1,2,2,1]   # mini-trajectory
>>> ts = msm.its(dtraj, [1,2,3,4,5], show_progress=False)
>>> print(ts.timescales)  
[[ 1.5...  0.2...]
 [ 3.1...  1.0...]
 [ 2.03...  1.02...]
 [ 4.63...  3.42...]
 [ 5.13...  2.59...]]
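
For orientation, an implied timescale at lag time tau is t_i = -tau / ln|lambda_i(tau)|, where lambda_i is a non-stationary eigenvalue of the transition matrix estimated at that lag. The following NumPy sketch applies this formula to the mini-trajectory above. It uses a plain row-normalized sliding-window count matrix (a nonreversible estimate), so it is a simplification and not what msm.its does with reversible=True; the helper names are hypothetical.

```python
import numpy as np


def count_matrix(dtraj, lag, nstates):
    """Sliding-window transition counts at the given lag."""
    C = np.zeros((nstates, nstates))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1
    return C


def implied_timescales(dtraj, lag, nits=None):
    """Timescales -lag / ln|lambda_i| from a row-normalized count matrix."""
    dtraj = np.asarray(dtraj)
    nstates = dtraj.max() + 1
    C = count_matrix(dtraj, lag, nstates)
    # Assumption: every state has at least one outgoing count at this lag.
    T = C / C.sum(axis=1, keepdims=True)
    # Sort eigenvalue magnitudes in descending order; the first one is 1.
    ev = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    ts = -lag / np.log(ev[1:])
    return ts[:nits]


dtraj = [0,1,1,2,2,2,1,2,2,2,1,0,0,1,1,1,2,2,1,1,2,1,1,0,0,0,1,1,2,2,1]
ts_lag1 = implied_timescales(dtraj, lag=1)
```

Repeating the call for several lags gives one row per lag time, which is the layout of the timescales array printed in the example.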

See also

ImpliedTimescales: The object returned by this function.
pyemma.plots.plot_implied_timescales: Implied timescales plotting function. Just call it with the ImpliedTimescales object produced by this function as an argument.

class pyemma.msm.estimators.implied_timescales.ImpliedTimescales(*args, **kwargs)¶

Methods

`estimate`(X, **params)	X: discrete trajectories
`fit`(X[, y])	Estimates parameters - for compatibility with sklearn.
`get_params`([deep])	Get parameters for this estimator.
`get_sample_conf`([conf, process])	Returns the confidence interval that contains a fraction conf of the sample data
`get_sample_mean`([process])	Returns the sample means of implied timescales.
`get_sample_std`([process])	Returns the standard error of implied timescales.
`get_timescales`([process])	Returns the implied timescale estimates
`load`(file_name[, model_name])	Loads a previously saved PyEMMA object from disk.
`save`(file_name[, model_name, overwrite, …])	saves the current state of this object to given file and name.
`set_params`(**params)	Set the parameters of this estimator.

Attributes

estimate(X, **params)¶

Parameters

X (lists of integer arrays) – discrete trajectories
estimator (Estimator) – Estimator to be used for estimating timescales at each lag time.
lags (array-like with integers or None, optional) – integer lag times at which the implied timescales will be calculated. If set to None (default), a list of lag times will be generated automatically.
nits (int, optional) – maximum number of implied timescales to be computed and stored. If fewer timescales are available, nits will be set to a smaller value during estimation. None means the number of timescales will be automatically determined.
n_jobs (int, optional) – how many subprocesses to start to estimate the models for each lag time.

property estimators¶: Returns the estimators for all lagtimes.

property fraction_of_frames¶

Returns the fraction of frames used to compute the count matrix at each lag time.

Notes

In a list of discrete trajectories with varying lengths, the estimation at longer lag times will mean discarding some trajectories for which not even one count can be computed. This function returns the fraction of frames that was actually used in computing the count matrix.

Be aware: this fraction refers to the full count matrix, and not that of the largest connected set. Hence, the output is not necessarily the active fraction. For that, use active_count_fraction of the pyemma.msm.MaximumLikelihoodMSM object, or of the corresponding HMM object.
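
The notes above can be made concrete with a short sketch. It assumes that a trajectory contributes all of its frames as soon as at least one transition count can be formed at a given lag, and none otherwise; the function name and the example lengths are illustrative only.

```python
import numpy as np


def fraction_of_frames(traj_lengths, lags):
    """Fraction of all frames that lie in trajectories longer than each lag."""
    lengths = np.asarray(traj_lengths)
    total = lengths.sum()
    fractions = []
    for lag in lags:
        # A trajectory yields at least one count only if length - lag >= 1.
        usable = lengths[lengths - lag >= 1]
        fractions.append(usable.sum() / total)
    return np.array(fractions)


# Three trajectories of very different lengths (made-up numbers).
frac = fraction_of_frames([100, 10, 5], lags=[1, 5, 10, 50])
```

Here the two short trajectories drop out as the lag grows, so the fraction decreases from 1 to 100/115.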

get_sample_conf(conf=0.95, process=None)¶

Returns the confidence interval that contains a fraction conf of the sample data.

Parameters

conf (float, default = 0.95) – the confidence interval. Use:
- conf = 0.6827 for 1-sigma confidence interval
- conf = 0.9545 for 2-sigma confidence interval
- conf = 0.9973 for 3-sigma confidence interval

Returns

(L,R) – lower and upper timescales bounding the confidence interval

if process is None, will return two (l x k) arrays, where l is the number of lag times and k is the number of computed timescales.
if process is an integer, will return two (l)-arrays with the selected process time scale for every lag time

Return type

(float[],float[]) or (float[][],float[][])
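
The sigma levels listed for conf follow from the normal distribution: a k-sigma interval covers erf(k / sqrt(2)) of the probability mass. The sketch below checks those values and shows one simple way to obtain lower and upper bounds from an array of sampled timescales. The percentile-based interval and the (nsamples, l, k) sample layout are assumptions for illustration and may differ from how PyEMMA computes its intervals.

```python
import math

import numpy as np


def sigma_to_conf(k):
    """Probability mass of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))


def percentile_interval(samples, conf=0.95):
    """Lower and upper bounds holding a fraction `conf` of the samples.

    samples: hypothetical array of shape (nsamples, l, k).
    """
    tail = 100.0 * (1.0 - conf) / 2.0
    L = np.percentile(samples, tail, axis=0)
    R = np.percentile(samples, 100.0 - tail, axis=0)
    return L, R


rng = np.random.default_rng(0)
fake_samples = rng.normal(loc=5.0, scale=1.0, size=(200, 5, 2))  # synthetic data
L, R = percentile_interval(fake_samples, conf=sigma_to_conf(1))
```

Both returned arrays have shape (l, k); selecting one column gives the (l)-array returned for a single process.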

get_sample_mean(process=None)¶

Returns the sample means of implied timescales. Only available if underlying estimator produces samples.

Parameters

process (int or None, default = None) – index in [0:n-1] referring to the process whose timescale will be returned. By default, process = None and all computed process timescales will be returned.

Returns

if process is None, will return a (l x k) array, where l is the number of lag times
and k is the number of computed timescales.
if process is an integer, will return a (l) array with the selected process time scale
for every lag time

get_sample_std(process=None)¶

Returns the standard error of implied timescales. Only available if underlying estimator produces samples.

Parameters

process (int or None, default = None) – index in [0:n-1] referring to the process whose timescale will be returned. By default, process = None and all computed process timescales will be returned.

Returns

if process is None, will return a (l x k) array, where l is the number of lag times
and k is the number of computed timescales.
if process is an integer, will return a (l) array with the selected process time scale
for every lag time

get_timescales(process=None)¶

Returns the implied timescale estimates

Parameters

process (int or None, default = None) – index in [0:n-1] referring to the process whose timescale will be returned. By default, process = None and all computed process timescales will be returned.

Returns

if process is None, will return a (l x k) array, where l is the number of lag times
and k is the number of computed timescales.
if process is an integer, will return a (l) array with the selected process time scale
for every lag time
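
The two return shapes can be pictured as plain array indexing. This is only a sketch of the shape convention, assuming rows correspond to lag times and columns to processes; select_timescales is a hypothetical helper, not part of PyEMMA.

```python
import numpy as np


def select_timescales(ts, process=None):
    """Return the full (l x k) array, or one column as an (l)-array."""
    ts = np.asarray(ts)
    if process is None:
        return ts
    return ts[:, process]


# Made-up array with l = 5 lag times and k = 2 processes.
ts = np.arange(10.0).reshape(5, 2)
all_ts = select_timescales(ts)
slowest = select_timescales(ts, process=0)
```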

property lags¶: Return the list of lag times for which timescales were computed.

property lagtimes¶: Return the list of lag times for which timescales were computed.

property models¶: Returns the models for all lagtimes.

property nits¶: Return the number of timescales.

property number_of_timescales¶: Return the number of timescales.

property sample_mean¶

Returns the sample means of implied timescales. Need to generate the samples first, e.g. by calling bootstrap

Returns: timescales – mean timescales for all processes and lag times. l is the number of lag times and k is the number of computed timescales.
Return type: ndarray((l x k), dtype=float)

property sample_std¶

Returns the standard error of implied timescales. Only available if underlying estimator produces samples.

Returns: timescales – standard deviations of timescales for all processes and lag times. l is the number of lag times and k is the number of computed timescales.
Return type: ndarray((l x k), dtype=float)

property samples_available¶: Returns True if samples are available and thus sample means, standard errors and confidence intervals can be obtained

property timescales¶

Returns the implied timescale estimates

Returns: timescales – timescales for all processes and lag times. l is the number of lag times and k is the number of computed timescales.
Return type: ndarray((l x k), dtype=float)

References

Implied timescales as a lagtime-selection and MSM-validation approach were suggested in 1. Error estimation is done either using moving block bootstrapping 2 or a Bayesian analysis using Metropolis-Hastings Monte Carlo sampling of the posterior. Nonreversible Bayesian sampling is done by independently sampling Dirichlet distributions of the transition matrix rows. A Monte Carlo method for sampling reversible MSMs was introduced in 3. Here we employ a much more efficient algorithm introduced in 4.
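
The nonreversible case mentioned above, independent Dirichlet sampling of each transition matrix row, can be sketched in a few lines of NumPy. The uniform prior count added to every entry, the function name and the example count matrix are assumptions for illustration; this is not PyEMMA's sampler.

```python
import numpy as np


def sample_transition_matrices(C, nsamples=50, prior=1.0, rng=None):
    """Draw transition matrices whose rows are independent Dirichlet samples."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: a uniform prior count keeps all Dirichlet parameters positive.
    alpha = np.asarray(C, dtype=float) + prior
    samples = [
        np.vstack([rng.dirichlet(row) for row in alpha])
        for _ in range(nsamples)
    ]
    return np.stack(samples)


# Small made-up count matrix for three states.
C = np.array([[3, 3, 0],
              [2, 6, 5],
              [0, 5, 6]])
T_samples = sample_transition_matrices(C, nsamples=50, rng=np.random.default_rng(0))
```

Computing the timescales of every sampled matrix and taking their mean, standard deviation or percentiles then yields sample-based uncertainty estimates.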

1: Swope, W. C. and J. W. Pitera and F. Suits: Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B 108: 6571-6581 (2004)
2: Kuensch, H. R.: The jackknife and the bootstrap for general stationary observations. Ann. Stat. 17, 1217-1241 (1989)
3: Noe, F.: Probability Distributions of Molecular Observables computed from Markov Models. J. Chem. Phys. 128, 244103 (2008)
4: Trendelkamp-Schroer, B, H. Wu, F. Paul and F. Noe: Estimation and uncertainty of reversible Markov models. http://arxiv.org/abs/1507.05990
5: Nueske, F., Wu, H., Prinz, J.-H., Wehmeyer, C., Clementi, C. and Noe, F.: Markov State Models from short non-Equilibrium Simulations - Analysis and Correction of Estimation Bias. J. Chem. Phys. (submitted) (2017)