01 - Data-I/O and featurization

In this notebook, we will cover how to load (and visualize) data with PyEMMA. We are going to extract different features (collective variables) and compare how well these are suited for Markov state model building. Further we will introduce the concept of streaming data, which is mandatory to work with large data sets.

As with the other notebooks, we employ multiple examples. The idea is, first, to highlight the fundamental ideas with a non-physical test system (diffusion in a 2D double-well potential) and, second, to demonstrate real-world applications with molecular dynamics data.

Maintainers: [@cwehmeyer](https://github.com/cwehmeyer), [@marscher](https://github.com/marscher), [@thempel](https://github.com/thempel), [@psolsson](https://github.com/psolsson)

Remember: - to run the currently highlighted cell, hold ⇧ Shift and press ⏎ Enter; - to get help for a specific function, place the cursor within the function’s brackets, hold ⇧ Shift, and press ⇥ Tab; - you can find the full documentation at PyEMMA.org.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma
# for visualization of molecular structures:
import nglview
import mdtraj
from threading import Timer
from nglview.player import TrajectoryPlayer

Case 1: preprocessed data (toy model)

In the most convenient case, we already have preprocessed time series data available in some kind of archive which can be read using numpy:

file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']

print(data)

Please note that mdshare is a package designed for distributing our tutorial data; in general, only local file paths need to be provided.

Once we have the data in memory, we can use one of PyEMMA’s plotting functions to visualize what we have loaded:

pyemma.plots.plot_feature_histograms(data, feature_labels=['$x$', '$y$']);
../_images/01-data-io-and-featurization_5_0.png

The plot_feature_histograms() function visualizes the distributions of all degrees of freedom assuming that the columns of data represent different features and the rows represent different time steps.

While plot_feature_histograms() can handle arbitrary numbers of features, we have additional plotting functions for 2D projections. First, we visualize the sample density in the \(x/y\)-plane…

fig, ax, misc = pyemma.plots.plot_density(*data.T)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
fig.tight_layout()
../_images/01-data-io-and-featurization_7_0.png

… then, we show the corresponding pseudo free energy surface:

fig, ax, misc = pyemma.plots.plot_free_energy(*data.T, legacy=False)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
fig.tight_layout()
../_images/01-data-io-and-featurization_9_0.png

Both functions make a two-dimensional histogram for the given features; the free energy surface is defined by the negative logarithm of the probability computed from the histogram counts.

Please note that these functions visualize the density and free energy of the sampled data, not the equilibrium distribution of the underlying system. To account for nonequiblibrium data, you can supply frame-wise weights using the weights parameter. This will be covered in a follow-up notebook.

Case 2: loading *.xtc files (alanine dipeptide)

To load molecular dynamics data from one of the standard file formats ( *.xtc), we need not only the actual simulation data, but a topology file, too. This might differ for other formats though.

pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')
print(pdb)
print(files)

We can have a look at the structure with the aid of NGLView. We load the PDB file into memory with mdtraj and visualize it. The widget will auto-close after 30 second; if you want to watch it again, please execute the cell below again.

widget = nglview.show_mdtraj(mdtraj.load(pdb))
p = TrajectoryPlayer(widget)
widget.add_ball_and_stick()
p.spin = True
def stop_spin():
    p.spin = False
    widget.close()
Timer(30, stop_spin).start()
widget

In PyEMMA, the featurizer is a central object that incorporates the system’s topology. We start by creating it object using the topology file. Features are now easily computed by adding the target feature. If no feature is added, the featurizer will extract Cartesian coordinates.

feat = pyemma.coordinates.featurizer(pdb)

Now we pass the featurizer to the load function to extract the Cartesian coordinates from the trajectory files into memory. For real world examples one would prefer the source function, because usually one has more data available than main memory in the workstation.

The warning about plain coordinates is triggered, because these coordinates will include diffusion as a dynamical process, which might not be what one is interested in. If the molecule of interest has been aligned to a reference prior the analysis, it is fine to use these coordinates, but we will see that there are better choices.

data = pyemma.coordinates.load(files, features=feat)
print('type of data:', type(data))
print('lengths:', len(data))
print('shape of elements:', data[0].shape)
print('n_atoms:', feat.topology.n_atoms)

Next, we start adding features which we want to extract from the simulation data. Here, we want to load the backbone torsions, because these angles are known to describe all flexibility in the system. Since this feature is two dimensional, it is also easier to visualize. Please note that in complex systems it is not trivial to visualize plain input features.

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)

Please note that the trajectories have been aligned to a reference structure before. Since in that case we loose track of the periodic box, we have to switch off the periodic flag for the torsion angle computations. By default PyEMMA assumes your simulation uses periodic boundary conditions.

We can always call the featurizer’s describe() method to show which features are requested. You might have noticed that you can combine arbitrary features by having multiple calls to add_ methods of the featurizer.

data = pyemma.coordinates.load(files, features=feat)

print('type of data:', type(data))
print('lengths:', len(data))
print('shape of elements:', data[0].shape)

After we have selected all desired features, we can call the load() function to load all features into memory or, alternatively, the source() function to create a streamed feature reader. For now, we will use load():

print(feat.describe())

Apparently, we have loaded a list of three two-dimensional numpy.ndarray objects from our three trajectory files. We can visualize these features using the aforementioned plotting functions, but to do so we have to concatenate the three individual trajectories:

data_concatenated = np.concatenate(data)
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);
../_images/01-data-io-and-featurization_25_0.png

We can now measure the quantity of kinetic variance of the just selected feature by computing a VAMP-2 score wu-17. This enables us to distinguish features on how well they might be suited for MSM building. The minimum value of this score is 1, which corresponds to the invariant measure or the observed equilibrium.

With the dimension parameter we specify the amount of dynamic processes that we want to score. This is of importance later on, when we want to compare different input features. If we did not fix this number, we would not have an upper bound for the score. Please also note that we split our available data into training and test sets, where we leave out the last file in training and then use it as test. This is an important aspect in practice to avoid over-fitting the score.

score_phi_psi = pyemma.coordinates.vamp(data[:-1], dim=2).score(
        test_data=data[-1],
        score_method='VAMP2')
print('VAMP2-score backbone torsions: {:.2f}'.format(score_phi_psi))

The score of \(1.5\) means that we have the constant of \(1\) plus a total contribution of \(0.5\) from the other dynamic processes.

We now use PyEMMA’s plot_density() and plot_free_energy() functions to create Ramachandran plots of our system:

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
# the * operator used in a function call is used to unpack
# the iterable variable into its single elements.
pyemma.plots.plot_density(*data_concatenated.T, ax=axes[0])
pyemma.plots.plot_free_energy(*data_concatenated.T, ax=axes[1], legacy=False)
for ax in axes.flat:
    ax.set_xlabel('$\Phi$')
    ax.set_aspect('equal')
axes[0].set_ylabel('$\Psi$')
fig.tight_layout()
../_images/01-data-io-and-featurization_29_0.png

We note that the distribution in backbone torsion space contains several basins that will be assigned to metastable states in follow-up notebooks. Again, the free energy plot only depicts a pseudo free energy surface of the sampled data and was not re-weighted to equilibrium.

Let us look at a different featurization example and load the positions of all heavy atoms instead. We create a new featurizer object and use its add_selection() method to request the positions of a given selection of atoms. For this selection, we can use the select_Heavy() method which returns the indices of all heavy atoms.

Again, we load the data into memory and show what we loaded using the describe() method:

feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Heavy())

data = pyemma.coordinates.load(files, features=feat)

feat.describe()
['ATOM:ACE 1 CH3 1 x',
 'ATOM:ACE 1 CH3 1 y',
 'ATOM:ACE 1 CH3 1 z',
 'ATOM:ACE 1 C 4 x',
 'ATOM:ACE 1 C 4 y',
 'ATOM:ACE 1 C 4 z',
 'ATOM:ACE 1 O 5 x',
 'ATOM:ACE 1 O 5 y',
 'ATOM:ACE 1 O 5 z',
 'ATOM:ALA 2 N 6 x',
 'ATOM:ALA 2 N 6 y',
 'ATOM:ALA 2 N 6 z',
 'ATOM:ALA 2 CA 8 x',
 'ATOM:ALA 2 CA 8 y',
 'ATOM:ALA 2 CA 8 z',
 'ATOM:ALA 2 CB 10 x',
 'ATOM:ALA 2 CB 10 y',
 'ATOM:ALA 2 CB 10 z',
 'ATOM:ALA 2 C 14 x',
 'ATOM:ALA 2 C 14 y',
 'ATOM:ALA 2 C 14 z',
 'ATOM:ALA 2 O 15 x',
 'ATOM:ALA 2 O 15 y',
 'ATOM:ALA 2 O 15 z',
 'ATOM:NME 3 N 16 x',
 'ATOM:NME 3 N 16 y',
 'ATOM:NME 3 N 16 z',
 'ATOM:NME 3 C 18 x',
 'ATOM:NME 3 C 18 y',
 'ATOM:NME 3 C 18 z']

Please note that PyEMMA has flattened the \(x, y\) and \(z\) coordinates into an array that will be used for further analysis. We visualize the distributions of all loaded features:

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_33_0.png

Again we have a look at the VAMP-2 score of the heavy atom coordinates.

score_heavy_atoms = pyemma.coordinates.vamp(data[:-1], dim=2).score(
        test_data=data[:-1],
        score_method='VAMP2')
print('VAMP2-score backbone torsions: {:f}'.format(score_phi_psi))
print('VAMP2-score xyz: {:f}'.format(score_heavy_atoms))

As we see, the score for the heavy atom positions is much higher as the one for the \(\phi/\psi\) torsion angles. The feature with a higher score should be favored for further analysis, because it means that this feature contains more information about slow processes. If you are already digging deeper into your system of interest, you can of course restrict the analysis to a set of features you already know describes your processes of interest, regardless of its VAMP score.

Another featurization that is interesting especially for proteins are pairwise heavy atom distances:

feat = pyemma.coordinates.featurizer(pdb)
heavy_atom_distance_pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(heavy_atom_distance_pairs, periodic=False)
data = pyemma.coordinates.load(files, features=feat)

feat.describe()
['DIST: ACE 1 CH3 1 - ACE 1 C 4',
 'DIST: ACE 1 CH3 1 - ACE 1 O 5',
 'DIST: ACE 1 CH3 1 - ALA 2 N 6',
 'DIST: ACE 1 CH3 1 - ALA 2 CA 8',
 'DIST: ACE 1 CH3 1 - ALA 2 CB 10',
 'DIST: ACE 1 CH3 1 - ALA 2 C 14',
 'DIST: ACE 1 CH3 1 - ALA 2 O 15',
 'DIST: ACE 1 CH3 1 - NME 3 N 16',
 'DIST: ACE 1 CH3 1 - NME 3 C 18',
 'DIST: ACE 1 C 4 - ACE 1 O 5',
 'DIST: ACE 1 C 4 - ALA 2 N 6',
 'DIST: ACE 1 C 4 - ALA 2 CA 8',
 'DIST: ACE 1 C 4 - ALA 2 CB 10',
 'DIST: ACE 1 C 4 - ALA 2 C 14',
 'DIST: ACE 1 C 4 - ALA 2 O 15',
 'DIST: ACE 1 C 4 - NME 3 N 16',
 'DIST: ACE 1 C 4 - NME 3 C 18',
 'DIST: ACE 1 O 5 - ALA 2 N 6',
 'DIST: ACE 1 O 5 - ALA 2 CA 8',
 'DIST: ACE 1 O 5 - ALA 2 CB 10',
 'DIST: ACE 1 O 5 - ALA 2 C 14',
 'DIST: ACE 1 O 5 - ALA 2 O 15',
 'DIST: ACE 1 O 5 - NME 3 N 16',
 'DIST: ACE 1 O 5 - NME 3 C 18',
 'DIST: ALA 2 N 6 - ALA 2 CA 8',
 'DIST: ALA 2 N 6 - ALA 2 CB 10',
 'DIST: ALA 2 N 6 - ALA 2 C 14',
 'DIST: ALA 2 N 6 - ALA 2 O 15',
 'DIST: ALA 2 N 6 - NME 3 N 16',
 'DIST: ALA 2 N 6 - NME 3 C 18',
 'DIST: ALA 2 CA 8 - ALA 2 CB 10',
 'DIST: ALA 2 CA 8 - ALA 2 C 14',
 'DIST: ALA 2 CA 8 - ALA 2 O 15',
 'DIST: ALA 2 CA 8 - NME 3 N 16',
 'DIST: ALA 2 CA 8 - NME 3 C 18',
 'DIST: ALA 2 CB 10 - ALA 2 C 14',
 'DIST: ALA 2 CB 10 - ALA 2 O 15',
 'DIST: ALA 2 CB 10 - NME 3 N 16',
 'DIST: ALA 2 CB 10 - NME 3 C 18',
 'DIST: ALA 2 C 14 - ALA 2 O 15',
 'DIST: ALA 2 C 14 - NME 3 N 16',
 'DIST: ALA 2 C 14 - NME 3 C 18',
 'DIST: ALA 2 O 15 - NME 3 N 16',
 'DIST: ALA 2 O 15 - NME 3 C 18',
 'DIST: NME 3 N 16 - NME 3 C 18']

Now let us compare the score of heavy atom distance pairs to the other scores.

score_pair_heavy_atom_dists = pyemma.coordinates.vamp(data[:-1], dim=2).score(
    test_data=data[-1],
    score_method='VAMP2')
print('VAMP2-score: {:.2f}'.format(score_pair_heavy_atom_dists))

Apparently, the heavy atom distance pairs cover an amount of kinetic variance which is comparable to the coordinates themselves. However we probably would not use much information by using this internal degree of freedom, while avoiding the need to align our trajectories first.

load() versus source()

Using load(), we put the full data into memory. This is possible for all examples in this tutorial.

Many real world apllications, though, require more memory than your workstation might provide. For these cases, you should use the source() function:

reader = pyemma.coordinates.source(files, features=feat)
print(reader)

This function creates a reader, wich allows to stream the underlying data in chunks instead of the full set. Most of the functions in the coordinates sub-package accept data in memory as well as readers. However, some plotting functions require the data to be in memory. To load a (sub-sampled) subset into memory, we can use the get_output() method with a stride parameter:

data_output = reader.get_output(stride=5)
len(data_output)
print('number of frames in first file: {}'.format(reader.trajectory_length(0)))
print('number of frames after striding: {}'.format(len(data_output[0])))

We now have loaded every fifth frame into memory. Again, we can visualize the (concatenated) features with a feature histogram:

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(
    np.concatenate(data_output), feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_45_0.png

Testing your progress

In the remainder of this notebook, you will find short exercises to test your newly learned skills. The exercises are announced by the keyword Exercise and followed by an incomplete cell. Missing parts are indicated by

#FIXME

Exercise cells come with a button (Show Solution) to reveal the solution.

Exercise 1a: inverse heavy atom distances

Please fix the following code block such that the inverse distances between all heavy atoms are loaded and visualized.

Hint: try to use the auto-complete feature on the feat object to gain some insight. Also take a look at the previous demonstrations.

feat = pyemma.coordinates.featurizer(pdb)
pairs = feat.pairs(feat.select_Heavy())
feat.add_inverse_distances(pairs)

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(
    np.concatenate(data), feature_labels=feat, ax=ax)

fig.tight_layout()
../_images/01-data-io-and-featurization_47_0.png

Excercise 1b: compare the inverse distances feature to the previously computed ones

Compute and discuss a cross-validated VAMP score for the inverse pairwise heavy atom distances and plot the result. What do you observe and which feature would you choose for further processing?

score_inv_dist = pyemma.coordinates.vamp(
    data[:-1], dim=2).score(test_data=data[-1])

fig, ax = plt.subplots(figsize=(10, 7))
score_mapping = dict(score_heavy_atoms=score_heavy_atoms,
                     score_phi_psi=score_phi_psi,
                     score_pair_heavy_atom_dists=score_pair_heavy_atom_dists,
                     score_inv_dist=score_inv_dist)
lbl = []
for i, (key, value) in enumerate(sorted(score_mapping.items(), key=lambda x: x[1])):
    ax.bar(i, height=value)
    lbl.append(key)
ax.set_xticks(np.arange(0, len(score_mapping), 1))
ax.set_xticklabels(lbl)
fig.tight_layout()

# inversing the feature preserves the amount of kinetic variance, we should continue
# with the pairwise heavy atom distances.
../_images/01-data-io-and-featurization_49_0.png

Solution

feat = pyemma.coordinates.featurizer(pdb)
pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(pairs, periodic=False)

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_51_0.png

Case 3: loading *.xtc files (pentapeptide)

Once we have obtained the raw data files…

pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

… and had a quick look at the structure again…

widget = nglview.show_mdtraj(mdtraj.load(pdb))
p = TrajectoryPlayer(widget)
widget.add_ball_and_stick()
p.spin = True
def stop_spin():
    p.spin = False
    widget.close()
Timer(30, stop_spin).start()
widget

… we can load a selection of features into memory. Here, we want the \(\cos/\sin\) transformations of the backbone and \(\chi_1\) sidechain torsions.

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(cossin=True, periodic=False)
feat.add_sidechain_torsions(which='chi1', cossin=True, periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

print(feat.describe())

Finally, we visualize the (concatenated) features:

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_59_0.png

Exercises: feature selection and visualization

Exercise 2

Complete the following code block to load/visualize the distances between all \(\text{C}_\alpha\) carbon atoms.

Hint: You might find the add_distances_ca() method of the featurizer object helpful.

Solution

feat = pyemma.coordinates.featurizer(pdb)
feat.add_distances_ca(periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_62_0.png

Exercise 3

Complete the following code block to load/visualize the minimal distances between all residues (excluding first and second neighbors).

Hint: You might find the add_residue_mindist() method of the featurizer object helpful.

Solution

feat = pyemma.coordinates.featurizer(pdb)
feat.add_residue_mindist(periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_65_2.png

Exercise 4

Complete the following code block to load/visualize the position of all backbone atoms.

Hint: You might find the select_Backbone() method of the featurizer object helpful.

Solution

feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Backbone())

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 12))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_68_0.png

Exercise 5

Complete the following code block to load/visualize the position of all \(\text{C}_\alpha\) atoms.

Hint: You might find the select_Ca() method of the featurizer object helpful.

Solution

feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Ca())

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()
../_images/01-data-io-and-featurization_71_0.png

Wrapping up

In this notebook, we have learned how to load and visualize molecular simulation data with PyEMMA. In detail, we have used - pyemma.coordinates.featurizer() to define a selection of features we want to extract, - pyemma.coordinates.load() to load data into memory, and - pyemma.coordinates.source() to create a streamed feature reader in case the data does not fit into memory.

After loading the data into memory, we have used - pyemma.coordinates.vamp().score() to score the quality of the features, - pyemma.plots.plot_feature_histograms() to show the distributions of all loaded features, - pyemma.plots.plot_density() to visualize the sample density, and - pyemma.plots.plot_free_energy() to visualize the free energy surface of two selected features.

References

[^]Wu, H. and Noé, F.. 2017. Variational approach for learning Markov processes from time series data.