05 - PCCA and TPT analysis
==========================

In this notebook, we will cover how to use PCCA++ to extract a coarse
representation of the MSM. We will further investigate how to use TPT to
follow the pathways of the processes. Here you can find literature on
the used methods:

-  roeblitz-weber-14
-  weinan-06
-  metzner-09

Maintainers: [@cwehmeyer](https://github.com/cwehmeyer),
[@marscher](https://github.com/marscher),
[@thempel](https://github.com/thempel),
[@psolsson](https://github.com/psolsson)

**Remember**: - to run the currently highlighted cell, hold ⇧ Shift and
press ⏎ Enter; - to get help for a specific function, place the cursor
within the function’s brackets, hold ⇧ Shift, and press ⇥ Tab; - you can
find the full documentation at `PyEMMA.org <http://www.pyemma.org>`__.

--------------

**Note:** We have assigned the integer numbers $1 :raw-latex:`\dots `$
``nstates`` to PCCA++ metastable states. As PyEMMA is written in Python,
it internally indexes states starting from :math:`0`. In consequence,
numbers in the code cells differ by :math:`-1` from the plot labels and
markdown text.

.. code:: ipython3

    %matplotlib inline
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import numpy as np
    import mdshare
    import pyemma

Case 1: preprocessed, two-dimensional data (toy model)
------------------------------------------------------

We start by loading the data and the previously analyzed MSM from disk:

.. code:: ipython3

    file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
    with np.load(file) as fh:
        data = fh['trajectory']
    
    msm = pyemma.load('nb4.pyemma', model_name='doublewell_msm')
    bayesian_msm = pyemma.load('nb4.pyemma', model_name='doublewell_bayesian_msm')
    cluster = pyemma.load('nb4.pyemma', model_name='doublewell_cluster')

We currently have an MSM with 50 discrete states which was validated for
two metastable states in the previous notebook. Internally, the
metastable states have been computed using the Perron Cluster Cluster
Analysis (PCCA++) method roeblitz-14. Let’s analyze this in more detail
here. We can explicitly compute it by calling ``msm.pcca()``:

.. code:: ipython3

    nstates = 2
    msm.pcca(nstates)


.. parsed-literal::

    PCCA(P=array([[0.03604, 0.00458, ..., 0.00057, 0.02803],
           [0.00245, 0.06959, ..., 0.00996, 0.00215],
           ...,
           [0.00208, 0.06757, ..., 0.0104 , 0.00104],
           [0.02606, 0.00372, ..., 0.00027, 0.03351]]),
       m=2)


PCCA++ computes membership distributions, i.e. probabilities of
micro-states to belong to the same metastable state. It does so by using
the properties of slow processes in eigenvector space. Let us visualize
the membership distributions in the same fashion as before:

.. code:: ipython3

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for i, ax in enumerate(axes.flat):
        pyemma.plots.plot_contour(
            *data.T, msm.metastable_distributions[i][cluster.dtrajs[0]], ax=ax, cmap='afmhot_r', 
            mask=True, method='nearest', cbar_label='metastable distribution {}'.format(i + 1))
        ax.scatter(*cluster.clustercenters.T, s=15, c='k')
        ax.set_xlabel('$x$')
        ax.set_xlim(-4, 4)
        ax.set_ylim(-4, 4)
        ax.set_aspect('equal')
    axes[0].set_ylabel('$y$')
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_7_0.png


As expected, PCCA++ has assigned metastable states to the basins of the
double well. Since PCCA++, in simplified words, does a clustering in
eigenvector space and the first eigenvector separated these states
already, the nice separation comes to no surprise.

It is important to note, though, that PCCA++ in general does not yield a
coarse transition matrix. How to obtain this will be covered in the HMM
notebook. However, we can compute mean first passage times (MFPTs) and
equilibrium probabilities on the metastable sets and extract
representative structures.

The stationary probability of metastable states can simply be computed
by summing over all of its micro-states. (Please note that the PCCA
object returned by ``msm.pcca()`` also has a convenience function to do
that.)

.. code:: ipython3

    for i, s in enumerate(msm.metastable_sets):
        print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))

We use the ``mfpt()`` method of the original MSM object to compute MFPTs
between pairs of metastable sets (accessible via the ``metastable_sets``
attribute of the MSM object).

.. code:: ipython3

    mfpt = np.zeros((nstates, nstates))
    for i in range(nstates):
        for j in range(nstates):
            mfpt[i, j] = msm.mfpt(
                msm.metastable_sets[i],
                msm.metastable_sets[j])
    
    from pandas import DataFrame
    print('MFPT / steps:')
    DataFrame(np.round(mfpt, decimals=2), index=range(1, nstates + 1), columns=range(1, nstates + 1))


.. raw:: html

    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }
    
        .dataframe tbody tr th {
            vertical-align: top;
        }
    
        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>1</th>
          <th>2</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>1</th>
          <td>0.00</td>
          <td>16.08</td>
        </tr>
        <tr>
          <th>2</th>
          <td>20.49</td>
          <td>0.00</td>
        </tr>
      </tbody>
    </table>
    </div>


As described above, the errors can be estimated from the Bayesian MSM.
Instead of just printing means and confidence intervals, let’s compute
the samples explicitly and histogram them.

.. code:: ipython3

    mfpt_sample = np.zeros((nstates, nstates, bayesian_msm.nsamples))
    for i in range(nstates):
        for j in range(nstates):
            mfpt_sample[i, j] = bayesian_msm.sample_f(
                'mfpt',
                msm.metastable_sets[i],
                msm.metastable_sets[j])
    
    fig, ax = plt.subplots()
    ax.hist(mfpt_sample[0, 1], histtype='step', label='MS 1 -> MS 2', density=True)
    ax.hist(mfpt_sample[1, 0], histtype='step', label='MS 2 -> MS 1', density=True)
    ax.set_xlabel('MFPT (steps)')
    ax.set_title('Bayesian MFPT sample histograms')
    fig.legend(loc=10);


.. image:: 05-pcca-tpt_files/05-pcca-tpt_13_0.png


We clearly see that there is no overlap of the distributions
approximated by the Bayesian MSM.

To do a more detailed analysis of the transition paths, we make use of
transition path theory (TPT) in its MSM formulation. We first analyze
the flux between the two metastable sets:

.. code:: ipython3

    A = msm.metastable_sets[0]
    B = msm.metastable_sets[1]
    flux = pyemma.msm.tpt(msm, A, B)

In TPT, many properties are derived from the committor functions. They
describe the probability of reaching a set :math:`A` before set
:math:`B` as a function of some state :math:`x`. In order to understand
this, we plot the committor of the previously defined sets as a function
of the cluster centers:

.. code:: ipython3

    fig, ax = plt.subplots(figsize=(5, 4))
    pyemma.plots.plot_contour(
        *data.T,
        flux.committor[cluster.dtrajs[0]],
        ax=ax,
        cmap='brg', 
        mask=True,
        cbar_label=r'committor A $\to$ B')
    ax.set_xlabel('$x$')
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)
    ax.set_aspect('equal')
    ax.set_ylabel('$y$')
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_17_0.png


We see that the committor for the double well data approximates a step
function between the two basins. In other words, the probability of
transitioning from metastable state :math:`A` to :math:`B` is only
:math:`1` if we already are in state :math:`B`. If we are in :math:`A`,
it is :math:`0` by definition. The clustering did not resolve the
transition region, so this particular example does not provide more
information. In the next example we will see more.

Case 2: low-dimensional molecular dynamics data (alanine dipeptide)
-------------------------------------------------------------------

Again, we load the model that we have estimated previously.

.. code:: ipython3

    pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
    files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')
    
    feat = pyemma.coordinates.featurizer(pdb)
    feat.add_backbone_torsions(periodic=False)
    data = pyemma.coordinates.load(files, features=feat)
    data_concatenated = np.concatenate(data)
    
    msm = pyemma.load('nb4.pyemma', model_name='ala2_msm')
    bayesian_msm = pyemma.load('nb4.pyemma', model_name='ala2_bayesian_msm')
    cluster = pyemma.load('nb4.pyemma', model_name='ala2_cluster')
    
    dtrajs_concatenated = np.concatenate(cluster.dtrajs)

In the previous notebook, we saw that four metastable states are a
reasonable choice for our MSM. We thus perform PCCA++ with this number
of states for further analysis and print out the stationary
probabilities of the metastable sets:

.. code:: ipython3

    nstates = 4
    msm.pcca(nstates)
    for i, s in enumerate(msm.metastable_sets):
        print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))

We visualize the metastable memberships:

.. code:: ipython3

    fig, axes = plt.subplots(1, 4, figsize=(15, 3))
    for i, ax in enumerate(axes.flat):
        pyemma.plots.plot_contour(
            *data_concatenated.T,
            msm.metastable_distributions[i][dtrajs_concatenated],
            ax=ax,
            cmap='afmhot_r', 
            mask=True,
            cbar_label='metastable distribution {}'.format(i + 1))
        ax.set_xlabel('$\Phi$')
    axes[0].set_ylabel('$\Psi$')
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_23_0.png


PCCA++ nicely separates the high-density regions and we find that each
of the basins was assigned a metastable set. This indicates that our
projection indeed describes the slow dynamics.

We concatenate all three discrete trajectories and obtain a single
trajectory of metastable states which we use to visualize the metastable
state memberships of all datapoints. We further compute the state with
the highest membership to a PCCA metastable state to plot a state label
there:

.. code:: ipython3

    metastable_traj = msm.metastable_assignments[dtrajs_concatenated]
    highest_membership = msm.metastable_distributions.argmax(1)
    coarse_state_centers = cluster.clustercenters[msm.active_set[highest_membership]]

Now, we use the ``mfpt()`` method of the MSM object to compute MFPTs
between pairs of metastable sets and compute the inverse MFPTs for
visualization purposes:

.. code:: ipython3

    mfpt = np.zeros((nstates, nstates))
    for i in range(nstates):
        for j in range(nstates):
            mfpt[i, j] = msm.mfpt(
                msm.metastable_sets[i],
                msm.metastable_sets[j])
    
    inverse_mfpt = np.zeros_like(mfpt)
    nz = mfpt.nonzero()
    inverse_mfpt[nz] = 1.0 / mfpt[nz]

We visualize our model in backbone torsion space:

.. code:: ipython3

    fig, ax = plt.subplots(figsize=(10, 7))
    _, _, misc = pyemma.plots.plot_state_map(
        *data_concatenated.T, metastable_traj, ax=ax, zorder=-1)
    misc['cbar'].set_ticklabels(range(1, nstates + 1))  # set state numbers 1 ... nstates
    
    pyemma.plots.plot_network(
        inverse_mfpt,
        pos=coarse_state_centers,
        figpadding=0,
        arrow_label_format='%.1f ps',
        arrow_labels=mfpt,
        size=12,
        show_frame=True,
        ax=ax)
    
    ax.set_xlabel('$\Phi$')
    ax.set_ylabel('$\Psi$')
    ax.set_xlim(-np.pi, np.pi)
    ax.set_ylim(-np.pi, np.pi)
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_29_0.png


Have you noticed how well the metastable state coloring agrees with the
eigenvector visualization of the three slowest processes?

If we could afford a shorter lag time, we might even be able to resolve
more processes and, thus, subdivide the metastable states three (fifth
slowest process) and zero (sixth slowest process).

Now we define a small function to visualize samples of metastable states
with NGLView.

.. code:: ipython3

    def visualize_metastable(samples, cmap, selection='backbone'):
        """ visualize metastable states
        Parameters
        ----------
        samples: list of mdtraj.Trajectory objects
            each element contains all samples for one metastable state.
        cmap: matplotlib.colors.ListedColormap
            color map used to visualize metastable states before.
        selection: str
            which part of the molecule to selection for visualization. For details have a look here:
            http://mdtraj.org/latest/examples/atom-selection.html#Atom-Selection-Language
        """
        import nglview
        from matplotlib.colors import to_hex
    
        widget = nglview.NGLWidget()
        widget.clear_representations()
        ref = samples[0]
        for i, s in enumerate(samples):
            s = s.superpose(ref)
            s = s.atom_slice(s.top.select(selection))
            comp = widget.add_trajectory(s)
            comp.add_ball_and_stick()
    
        # this has to be done in a separate loop for whatever reason...
        x = np.linspace(0, 1, num=len(samples))
        for i, x_ in enumerate(x):
            c = to_hex(cmap(x_))
            widget.update_ball_and_stick(color=c, component=i, repr_index=i)
            widget.remove_cartoon(component=i)
        return widget

We now sample some representative structures and visualize these with
the aid of NGLView. For the sake of clarity, we draw only the backbone
atoms. Since we have obtained several samples for each metastable state,
you can click the play button to iterate over all samples. For each
iteration, the samples of all four states will be drawn. You can double
click the molecule to show it at full screen. Press escape to go back.

.. code:: ipython3

    cmap = mpl.cm.get_cmap('viridis', nstates)
    
    my_samples = [pyemma.coordinates.save_traj(files, idist, outfile=None, top=pdb)
                  for idist in msm.sample_by_distributions(msm.metastable_distributions, 50)]
    
    visualize_metastable(my_samples, cmap, selection='backbone')

Coming back to TPT, we now have more than two metastable states and can
expect more insights from analyzing the transition paths. As an example,
we will focus on the committor between metastable sets :math:`0` and
:math:`3` as defined above.

.. code:: ipython3

    A = msm.metastable_sets[0]
    B = msm.metastable_sets[3]
    flux = pyemma.msm.tpt(msm, A, B)

Before we go on with the visualization, let’s coarse grain the flux with
the metastable sets estimated with PCCA++:

.. code:: ipython3

    cg, cgflux = flux.coarse_grain(msm.metastable_sets)

We now show an overlay of the committor probabilities and the most
likely transition path from the coarse graining TPT:

.. code:: ipython3

    fig, ax = plt.subplots(figsize=(10, 7))
    
    pyemma.plots.plot_contour(
        *data_concatenated.T,
        flux.committor[dtrajs_concatenated],
        cmap='brg',
        ax=ax,
        mask=True,
        cbar_label=r'committor 1 $\to$ 4',
        alpha=0.8,
        zorder=-1);
    
    pyemma.plots.plot_flux(
        cgflux,
        coarse_state_centers,
        cgflux.stationary_distribution,
        state_labels=['A','' ,'', 'B'], 
        ax=ax,
        show_committor=False,
        figpadding=0,
        show_frame=True,
        arrow_label_format='%2.e / ps');
    
    ax.set_xlabel('$\Phi$')
    ax.set_ylabel('$\Psi$')
    ax.set_xlim(-np.pi, np.pi)
    ax.set_ylim(-np.pi, np.pi)
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_39_0.png


First, the color map shows us a region with committor probability
:math:`\approx 0.5`. This indicates that this particular metastable
state is a transition state in the pathway from :math:`A` to :math:`B`.
Second, the ``plot_flux()`` function displays the most likely transition
pathway along this path. There are other, less likely pathways included
in the plot as well. The arrow thickness indicates the flux between the
states.

We can decompose the flux into these individual pathways by:

.. code:: ipython3

    paths, path_fluxes = cgflux.pathways(fraction=0.99)
    print('percentage       \tpath')
    print('-------------------------------------')
    for i in range(len(paths)):
        print(np.round(path_fluxes[i] / np.sum(path_fluxes), 3),' \t', paths[i] + 1)

As expected, about :math:`85\%` of the flux goes through only one
pathway. To get a cleaner picture, the ``plot_flux()`` function supports
a ``minflux`` keyword argument that can be increased to exclude very low
fluxes from the plot.

Exercise 1
^^^^^^^^^^

Define a ``featurizer`` that loads the heavy atom coordinates and load
the data into memory. Also load the TICA object from the previous
notebook to transform the featurized data. Further, estimated MSM,
Bayesian MSM, and Cluster objects should be loaded from disk.

Solution
        

.. code:: ipython3

    feat = pyemma.coordinates.featurizer(pdb)
    pairs = feat.pairs(feat.select_Heavy())
    feat.add_distances(pairs, periodic=False)
    data = pyemma.coordinates.load(files, features=feat)
    
    tica = pyemma.load('nb4.pyemma', model_name='ala2tica_tica')
    tica_output = tica.transform(data)
    tica_concatenated = np.concatenate(tica_output)
    
    msm = pyemma.load('nb4.pyemma', model_name='ala2tica_msm')
    bayesian_msm = pyemma.load('nb4.pyemma', model_name='ala2tica_bayesian_msm')
    
    cluster = pyemma.load('nb4.pyemma', model_name='ala2tica_cluster')
    dtrajs_concatenated = np.concatenate(cluster.dtrajs)

Exercise 2
^^^^^^^^^^

Do a PCCA++ analysis of the MSM with four metastable states, compute the
probability of the metastable sets, and visualize the metastable state
memberships.

Solution
        

.. code:: ipython3

    nstates = 4
    msm.pcca(nstates)
    
    for i, s in enumerate(msm.metastable_sets):
        print('π_{} = {:f}'.format(i + 1, msm.pi[s].sum()))
    
    fig, axes = plt.subplots(1, 4, figsize=(15, 3))
    for i, ax in enumerate(axes.flat):
        pyemma.plots.plot_contour(
            *tica_concatenated.T,
            msm.metastable_distributions[i][dtrajs_concatenated],
            ax=ax,
            cmap='afmhot_r', 
            mask=True,
            cbar_label='metastable distribution {}'.format(i + 1))
        ax.set_xlabel('IC 1')
    axes[0].set_ylabel('IC 2')
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_48_1.png


Did you guess the metastable states correctly?

Note the similarities between the MSM built from the backbone torsions
and the MSM built from the TICA projection of heavy atom distances. Even
though we started from different features, both models found the same
kinetic information in the data.

Exercise 3
^^^^^^^^^^

Compute the pairwise MFPTs and transition rates, and visualize the
resulting kinetic network.

Solution
        

.. code:: ipython3

    mfpt = np.zeros((nstates, nstates))
    for i in range(nstates):
        for j in range(nstates):
            mfpt[i, j] = msm.mfpt(
                msm.metastable_sets[i],
                msm.metastable_sets[j])
    
    inverse_mfpt = np.zeros_like(mfpt)
    nz = mfpt.nonzero()
    inverse_mfpt[nz] = 1.0 / mfpt[nz]
    
    pyemma.plots.plot_network(
        inverse_mfpt,
        pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
        arrow_label_format='%.1f ps',
        arrow_labels=mfpt,
        arrow_scale=3.0,
        state_labels=range(1, nstates + 1),
        size=12);


.. image:: 05-pcca-tpt_files/05-pcca-tpt_51_0.png


Exercise 4
^^^^^^^^^^

Compute the TPT object, coarse grain it onto the PCCA++ metastable sets
and visualize the flux along with the committor probabilities.

Solution
        

.. code:: ipython3

    A = msm.metastable_sets[0]
    B = msm.metastable_sets[3]
    flux = pyemma.msm.tpt(msm, A, B)
    
    cg, cgflux = flux.coarse_grain(msm.metastable_sets)
    
    highest_membership = msm.metastable_distributions.argmax(1)
    coarse_state_centers = cluster.clustercenters[msm.active_set[highest_membership]]
    
    fig, ax = plt.subplots(figsize=(10, 7))
    
    pyemma.plots.plot_contour(
        *tica_concatenated.T,
        flux.committor[dtrajs_concatenated],
        cmap='brg',
        ax=ax,
        mask=True,
        cbar_label=r'committor 1 $\to$ 4',
        zorder=-1)
    
    pyemma.plots.plot_flux(
        cgflux,
        coarse_state_centers,
        cgflux.stationary_distribution,
        ax=ax,
        show_committor=False,
        figpadding=0.2,
        state_labels=['A', '', '', 'B'],
        arrow_label_format='%2.e / ps')
    
    ax.set_xlabel('IC 1')
    ax.set_ylabel('IC 2')
    ax.set_aspect('equal')
    ax.set_xlim(tica_concatenated[:, 0].min(), tica_concatenated[:, 0].max())
    ax.set_ylim(tica_concatenated[:, 1].min(), tica_concatenated[:, 1].max())
    fig.tight_layout()


.. image:: 05-pcca-tpt_files/05-pcca-tpt_54_0.png


Wrapping up
-----------

In this notebook, we have learned how to use PCCA++ using an existing
MSM and how to extract kinetic information from the model. In detail, we
have used - the ``pcca()`` method of an MSM object to find metastable
states, - the ``mfpt()`` method of an MSM object to compute mean first
passage times between metastable states which, in turn, are accessible
via - the ``metastable_sets`` and ``metastable_assignments`` attributes
of an MSM object.

For visualizing MSMs or kinetic networks we used -
``pyemma.plots.plot_density()``, ``pyemma.plots.plot_contour()`` and -
``pyemma.plots.plot_network()``.

References
----------

[:sup:`][`]Susanna Röblitz and Marcus Weber. 2013. *Fuzzy spectral
clustering by PCCA+: application to Markov state models and data
classification*. `URL <https://doi.org/10.1007/s11634-013-0134-6>`__

[^]Weinan E. and Eric Vanden-Eijnden. 2006. *Towards a Theory of
Transition Paths*. `URL <https://doi.org/10.1007/s10955-005-9003-9>`__

[^]Philipp Metzner and Christof Schütte and Eric Vanden-Eijnden. 2009.
*Transition Path Theory for Markov Jump Processes*.
`URL <https://doi.org/10.1137/070699500>`__