Save your models in PyEMMA
==========================

Most of the Estimators and Models in PyEMMA are serializable. If a
given Estimator or Model can be saved to disk, it provides a **save**
method. In this notebook we explain the basic concepts of file
handling.

We try our best to provide **future** compatibility of already saved
data. This means it should always be possible to load data with a
newer version of the software, but you cannot do the reverse, e.g.
load a model saved by a new version with an old version of PyEMMA. If
you are interested in the technical background, go ahead and read the
source code (it is actually not that much).

.. code:: ipython3

    import pyemma
    import numpy as np
    import os
    import pprint

    pyemma.config.mute = True

.. code:: ipython3

    # delete all previously saved models
    def rm_models():
        import glob
        for f in glob.glob('*.h5'):
            os.unlink(f)

    rm_models()

.. code:: ipython3

    # generate some artificial data with 10 states
    dtrajs = [np.random.randint(0, 10, size=10000) for _ in range(5)]

.. code:: ipython3

    # estimate a Bayesian Markov state model
    bmsm = pyemma.msm.bayesian_markov_model(dtrajs, lag=10)
    print(bmsm)

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

We can now save the estimator (which contains the model) to disk.

.. code:: ipython3

    # now save our model
    bmsm.save('my_models.h5')

We can restore the model by simply invoking the pyemma.load function
with our file name.

.. code:: ipython3

    pyemma.load('my_models.h5')

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

Note that we can save multiple models in one file.
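Conceptually, the file then behaves like a small key-value store of named
models. The following is only a rough sketch of that idea using plain
``pickle`` and a dict, not PyEMMA's actual HDF5 machinery; the helper
names ``save_model`` and ``load_model`` are made up for illustration:

```python
import os
import pickle


def save_model(path, model, model_name='default', overwrite=False):
    """Toy sketch: store `model` under `model_name` inside one file.

    Mirrors the behaviour described in this notebook: several models can
    live in one file, and clobbering an existing name requires an
    explicit overwrite=True.
    """
    store = {}
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            store = pickle.load(fh)
    if model_name in store and not overwrite:
        raise RuntimeError('model "%s" already exists. Either use '
                           'overwrite=True, or use a different name/file.'
                           % model_name)
    store[model_name] = model
    with open(path, 'wb') as fh:
        pickle.dump(store, fh)


def load_model(path, model_name='default'):
    """Toy sketch: fetch one named model back out of the file."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)[model_name]
```

The real implementation stores each model in its own HDF5 group along
with metadata (creation time, PyEMMA version, digest), but the naming and
overwrite semantics follow this pattern.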
Because HDF5 acts like a file system, each model lives in a separate
"folder", completely independent of the other models. We now change a
parameter during estimation and save the estimator again in the same
file, but in a different "folder".

.. code:: ipython3

    bmsm.estimate(dtrajs, lag=100)
    print(bmsm)
    bmsm.save('my_models.h5', model_name='lag100')

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

Likewise, when we want to restore the model under its new name, we
have to pass that name to the load function.

.. code:: ipython3

    pyemma.load('my_models.h5', model_name='lag100')

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

As you may have noted, there was no need to pass a model name for the
first save and load. For convenience, the model is stored under the
model_name "default" if the argument is not provided.

To check which models are contained in a file, we provide a command
line tool named "pyemma_list_models".

.. code:: ipython3

    ! pyemma_list_models

.. parsed-literal::

    usage: pyemma_list_models [-h] [--json] [--recursive] [-v] files [files ...]
    pyemma_list_models: error: the following arguments are required: files

.. code:: ipython3

    ! pyemma_list_models my_models.h5

.. parsed-literal::

    PyEMMA models
    =============
    file: my_models.h5
    --------------------------------------------------------------------------------
    1. name: default
       created: Tue Apr 10 18:52:26 2018
       BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                   dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
                   nsteps=3, reversible=True, show_progress=False, sparse=False,
                   statdist_constraint=None)
    2. name: lag100
       created: Tue Apr 10 18:52:27 2018
       BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                   dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
                   nsteps=3, reversible=True, show_progress=False, sparse=False,
                   statdist_constraint=None)
    --------------------------------------------------------------------------------

You can also check the list of already stored models directly in
PyEMMA.

.. code:: ipython3

    content = pyemma.list_models('my_models.h5')
    print("available models:", content.keys())
    print("-" * 80)
    print("detailed:")
    pprint.pprint(content)

.. parsed-literal::

    available models: dict_keys(['default', 'lag100'])
    --------------------------------------------------------------------------------
    detailed:
    {'default': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                               "count_mode='effective',\n"
                               "            dt_traj='1 step', lag=10, "
                               "mincount_connectivity='1/n', nsamples=100,\n"
                               '            nsteps=3, reversible=True, '
                               'show_progress=False, sparse=False,\n'
                               '            statdist_constraint=None)',
                 'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                              "count_mode='effective',\n"
                              "            dt_traj='1 step', lag=10, "
                              "mincount_connectivity='1/n', nsamples=100,\n"
                              '            nsteps=3, reversible=True, '
                              'show_progress=False, sparse=False,\n'
                              '            statdist_constraint=None)',
                 'created': 1523379146.8274236,
                 'created_readable': 'Tue Apr 10 18:52:26 2018',
                 'digest': '88b20f42dcdd39947b4d0b04794b54695b84833eb408295338d41181f0cab46b',
                 'pyemma_version': '2.5.2',
                 'saved_streaming_chain': False},
     'lag100': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                              "count_mode='effective',\n"
                              "            dt_traj='1 step', lag=100, "
                              "mincount_connectivity='1/n', nsamples=100,\n"
                              '            nsteps=3, reversible=True, '
                              'show_progress=False, sparse=False,\n'
                              '            statdist_constraint=None)',
                'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                             "count_mode='effective',\n"
                             "            dt_traj='1 step', lag=100, "
                             "mincount_connectivity='1/n', nsamples=100,\n"
                             '            nsteps=3, reversible=True, '
                             'show_progress=False, sparse=False,\n'
                             '            statdist_constraint=None)',
                'created': 1523379147.449291,
                'created_readable': 'Tue Apr 10 18:52:27 2018',
                'digest': '23594166df674205f8dcb5ed9d5c7b64ac8ffd942e00fde91ba26c73cae6e7fe',
                'pyemma_version': '2.5.2',
                'saved_streaming_chain': False}}

Overwriting existing models is also possible, but we have to tell the
save method explicitly that we want to overwrite.

.. code:: ipython3

    # we now expect a failure, because the model already exists in the file
    try:
        bmsm.save('my_models.h5')
    except RuntimeError as e:
        print("can not save:", e)

.. parsed-literal::

    can not save: model "default" already exists. Either use overwrite=True, or use a different name/file.

.. code:: ipython3

    bmsm.save('my_models.h5', overwrite=True)

Save Pipelines
--------------

In pyemma.coordinates one often has chains of Estimators, e.g. a
reader followed by some transformations and finally a clustering. If
you want to preserve this definition of the data flow, you can request
this during save.

.. code:: ipython3

    # create some data; in principle, this could also be a FeatureReader
    # used to process MD data.
    data = np.random.random((1000, 2))

    from pyemma.coordinates import source, tica, cluster_kmeans

    reader = source(data)
    tica_obj = tica(reader, lag=10)  # named tica_obj to avoid shadowing the tica function
    clust = cluster_kmeans(tica_obj)

    print('clustering:', clust)
    print('tica:', tica_obj)
    print('source:', reader)

.. parsed-literal::

    clustering: KmeansClustering(clustercenters=array([[-0.04867, -0.00542],
           [ 0.04776, -0.00615],
           [ 0.00866,  0.02063],
           [-0.00727, -0.02088],
           [-0.03013,  0.01751],
           [ 0.03868,  0.03487],
           [ 0.01011, -0.04399],
           [-0.07364,  0.0013 ],
           [ 0.07898,  0.00271],
           [ 0.0187 , -0.016...
           [-0.04348,  0.02851],
           [-0.0293 , -0.03614],
           [-0.02583,  0.03866]], dtype=float32),
             fixed_seed=1206330312, init_strategy='kmeans++', keep_data=False,
             max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
             oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
    tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
       ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
       weights=None)
    source: DataInMemory(data=[array([[ 0.0849984 ,  0.75388626],
           [ 0.76335831,  0.23260134],
           [ 0.20315025,  0.95727845],
           ...,
           [ 0.82022961,  0.42732264],
           [ 0.95787558,  0.24492272],
           [ 0.40078893,  0.32759295]])], chunksize=16777216)

The setting "save_streaming_chain" controls whether the input chain of
the object being saved is stored as well.

.. code:: ipython3

    clust.save('pipeline.h5', save_streaming_chain=True)

The list models tool also shows the saved chain in a human-readable
fashion.

.. code:: ipython3

    ! pyemma_list_models pipeline.h5

.. parsed-literal::

    PyEMMA models
    =============
    file: pipeline.h5
    --------------------------------------------------------------------------------
    1. name: default
       created: Tue Apr 10 18:52:30 2018
       KmeansClustering(clustercenters=array([[-0.04867, -0.00542],
           [ 0.04776, -0.00615],
           [ 0.00866,  0.02063],
           [-0.00727, -0.02088],
           [-0.03013,  0.01751],
           [ 0.03868,  0.03487],
           [ 0.01011, -0.04399],
           [-0.07364,  0.0013 ],
           [ 0.07898,  0.00271],
           [ 0.0187 , -0.016...
           [-0.04348,  0.02851],
           [-0.0293 , -0.03614],
           [-0.02583,  0.03866]], dtype=float32),
             fixed_seed=1206330312, init_strategy='kmeans++', keep_data=False,
             max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
             oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
    ---------Input chain---------
    1. DataInMemory(data=[array([[ 0.0849984 ,  0.75388626],
           [ 0.76335831,  0.23260134],
           [ 0.20315025,  0.95727845],
           ...,
           [ 0.82022961,  0.42732264],
           [ 0.95787558,  0.24492272],
           [ 0.40078893,  0.32759295]])], chunksize=16777216)
    2. TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
       ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
       weights=None)
    --------------------------------------------------------------------------------

.. code:: ipython3

    restored = pyemma.load('pipeline.h5')

    print('clustering:', restored)
    print('tica:', restored.data_producer)
    print('source:', restored.data_producer.data_producer)

.. parsed-literal::

    clustering: KmeansClustering(clustercenters=array([[-0.04867, -0.00542],
           [ 0.04776, -0.00615],
           [ 0.00866,  0.02063],
           [-0.00727, -0.02088],
           [-0.03013,  0.01751],
           [ 0.03868,  0.03487],
           [ 0.01011, -0.04399],
           [-0.07364,  0.0013 ],
           [ 0.07898,  0.00271],
           [ 0.0187 , -0.016...
           [-0.04348,  0.02851],
           [-0.0293 , -0.03614],
           [-0.02583,  0.03866]], dtype=float32),
             fixed_seed=1206330312, init_strategy='kmeans++', keep_data=False,
             max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
             oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
    tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
       ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
       weights=None)
    source: DataInMemory(data=[array([[ 0.0849984 ,  0.75388626],
           [ 0.76335831,  0.23260134],
           [ 0.20315025,  0.95727845],
           ...,
           [ 0.82022961,  0.42732264],
           [ 0.95787558,  0.24492272],
           [ 0.40078893,  0.32759295]])], chunksize=16777216)

As you can see, we can access all elements of the pipeline via the
data_producer attribute. In principle, we can assign these elements to
variables again, change estimation parameters, and re-estimate parts
of the pipeline.

This concludes the storage tutorial of PyEMMA. Happy saving!
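The pattern above relies only on each pipeline element keeping a
reference to its input. The following minimal sketch illustrates that
linkage with a toy ``Stage`` class (a made-up stand-in, not the actual
PyEMMA classes):

```python
class Stage:
    """Toy stand-in for a pipeline element that remembers its input."""

    def __init__(self, name, data_producer=None):
        self.name = name
        # None marks the reader at the head of the chain
        self.data_producer = data_producer


def chain_of(last_stage):
    """Walk from the final stage back to the reader and collect names."""
    names = []
    node = last_stage
    while node is not None:
        names.append(node.name)
        node = node.data_producer
    return names


# build a chain analogous to reader -> tica -> kmeans
reader = Stage('source')
tica_stage = Stage('tica', reader)
clustering = Stage('kmeans', tica_stage)

print(chain_of(clustering))  # ['kmeans', 'tica', 'source']
```

Saving with save_streaming_chain=True essentially persists these
backlinks along with the final object, which is why a single load call
can return the whole pipeline.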