Save your models in PyEMMA
==========================

Most of the Estimators and Models in PyEMMA are serializable. If a
given Estimator or Model can be saved to disk, it provides a **save**
method. In this notebook we explain the basic concepts of file
handling.

We try our best to provide **future** compatibility of already saved
data. This means it should always be possible to load data with a
newer version of the software, but you cannot do the reverse, e.g.
load a model saved by a new version with an old version of PyEMMA. If
you are interested in the technical background, go ahead and read the
source code (it is actually not that much).

.. code:: ipython3

    import pyemma
    import numpy as np
    import os
    import pprint

    pyemma.config.mute = True

.. code:: ipython3

    # delete all previously saved models
    def rm_models():
        import glob
        for f in glob.glob('*.h5'):
            os.unlink(f)

    rm_models()

.. code:: ipython3

    # generate some artificial data with 10 states
    dtrajs = [np.random.randint(0, 10, size=10000) for _ in range(5)]

.. code:: ipython3

    # estimate a Bayesian Markov state model
    bmsm = pyemma.msm.bayesian_markov_model(dtrajs, lag=10)
    print(bmsm)

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

We can now save the estimator (which contains the model) to disk.

.. code:: ipython3

    # now save our model
    bmsm.save('my_models.h5')

We can restore the model by simply invoking the pyemma.load function
with our file name.

.. code:: ipython3

    pyemma.load('my_models.h5')

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

Note that we can save multiple models in one file.
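Conceptually, the file then behaves like a small key-value store of named
models. The following is only a rough sketch of that idea using plain
``pickle`` and a dict, not PyEMMA's actual HDF5 machinery; the helper
names ``save_model`` and ``load_model`` are made up for illustration:

```python
import os
import pickle


def save_model(path, model, model_name='default', overwrite=False):
    """Toy sketch: store `model` under `model_name` inside one file.

    Mirrors the behaviour described in this notebook: several models can
    live in one file, and clobbering an existing name requires an
    explicit overwrite=True.
    """
    store = {}
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            store = pickle.load(fh)
    if model_name in store and not overwrite:
        raise RuntimeError('model "%s" already exists. Either use '
                           'overwrite=True, or use a different name/file.'
                           % model_name)
    store[model_name] = model
    with open(path, 'wb') as fh:
        pickle.dump(store, fh)


def load_model(path, model_name='default'):
    """Toy sketch: fetch one named model back out of the file."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)[model_name]
```

The real implementation stores each model in its own HDF5 group along
with metadata (creation time, PyEMMA version, digest), but the naming and
overwrite semantics follow this pattern.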
Because HDF5 acts like a file system, each model lives in a separate
"folder", completely independent of the other models. We now change a
parameter during estimation and save the estimator again in the same
file, but in a different "folder".

.. code:: ipython3

    bmsm.estimate(dtrajs, lag=100)
    print(bmsm)
    bmsm.save('my_models.h5', model_name='lag100')

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

Likewise, when we want to restore the model under its new name, we
have to pass that name to the load function.

.. code:: ipython3

    pyemma.load('my_models.h5', model_name='lag100')

.. parsed-literal::

    BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
                nsteps=3, reversible=True, show_progress=False, sparse=False,
                statdist_constraint=None)

As you may have noted, there was no need to pass a model name for the
first save and load. For convenience, the model is stored under the
model_name "default" if the argument is not provided.

To check which models are contained in a file, we provide a command
line tool named "pyemma_list_models".

.. code:: ipython3

    ! pyemma_list_models

.. parsed-literal::

    usage: pyemma_list_models [-h] [--json] [--recursive] [-v] files [files ...]
    pyemma_list_models: error: the following arguments are required: files

.. code:: ipython3

    ! pyemma_list_models my_models.h5

.. parsed-literal::

    PyEMMA models
    =============
    file: my_models.h5
    --------------------------------------------------------------------------------
    1. name: default
       created: Tue Apr 10 18:52:26 2018
       BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                   dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
                   nsteps=3, reversible=True, show_progress=False, sparse=False,
                   statdist_constraint=None)
    2. name: lag100
       created: Tue Apr 10 18:52:27 2018
       BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
                   dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
                   nsteps=3, reversible=True, show_progress=False, sparse=False,
                   statdist_constraint=None)
    --------------------------------------------------------------------------------

You can also check the list of already stored models directly in
PyEMMA.

.. code:: ipython3

    content = pyemma.list_models('my_models.h5')
    print("available models:", content.keys())
    print("-" * 80)
    print("detailed:")
    pprint.pprint(content)

.. parsed-literal::

    available models: dict_keys(['default', 'lag100'])
    --------------------------------------------------------------------------------
    detailed:
    {'default': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                               "count_mode='effective',\n"
                               "            dt_traj='1 step', lag=10, "
                               "mincount_connectivity='1/n', nsamples=100,\n"
                               '            nsteps=3, reversible=True, '
                               'show_progress=False, sparse=False,\n'
                               '            statdist_constraint=None)',
                 'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                              "count_mode='effective',\n"
                              "            dt_traj='1 step', lag=10, "
                              "mincount_connectivity='1/n', nsamples=100,\n"
                              '            nsteps=3, reversible=True, '
                              'show_progress=False, sparse=False,\n'
                              '            statdist_constraint=None)',
                 'created': 1523379146.8274236,
                 'created_readable': 'Tue Apr 10 18:52:26 2018',
                 'digest': '88b20f42dcdd39947b4d0b04794b54695b84833eb408295338d41181f0cab46b',
                 'pyemma_version': '2.5.2',
                 'saved_streaming_chain': False},
     'lag100': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                              "count_mode='effective',\n"
                              "            dt_traj='1 step', lag=100, "
                              "mincount_connectivity='1/n', nsamples=100,\n"
                              '            nsteps=3, reversible=True, '
                              'show_progress=False, sparse=False,\n'
                              '            statdist_constraint=None)',
                'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                             "count_mode='effective',\n"
                             "            dt_traj='1 step', lag=100, "
                             "mincount_connectivity='1/n', nsamples=100,\n"
                             '            nsteps=3, reversible=True, '
                             'show_progress=False, sparse=False,\n'
                             '            statdist_constraint=None)',
                'created': 1523379147.449291,
                'created_readable': 'Tue Apr 10 18:52:27 2018',
                'digest': '23594166df674205f8dcb5ed9d5c7b64ac8ffd942e00fde91ba26c73cae6e7fe',
                'pyemma_version': '2.5.2',
                'saved_streaming_chain': False}}

Overwriting existing models is also possible, but we have to tell the
save method explicitly that we want to overwrite.

.. code:: ipython3

    # we now expect a failure, because the model already exists in the file
    try:
        bmsm.save('my_models.h5')
    except RuntimeError as e:
        print("can not save:", e)

.. parsed-literal::

    can not save: model "default" already exists. Either use overwrite=True, or use a different name/file.

.. code:: ipython3

    bmsm.save('my_models.h5', overwrite=True)

Save Pipelines
--------------

In pyemma.coordinates one often has chains of Estimators, e.g. a
reader followed by some transformations and finally a clustering. If
you want to preserve this definition of the data flow, you can request
this during save.

.. code:: ipython3

    # create some data; in principle, this could also be a FeatureReader
    # used to process MD data.
    data = np.random.random((1000, 2))

    from pyemma.coordinates import source, tica, cluster_kmeans

    reader = source(data)
    tica_obj = tica(reader, lag=10)  # named tica_obj to avoid shadowing the tica function
    clust = cluster_kmeans(tica_obj)

    print('clustering:', clust)
    print('tica:', tica_obj)
    print('source:', reader)

.. parsed-literal::

    clustering: KmeansClustering(clustercenters=array([[-0.04867, -0.00542],
           [ 0.04776, -0.00615],
           [ 0.00866,  0.02063],
           [-0.00727, -0.02088],
           [-0.03013,  0.01751],
           [ 0.03868,  0.03487],
           [ 0.01011, -0.04399],
           [-0.07364,  0.0013 ],
           [ 0.07898,  0.00271],
           [ 0.0187 , -0.016...
           [-0.04348,  0.02851],
           [-0.0293 , -0.03614],
           [-0.02583,  0.03866]], dtype=float32),
             fixed_seed=1206330312, init_strategy='kmeans++', keep_data=False,
             max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
             oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
    tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
       ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
       weights=None)
    source: DataInMemory(data=[array([[ 0.0849984 ,  0.75388626],
           [ 0.76335831,  0.23260134],
           [ 0.20315025,  0.95727845],
           ...,
           [ 0.82022961,  0.42732264],
           [ 0.95787558,  0.24492272],
           [ 0.40078893,  0.32759295]])], chunksize=16777216)

The setting "save_streaming_chain" controls whether the input chain of
the object being saved is stored as well.

.. code:: ipython3

    clust.save('pipeline.h5', save_streaming_chain=True)

The list models tool also shows the saved chain in a human-readable
fashion.

.. code:: ipython3

    ! pyemma_list_models pipeline.h5

.. parsed-literal::

    PyEMMA models
    =============
    file: pipeline.h5
    --------------------------------------------------------------------------------
    1. name: default
       created: Tue Apr 10 18:52:30 2018
       KmeansClustering(clustercenters=array([[-0.04867, -0.00542],
           [ 0.04776, -0.00615],
           [ 0.00866,  0.02063],
           [-0.00727, -0.02088],
           [-0.03013,  0.01751],
           [ 0.03868,  0.03487],
           [ 0.01011, -0.04399],
           [-0.07364,  0.0013 ],
           [ 0.07898,  0.00271],
           [ 0.0187 , -0.016...
           [-0.04348,  0.02851],
           [-0.0293 , -0.03614],
           [-0.02583,  0.03866]], dtype=float32),
             fixed_seed=1206330312, init_strategy='kmeans++', keep_data=False,
             max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
             oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
    ---------Input chain---------
    1. DataInMemory(data=[array([[ 0.0849984 ,  0.75388626],
           [ 0.76335831,  0.23260134],
           [ 0.20315025,  0.95727845],
           ...,
           [ 0.82022961,  0.42732264],
           [ 0.95787558,  0.24492272],
           [ 0.40078893,  0.32759295]])], chunksize=16777216)
    2. TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
       ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
       weights=None)
    --------------------------------------------------------------------------------

.. code:: ipython3

    restored = pyemma.load('pipeline.h5')

    print('clustering:', restored)
    print('tica:', restored.data_producer)
    print('source:', restored.data_producer.data_producer)

.. parsed-literal::

    clustering: KmeansClustering(clustercenters=array([[-0.04867, -0.00542],
           [ 0.04776, -0.00615],
           [ 0.00866,  0.02063],
           [-0.00727, -0.02088],
           [-0.03013,  0.01751],
           [ 0.03868,  0.03487],
           [ 0.01011, -0.04399],
           [-0.07364,  0.0013 ],
           [ 0.07898,  0.00271],
           [ 0.0187 , -0.016...
           [-0.04348,  0.02851],
           [-0.0293 , -0.03614],
           [-0.02583,  0.03866]], dtype=float32),
             fixed_seed=1206330312, init_strategy='kmeans++', keep_data=False,
             max_iter=10, metric='euclidean', n_clusters=31, n_jobs=4,
             oom_strategy='memmap', skip=0, stride=1, tolerance=1e-05)
    tica: TICA(commute_map=False, dim=-1, epsilon=1e-06, kinetic_map=True, lag=10,
       ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
       weights=None)
    source: DataInMemory(data=[array([[ 0.0849984 ,  0.75388626],
           [ 0.76335831,  0.23260134],
           [ 0.20315025,  0.95727845],
           ...,
           [ 0.82022961,  0.42732264],
           [ 0.95787558,  0.24492272],
           [ 0.40078893,  0.32759295]])], chunksize=16777216)

As you can see, we can access all elements of the pipeline via the
data_producer attribute. In principle, we can assign these elements to
variables again, change estimation parameters, and re-estimate parts
of the pipeline.

This concludes the storage tutorial of PyEMMA. Happy saving!
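The pattern above relies only on each pipeline element keeping a
reference to its input. The following minimal sketch illustrates that
linkage with a toy ``Stage`` class (a made-up stand-in, not the actual
PyEMMA classes):

```python
class Stage:
    """Toy stand-in for a pipeline element that remembers its input."""

    def __init__(self, name, data_producer=None):
        self.name = name
        # None marks the reader at the head of the chain
        self.data_producer = data_producer


def chain_of(last_stage):
    """Walk from the final stage back to the reader and collect names."""
    names = []
    node = last_stage
    while node is not None:
        names.append(node.name)
        node = node.data_producer
    return names


# build a chain analogous to reader -> tica -> kmeans
reader = Stage('source')
tica_stage = Stage('tica', reader)
clustering = Stage('kmeans', tica_stage)

print(chain_of(clustering))  # ['kmeans', 'tica', 'source']
```

Saving with save_streaming_chain=True essentially persists these
backlinks along with the final object, which is why a single load call
can return the whole pipeline.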