pyemma.coordinates.load
pyemma.coordinates.load(trajfiles, features=None, top=None, stride=1, chunksize=None, **kw)
Loads coordinate features into memory.
If the data does not fit into memory, consider using pipeline instead, or use the stride option to subsample the data.
Parameters: - trajfiles (str, list of str or nested list (one level) of str) –
A filename or a list of filenames to trajectory files that can be processed by pyemma. Both molecular dynamics trajectory files and raw data files (tabulated ASCII or binary) can be loaded.
- If a nested list of filenames is given, e.g.:
- [['traj1_0.xtc', 'traj1_1.xtc'], 'traj2_full.xtc', ['traj3_0.xtc', …]]
the grouped fragments will be treated as a joint trajectory.
When molecular dynamics trajectory files are loaded either a featurizer must be specified (for reading specific quantities such as distances or dihedrals), or a topology file (in that case only Cartesian coordinates will be read). In the latter case, the resulting feature vectors will have length 3N for each trajectory frame, with N being the number of atoms and (x1, y1, z1, x2, y2, z2, …) being the sequence of coordinates in the vector.
Molecular dynamics trajectory files are loaded through mdtraj (http://mdtraj.org/latest/), and can possess any of the mdtraj-compatible trajectory formats including:
- CHARMM/NAMD (.dcd)
- Gromacs (.xtc)
- Gromacs (.trr)
- AMBER (.binpos)
- AMBER (.netcdf)
- PDB trajectory format (.pdb)
- TINKER (.arc)
- MDTRAJ (.hdf5)
- LAMMPS trajectory format (.lammpstrj)
Raw data can be in the following format:
- tabulated ASCII (.dat, .txt)
- binary python (.npy, .npz)
- features (MDFeaturizer, optional, default = None) – A featurizer object specifying how molecular dynamics files should be read (e.g. intramolecular distances, angles, or dihedrals).
- top (str, mdtraj.Trajectory or mdtraj.Topology, optional, default = None) – A molecular topology file, e.g. in PDB (.pdb) format or an already loaded mdtraj.Topology object. If it is an mdtraj.Trajectory object, the topology will be extracted from it.
- stride (int, optional, default = 1) – Load only every stride-th frame. By default, every frame is loaded.
- chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.
Returns: data – If a single filename was given as input (and unless the format is .npz), the return value is a single ndarray of size (T, d), where T is the number of time steps in the trajectory and d is the number of features (coordinates, observables). When reading from molecular dynamics data without a specific featurizer, each feature vector will have size d=3N and will hold the Cartesian coordinates in the sequence (x1, y1, z1, x2, y2, z2, …). If multiple filenames were given, or if the file is a .npz holding multiple arrays, the result is a list of appropriately shaped arrays.
Return type: ndarray or list of ndarray
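The (x1, y1, z1, x2, y2, z2, …) ordering described above can be illustrated with plain NumPy, without calling pyemma. This is only a sketch of the layout: the array `xyz` and the sizes `T` and `N` are made-up stand-ins for a trajectory, not output of load.

```python
import numpy as np

# Assumed toy sizes: T frames, N atoms (illustrative values only).
T, N = 4, 3

# Stand-in for per-atom Cartesian coordinates, shape (T, N, 3).
xyz = np.arange(T * N * 3, dtype=np.float32).reshape(T, N, 3)

# Flatten each frame to a feature vector of length 3N in the order
# (x1, y1, z1, x2, y2, z2, ...), i.e. shape (T, d) with d = 3N.
flat = xyz.reshape(T, 3 * N)

# Columns 3:6 hold (x2, y2, z2), the coordinates of the second atom.
second_atom = flat[:, 3:6]

# The per-atom view can be recovered by reshaping back.
restored = flat.reshape(T, N, 3)
```

An array of shape (T, 3N) returned for Cartesian coordinates can be reshaped in the same way to index individual atoms.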
See also
pyemma.coordinates.source()
- If the data does not fit into memory, create a data source and pass it to your transformation or clustering algorithms instead of the loaded data. This streams the data and saves memory at the cost of longer processing times.
Examples
>>> from pyemma.coordinates import load
>>> files = ['traj01.xtc', 'traj02.xtc'] # doctest: +SKIP
>>> output = load(files, top='my_structure.pdb') # doctest: +SKIP
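The grouping rule for nested lists and the effect of stride can be mimicked on in-memory arrays. The sketch below does not call pyemma: `join_fragments` is a hypothetical helper, the arrays stand in for loaded files, and stride is modeled as simple slicing, which is an assumption about how "every stride-th frame" applies to a joint trajectory.

```python
import numpy as np

def join_fragments(entries):
    """Hypothetical helper: concatenate grouped fragments along the time axis."""
    joined = []
    for entry in entries:
        if isinstance(entry, list):
            # A nested list is treated as one joint trajectory.
            joined.append(np.concatenate(entry, axis=0))
        else:
            joined.append(entry)
    return joined

# Stand-ins for three files, each with d = 2 features (made-up data).
frag_a = np.zeros((5, 2))
frag_b = np.ones((3, 2))
full = np.full((4, 2), 2.0)

# Mirrors the structure [['traj1_0', 'traj1_1'], 'traj2_full'].
trajs = join_fragments([[frag_a, frag_b], full])
shapes = [t.shape for t in trajs]

# Assumed model of stride: keep every stride-th frame of each trajectory.
stride = 2
strided = [t[::stride] for t in trajs]
```

Two input entries yield a list of two arrays, with the first holding the frames of both fragments in order.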