pyemma.coordinates.featurizer¶
-
pyemma.coordinates.
featurizer
(topfile)¶ Featurizer to select features from MD data.
- Parameters
topfile (str or mdtraj.Topology instance) – path to topology file (e.g pdb file) or a mdtraj.Topology object
- Returns
feat
- Return type
Examples
Create a featurizer and add backbone torsion angles to active features. Then use it in
source()
>>> import pyemma.coordinates # doctest: +SKIP >>> feat = pyemma.coordinates.featurizer('my_protein.pdb') # doctest: +SKIP >>> feat.add_backbone_torsions() # doctest: +SKIP >>> reader = pyemma.coordinates.source(["my_traj01.xtc", "my_traj02.xtc"], features=feat) # doctest: +SKIP
or
>>> traj = mdtraj.load('my_protein.pdb') # # doctest: +SKIP >>> feat = pyemma.coordinates.featurizer(traj.topology) # doctest: +SKIP
-
class
pyemma.coordinates.data.featurization.featurizer.
MDFeaturizer
(topfile, **kwargs)¶ Extracts features from MD trajectories.
Methods
add_all
([reference, atom_indices, …])Adds all atom coordinates to the feature list.
add_angles
(indexes[, deg, cossin, periodic])Adds the list of angles to the feature list
add_backbone_torsions
([selstr, deg, cossin, …])Adds all backbone phi/psi angles or the ones specified in
selstr
to the feature list.add_chi1_torsions
([selstr, deg, cossin, …])Adds all chi1 angles or the ones specified in
selstr
to the feature list.add_contacts
(indices[, indices2, threshold, …])Adds the contacts to the feature list.
add_custom_feature
(feature)Adds a custom feature to the feature list.
add_custom_func
(func, dim, *args, **kwargs)adds a user defined function to extract features
add_dihedrals
(indexes[, deg, cossin, periodic])Adds the list of dihedrals to the feature list
add_distances
(indices[, periodic, indices2])Adds the distances between atoms to the feature list.
add_distances_ca
([periodic, excluded_neighbors])Adds the distances between all Ca’s to the feature list.
add_group_COM
(group_definitions[, ref_geom, …])Adds the centers of mass (COM) in cartesian coordinates of a group or groups of atoms.
add_group_mindist
(group_definitions[, …])Adds the minimum distance between groups of atoms to the feature list.
add_inverse_distances
(indices[, periodic, …])Adds the inverse distances between atoms to the feature list.
add_minrmsd_to_ref
(ref[, ref_frame, …])Adds the minimum root-mean-square-deviation (minrmsd) with respect to a reference structure to the feature list.
add_residue_COM
(residue_indices[, scheme, …])Adds a per-residue center of mass (COM) in cartesian coordinates.
add_residue_mindist
([residue_pairs, scheme, …])Adds the minimum distance between residues to the feature list.
add_selection
(indexes[, reference, …])Adds the coordinates of the selected atom indexes to the feature list.
add_sidechain_torsions
([selstr, deg, …])Adds all side chain torsion angles or the ones specified in
selstr
to the feature list.describe
()Returns a list of strings, one for each feature selected, with human-readable descriptions of the features.
current dimension due to selected features
load
(file_name[, model_name])Loads a previously saved PyEMMA object from disk.
pairs
(sel[, excluded_neighbors])Creates all pairs between indexes.
Remove all instances of CustomFeature from the active feature list.
save
(file_name[, model_name, overwrite, …])saves the current state of this object to given file and name.
select
(selstring)Returns the indexes of atoms matching the given selection
Returns the indexes of backbone C, CA and N atoms
Returns the indexes of all Ca-atoms
select_Heavy
([exclude_symmetry_related])Returns the indexes of all heavy atoms (Mass >= 2), optionally excluding symmetry-related heavy atoms.
transform
(traj)Maps an mdtraj Trajectory object to the selected output features
Attributes
The logger for this class instance
The name of this instance
-
add_all
(reference=None, atom_indices=None, ref_atom_indices=None)¶ Adds all atom coordinates to the feature list. The coordinates are flattened as follows: [x1, y1, z1, x2, y2, z2, …]
- Parameters
reference (mdtraj.Trajectory or None, default=None) – if given, all data is being aligned to the given reference with Trajectory.superpose
atom_indices (array_like, or None) – The indices of the atoms to superpose. If not supplied, all atoms will be used.
ref_atom_indices (array_like, or None) – Use these atoms on the reference structure. If not supplied, the same atom indices will be used for this trajectory and the reference one.
-
add_angles
(indexes, deg=False, cossin=False, periodic=True)¶ Adds the list of angles to the feature list
- Parameters
indexes (np.ndarray, shape=(num_pairs, 3), dtype=int) – an array with triplets of atom indices
deg (bool, optional, default = False) – If False (default), angles will be computed in radians. If True, angles will be computed in degrees.
cossin (bool, optional, default = False) – If True, each angle will be returned as a pair of (sin(x), cos(x)). This is useful, if you calculate the mean (e.g TICA/PCA, clustering) in that space.
periodic (bool, optional, default = True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
-
add_backbone_torsions
(selstr=None, deg=False, cossin=False, periodic=True)¶ Adds all backbone phi/psi angles or the ones specified in
selstr
to the feature list.- Parameters
selstr (str, optional, default = "") – selection string specifying the atom selection used to specify a specific set of backbone angles If “” (default), all phi/psi angles found in the topology will be computed
deg (bool, optional, default = False) – If False (default), angles will be computed in radians. If True, angles will be computed in degrees.
cossin (bool, optional, default = False) – If True, each angle will be returned as a pair of (sin(x), cos(x)). This is useful, if you calculate the mean (e.g TICA/PCA, clustering) in that space.
periodic (bool, optional, default = True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
-
add_chi1_torsions
(selstr='', deg=False, cossin=False, periodic=True)¶ Adds all chi1 angles or the ones specified in
selstr
to the feature list.- Parameters
selstr (str, optional, default = "") – selection string specifying the atom selection used to specify a specific set of backbone angles If “” (default), all chi1 angles found in the topology will be computed
deg (bool, optional, default = False) – If False (default), angles will be computed in radians. If True, angles will be computed in degrees.
cossin (bool, optional, default = False) – If True, each angle will be returned as a pair of (sin(x), cos(x)). This is useful, if you calculate the mean (e.g TICA/PCA, clustering) in that space.
periodic (bool, optional, default = True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
-
add_contacts
(indices, indices2=None, threshold=0.3, periodic=True, count_contacts=False)¶ Adds the contacts to the feature list.
- Parameters
indices (can be of two types:) –
- ndarray((n, 2), dtype=int):
n x 2 array with the pairs of atoms between which the contacts shall be computed
- iterable of integers (either list or ndarray(n, dtype=int)):
indices (not pairs of indices) of the atoms between which the contacts shall be computed.
indices2 (iterable of integers (either list or ndarray(n, dtype=int)), optional:) – Only has effect if
indices
is an iterable of integers. Instead of the above behaviour, only the contacts between the atoms inindices
andindices2
will be computed.threshold (float, optional, default = .3) – distances below this threshold (in nm) will result in a feature 1.0, distances above will result in 0.0. The default is set to .3 nm (3 Angstrom)
periodic (boolean, default True) – use the minimum image convention if unitcell information is available
count_contacts (boolean, default False) – If set to true, this feature will return the number of formed contacts (and not feature values with either 1.0 or 0) The ouput of this feature will be of shape (Nt,1), and not (Nt, nr_of_contacts)
Note
When using the iterable of integers input,
indices
andindices2
will be sorted numerically and made unique before converting them to a pairlist. Please look carefully at the output ofdescribe()
to see what features exactly have been added.
-
add_custom_feature
(feature)¶ Adds a custom feature to the feature list.
- Parameters
feature (object) – an object with interface like CustomFeature (map, describe methods)
-
add_custom_func
(func, dim, *args, **kwargs)¶ adds a user defined function to extract features
- Parameters
func (function) – a user-defined function, which accepts mdtraj.Trajectory object as first parameter and as many optional and named arguments as desired. Has to return a numpy.ndarray ndim=2.
dim (int) – output dimension of
function
description (str or None) – a message for the describe feature list.
args (any number of positional arguments) – these have to be in the same order as
func
is expecting themkwargs (dictionary) – named arguments passed to func
Notes
You can pass a description list to describe the output of your function by element, by passing a list of strings with the same lengths as dimensions. Alternatively a single element list or str will be expanded to match the output dimension.
-
add_dihedrals
(indexes, deg=False, cossin=False, periodic=True)¶ Adds the list of dihedrals to the feature list
- Parameters
indexes (np.ndarray, shape=(num_pairs, 4), dtype=int) – an array with quadruplets of atom indices
deg (bool, optional, default = False) – If False (default), angles will be computed in radians. If True, angles will be computed in degrees.
cossin (bool, optional, default = False) – If True, each angle will be returned as a pair of (sin(x), cos(x)). This is useful, if you calculate the mean (e.g TICA/PCA, clustering) in that space.
periodic (bool, optional, default = True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
-
add_distances
(indices, periodic=True, indices2=None)¶ Adds the distances between atoms to the feature list.
- Parameters
indices (can be of two types:) –
- ndarray((n, 2), dtype=int):
n x 2 array with the pairs of atoms between which the distances shall be computed
- iterable of integers (either list or ndarray(n, dtype=int)):
indices (not pairs of indices) of the atoms between which the distances shall be computed.
periodic (optional, boolean, default is True) – If periodic is True and the trajectory contains unitcell information, distances will be computed under the minimum image convention.
indices2 (iterable of integers (either list or ndarray(n, dtype=int)), optional:) – Only has effect if
indices
is an iterable of integers. Instead of the above behaviour, only the distances between the atoms inindices
andindices2
will be computed.
Note
When using the iterable of integers input,
indices
andindices2
will be sorted numerically and made unique before converting them to a pairlist. Please look carefully at the output ofdescribe()
to see what features exactly have been added.
-
add_distances_ca
(periodic=True, excluded_neighbors=2)¶ Adds the distances between all Ca’s to the feature list.
- Parameters
periodic (boolean, default is True) – Use the minimum image convetion when computing distances
excluded_neighbors (int, default is 2) – Number of exclusions when compiling the list of pairs. Two CA-atoms are considered neighbors if they belong to adjacent residues.
-
add_group_COM
(group_definitions, ref_geom=None, image_molecules=False, mass_weighted=True)¶ Adds the centers of mass (COM) in cartesian coordinates of a group or groups of atoms. If these group definitions coincide directly with residues, use
add_residue_COM
instead. No periodic boundaries are taken into account.- Parameters
group_definitions (iterable of integers) – List of the groups of atom indices for which the COM will be computed. The atoms are zero-indexed.
ref_geom (
mdtraj.Trajectory
, default is None) – The coordinates can be centered to a reference geometry before computing the COM.image_molecules (boolean, default is False) – The method traj.image_molecules will be called before computing averages. The method tries to correct for molecules broken across periodic boundary conditions, but can be time consuming. See http://mdtraj.org/latest/api/generated/mdtraj.Trajectory.html#mdtraj.Trajectory.image_molecules for more details
mass_weighted (boolean, default is True) – Set to False if you want the geometric center and not the COM
Note
Centering (with
ref_geom
) and imaging (image_molecules=True
) the trajectories can sometimes be time consuming. Consider doing that to your trajectory-files prior to the featurization.
-
add_group_mindist
(group_definitions, group_pairs='all', threshold=None, periodic=True)¶ Adds the minimum distance between groups of atoms to the feature list. If the groups of atoms are identical to residues, use
add_residue_mindist
.- Parameters
group_definitions (list of 1D-arrays/iterables containing the group definitions via atom indices.) – If there is only one group_definition, it is assumed the minimum distance within this group (excluding the self-distance) is wanted. In this case,
group_pairs
is ignored.group_pairs (Can be of two types:) –
- ‘all’
Computes minimum distances between all pairs of groups contained in the group definitions
- ndarray((n, 2), dtype=int):
n x 2 array with the pairs of groups for which the minimum distances will be computed.
threshold (float, optional, default is None) – distances below this threshold (in nm) will result in a feature 1.0, distances above will result in 0.0. If left to None, the numerical value will be returned
periodic (bool, optional, default = True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
-
add_inverse_distances
(indices, periodic=True, indices2=None)¶ Adds the inverse distances between atoms to the feature list.
- Parameters
indices (can be of two types:) –
- ndarray((n, 2), dtype=int):
n x 2 array with the pairs of atoms between which the inverse distances shall be computed
- iterable of integers (either list or ndarray(n, dtype=int)):
indices (not pairs of indices) of the atoms between which the inverse distances shall be computed.
periodic (optional, boolean, default is True) – If periodic is True and the trajectory contains unitcell information, distances will be computed under the minimum image convention.
indices2 (iterable of integers (either list or ndarray(n, dtype=int)), optional:) – Only has effect if
indices
is an iterable of integers. Instead of the above behaviour, only the inverse distances between the atoms inindices
andindices2
will be computed.
Note
When using the iterable of integers input,
indices
andindices2
will be sorted numerically and made unique before converting them to a pairlist. Please look carefully at the output ofdescribe()
to see what features exactly have been added.
-
add_minrmsd_to_ref
(ref, ref_frame=0, atom_indices=None, precentered=False)¶ Adds the minimum root-mean-square-deviation (minrmsd) with respect to a reference structure to the feature list.
- Parameters
ref –
Reference structure for computing the minrmsd. Can be of two types:
mdtraj.Trajectory
objectfilename for mdtraj to load. In this case, only the
ref_frame
of that file will be used.
ref_frame (integer, default=0) – Reference frame of the filename specified in
ref
. This parameter has no effect ifref
is not a filename.atom_indices (array_like, default=None) –
Atoms that will be used for:
aligning the target and reference geometries.
computing rmsd after the alignment.
If left to None, all atoms of
ref
will be used.precentered (bool, default=False) – Use this boolean at your own risk to let mdtraj know that the target conformations are already centered at the origin, i.e., their (uniformly weighted) center of mass lies at the origin. This will speed up the computation of the rmsd.
-
add_residue_COM
(residue_indices, scheme='all', ref_geom=None, image_molecules=False, mass_weighted=True)¶ Adds a per-residue center of mass (COM) in cartesian coordinates. No periodic boundaries are taken into account.
- Parameters
residue_indices (iterable of integers) – The residue indices for which the COM will be computed. These are always zero-indexed that are not necessarily the residue sequence record of the topology (resSeq). resSeq indices start at least at 1 but can depend on the topology. See http://mdtraj.org/latest/atom_selection.html for more details.
scheme (str, default is 'all') – What atoms contribute to the COM computation. The supported keywords are: ‘all’, ‘backbone’, ‘sidechain’ . If the scheme yields no atoms for some residue, the selection falls back to ‘all’ for that residue.
ref_geom (obj:
mdtraj.Trajectory
, default is None) – The coordinates can be centered to a reference geometry before computing the COM.image_molecules (boolean, default is False) – The method traj.image_molecules will be called before computing averages. The method tries to correct for molecules broken across periodic boundary conditions, but can be time consuming. See http://mdtraj.org/latest/api/generated/mdtraj.Trajectory.html#mdtraj.Trajectory.image_molecules for more details
mass_weighted (boolean, default is True) – Set to False if you want the geometric center and not the COM
Note
Centering (with
ref_geom
) and imaging (image_molecules=True
) the trajectories can sometimes be time consuming. Consider doing that to your trajectory-files prior to the featurization.
-
add_residue_mindist
(residue_pairs='all', scheme='closest-heavy', ignore_nonprotein=True, threshold=None, periodic=True)¶ Adds the minimum distance between residues to the feature list. See below how the minimum distance can be defined. If the topology generated out of
topfile
contains information on periodic boundary conditions, the minimum image convention will be used when computing distances.- Parameters
residue_pairs (can be of two types:) –
- ‘all’
Computes distances between all pairs of residues excluding first and second neighbors
- ndarray((n, 2), dtype=int):
n x 2 array with the pairs residues for which distances will be computed
scheme ('ca', 'closest', 'closest-heavy', default is closest-heavy) – Within a residue, determines the sub-group atoms that will be considered when computing distances
ignore_nonprotein (boolean, default True) – Ignore residues that are not of protein type (e.g. water molecules, post-traslational modifications etc)
threshold (float, optional, default is None) – distances below this threshold (in nm) will result in a feature 1.0, distances above will result in 0.0. If left to None, the numerical value will be returned
periodic (bool, optional, default = True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
Note
Using
scheme
= ‘closest’ or ‘closest-heavy’ withresidue pairs
= ‘all’ will compute nearly all interatomic distances, for every frame, before extracting the closest pairs. This can be very time consuming. Those schemes are intended to be used with a subset of residues chosen viaresidue_pairs
.
-
add_selection
(indexes, reference=None, atom_indices=None, ref_atom_indices=None)¶ Adds the coordinates of the selected atom indexes to the feature list. The coordinates of the selection [1, 2, …] are flattened as follows: [x1, y1, z1, x2, y2, z2, …]
- Parameters
indexes (ndarray((n), dtype=int)) – array with selected atom indexes
reference (mdtraj.Trajectory or None, default=None) – if given, all data is being aligned to the given reference with Trajectory.superpose
atom_indices (array_like, or None) – The indices of the atoms to superpose. If not supplied, all atoms will be used.
ref_atom_indices (array_like, or None) – Use these atoms on the reference structure. If not supplied, the same atom indices will be used for this trajectory and the reference one.
-
add_sidechain_torsions
(selstr=None, deg=False, cossin=False, periodic=True, which='all')¶ Adds all side chain torsion angles or the ones specified in
selstr
to the feature list.- Parameters
selstr (str, optional, default=None) – selection string specifying the atom selection used to specify a specific set of backbone angles If “” (default), all chi1 angles found in the topology will be computed
deg (bool, optional, default=False) – If False (default), angles will be computed in radians. If True, angles will be computed in degrees.
cossin (bool, optional, default=False) – If True, each angle will be returned as a pair of (sin(x), cos(x)). This is useful, if you calculate the mean (e.g TICA/PCA, clustering) in that space.
periodic (bool, optional, default=True) – If periodic is True and the trajectory contains unitcell information, we will treat dihedrals that cross periodic images using the minimum image convention.
which (str or list of str, default='all') – one or combination of (‘all’, ‘chi1’, ‘chi2’, ‘chi3’, ‘chi4’, ‘chi5’)
-
describe
()¶ Returns a list of strings, one for each feature selected, with human-readable descriptions of the features.
- Returns
labels – An ordered list of strings, one for each feature selected, with human-readable descriptions of the features.
- Return type
list of str
-
dimension
()¶ current dimension due to selected features
- Returns
dim – total dimension due to all selection features
- Return type
int
-
classmethod
load
(file_name, model_name='default')¶ Loads a previously saved PyEMMA object from disk.
- Parameters
file_name (str or file like object (has to provide read method)) – The file like object tried to be read for a serialized object.
model_name (str, default='default') – if multiple models are contained in the file, these can be accessed by their name. Use
pyemma.list_models()
to get a representation of all stored models.
- Returns
obj
- Return type
the de-serialized object
-
logger
¶ The logger for this class instance
-
name
¶ The name of this instance
-
static
pairs
(sel, excluded_neighbors=0)¶ Creates all pairs between indexes. Will exclude closest neighbors up to
excluded_neighbors
The self-pair (i,i) is always excluded- Parameters
sel (ndarray((n), dtype=int)) – array with selected atom indexes
excluded_neighbors (int, default = 0) – number of neighbors that will be excluded when creating the pairs
- Returns
sel – m x 2 array with all pair indexes between different atoms that are at least
excluded_neighbors
indexes apart, i.e. if i is the index of an atom, the pairs [i,i-2], [i,i-1], [i,i], [i,i+1], [i,i+2], will not be insel
(n=excluded_neighbors) ifexcluded_neighbors
= 2. Moreover, the list is non-redundant,i.e. if [i,j] is in sel, then [j,i] is not.- Return type
ndarray((m,2), dtype=int)
-
remove_all_custom_funcs
()¶ Remove all instances of CustomFeature from the active feature list.
-
save
(file_name, model_name='default', overwrite=False, save_streaming_chain=False)¶ saves the current state of this object to given file and name.
- Parameters
file_name (str) – path to desired output file
model_name (str, default='default') – creates a group named ‘model_name’ in the given file, which will contain all of the data. If the name already exists, and overwrite is False (default) will raise a RuntimeError.
overwrite (bool, default=False) – Should overwrite existing model names?
save_streaming_chain (boolean, default=False) – if True, the data_producer(s) of this object will also be saved in the given file.
Examples
>>> import pyemma, numpy as np >>> from pyemma.util.contexts import named_temporary_file >>> m = pyemma.msm.MSM(P=np.array([[0.1, 0.9], [0.9, 0.1]]))
>>> with named_temporary_file() as file: # doctest: +SKIP ... m.save(file, 'simple') # doctest: +SKIP ... inst_restored = pyemma.load(file, 'simple') # doctest: +SKIP >>> np.testing.assert_equal(m.P, inst_restored.P) # doctest: +SKIP
-
select
(selstring)¶ Returns the indexes of atoms matching the given selection
- Parameters
selstring (str) – Selection string. See mdtraj documentation for details: http://mdtraj.org/latest/atom_selection.html
- Returns
indexes – array with selected atom indexes
- Return type
ndarray((n), dtype=int)
-
select_Backbone
()¶ Returns the indexes of backbone C, CA and N atoms
- Returns
indexes – array with selected atom indexes
- Return type
ndarray((n), dtype=int)
-
select_Ca
()¶ Returns the indexes of all Ca-atoms
- Returns
indexes – array with selected atom indexes
- Return type
ndarray((n), dtype=int)
-
select_Heavy
(exclude_symmetry_related=False)¶ Returns the indexes of all heavy atoms (Mass >= 2), optionally excluding symmetry-related heavy atoms.
- Parameters
exclude_symmetry_related (boolean, default=False) – if True, exclude symmetry-related heavy atoms.
- Returns
indexes – array with selected atom indexes
- Return type
ndarray((n), dtype=int)
-
topologyfile
¶
-
transform
(traj)¶ Maps an mdtraj Trajectory object to the selected output features
- Parameters
traj (mdtraj Trajectory) – Trajectory object used as an input
- Returns
out – Output features: For each of T time steps in the given trajectory, a vector with all n output features selected.
- Return type
ndarray((T, n), dtype=float32)
-