pyemma.coordinates.pca

pyemma.coordinates.pca(data=None, dim=-1, var_cutoff=0.95, stride=1, mean=None, skip=0, chunksize=None, **kwargs)

Principal Component Analysis (PCA).

PCA is a linear transformation method that finds coordinates of maximal variance. A linear projection onto the principal components thus makes a minimal error in terms of variation in the data. Note, however, that this method is not optimal for Markov model construction because for that purpose the main objective is to preserve the slow processes which can sometimes be associated with small variance.

It estimates a PCA transformation from data. When input data is given as an argument, the estimation will be carried out right away, and the resulting object can be used to obtain eigenvalues, eigenvectors or project input data onto the principal components. If data is not given, this object is an empty estimator and can be put into a pipeline() in order to use PCA in streaming mode.
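The estimation described above can be sketched in plain numpy. This is a conceptual illustration of streaming-mode parametrization, not the PyEMMA implementation: sufficient statistics (frame count, feature sums, and the outer-product sum) are accumulated chunk by chunk, and the covariance matrix and its eigenvectors are computed only once all data has been seen. The function name `streaming_pca_estimate` is illustrative.

```python
import numpy as np

def streaming_pca_estimate(chunks):
    """Accumulate covariance statistics over data chunks, then diagonalize."""
    n, s, outer = 0, None, None
    for X in chunks:  # each chunk is a (T_chunk, d) array
        if s is None:
            s = np.zeros(X.shape[1])
            outer = np.zeros((X.shape[1], X.shape[1]))
        n += X.shape[0]
        s += X.sum(axis=0)
        outer += X.T @ X
    mean = s / n
    # mean-free covariance: E[X X^T] - mu mu^T
    cov = outer / n - np.outer(mean, mean)
    # eigenvectors sorted by decreasing eigenvalue are the principal components
    variances, components = np.linalg.eigh(cov)
    order = np.argsort(variances)[::-1]
    return mean, variances[order], components[:, order]
```

Because only the accumulated statistics are kept in memory, the same scheme works whether the data arrives as one in-memory array or is streamed from a reader in a pipeline.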

Parameters
  • data (ndarray (T, d), list of ndarray (T_i, d), or a reader created by the source function) – data array or list of data arrays. T or T_i are the number of time steps in a trajectory. When data is given, the PCA is immediately parametrized by estimating the covariance matrix and computing its eigenvectors.

  • dim (int, optional, default -1) – the number of dimensions (principal components) to project onto. A call to the map function reduces the d-dimensional input to only dim dimensions such that the data preserves the maximum possible variance amongst dim-dimensional linear projections. -1 means all numerically available dimensions will be used unless reduced by var_cutoff. Setting dim to a positive value is exclusive with var_cutoff.

  • var_cutoff (float in the range [0, 1], optional, default 0.95) – Determines the number of output dimensions by including dimensions until their cumulative variance exceeds the fraction var_cutoff. var_cutoff=1.0 means all numerically available dimensions (see epsilon) will be used, unless restricted by dim. Setting var_cutoff smaller than 1.0 is exclusive with dim.

  • stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this could cause this calculation to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. Note that the stride option in the get_output() function of the returned object is independent, so you can parametrize at a long stride, and still map all frames through the transformer.

  • mean (ndarray, optional, default None) – Optionally pass pre-calculated means to avoid their re-computation. The shape has to match the input dimension.

  • skip (int, default=0) – skip the first n frames of each trajectory.

  • chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value here to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.
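The interaction between dim and var_cutoff can be made concrete with a small numpy sketch. The helper name `select_dim` is hypothetical; it only illustrates the documented rule that with dim=-1, dimensions are included until their cumulative variance fraction reaches var_cutoff, while a positive dim takes precedence.

```python
import numpy as np

def select_dim(eigenvalues, dim=-1, var_cutoff=0.95):
    """Pick the number of output dimensions from PCA eigenvalues
    (assumed sorted in decreasing order)."""
    if dim > 0:
        return dim  # explicit dimension overrides the variance cutoff
    cumvar = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # first index at which the cumulative variance reaches the cutoff
    return int(np.searchsorted(cumvar, var_cutoff) + 1)
```

For eigenvalues (4, 3, 2, 1), the cumulative variance fractions are 0.4, 0.7, 0.9, 1.0, so the default cutoff of 0.95 keeps all four dimensions, while a cutoff of 0.5 keeps two.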

Returns

pca – Object for principal component analysis (PCA). It contains the PCA eigenvalues and eigenvectors, and the projection of the input data onto the dominant principal components.

Return type

a PCA transformation object

Notes

Given a sequence of multivariate data \(X_t\), this method computes the mean-free covariance matrix

\[C = (X - \mu)^T (X - \mu)\]

and solves the eigenvalue problem

\[C r_i = \sigma_i r_i,\]

where \(r_i\) are the principal components and \(\sigma_i\) are their respective variances.

When used as a dimension reduction method, the input data is projected onto the dominant principal components.
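The Notes above can be worked through directly in numpy: build the mean-free covariance, solve the eigenvalue problem \(C r_i = \sigma_i r_i\), and project the data onto the dominant principal components (here, the top 2). This is a sketch of the underlying linear algebra, not PyEMMA code.

```python
import numpy as np

# Synthetic data with very different variances along the three input axes.
X = np.random.default_rng(1).normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu)        # mean-free covariance matrix (unnormalized)
sigma, R = np.linalg.eigh(C)     # eigenvalues in ascending order
order = np.argsort(sigma)[::-1]  # sort by decreasing variance
Y = (X - mu) @ R[:, order[:2]]   # projection onto the top 2 components
```

The projected coordinates are ordered by decreasing variance and are mutually uncorrelated, since the eigenvectors diagonalize the sample covariance.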

See the Wiki page for more theory and references.

See also

PCA : pca object

tica : for time-lagged independent component analysis

class pyemma.coordinates.transform.pca.PCA(*args, **kwargs)

Principal component analysis.

Methods

describe()

Get a descriptive string representation of this class.

dimension()

output dimension

estimate(X, **kwargs)

Estimates the model given the data X

fit(X[, y])

Estimates parameters - for compatibility with sklearn.

fit_transform(X[, y])

Fit to data, then transform it.

get_output([dimensions, stride, skip, chunk])

Maps all input data of this transformer and returns it as an array or list of arrays

get_params([deep])

Get parameters for this estimator.

iterator([stride, lag, chunk, …])

creates an iterator to stream over the (transformed) data.

load(file_name[, model_name])

Loads a previously saved PyEMMA object from disk.

n_chunks(chunksize[, stride, skip])

how many chunks an iterator of this source will output.

n_frames_total([stride, skip])

Returns total number of frames.

number_of_trajectories([stride])

Returns the number of trajectories.

output_type()

By default transformers return single precision floats.

partial_fit(X)

Incrementally update the model with data X.

save(file_name[, model_name, overwrite, …])

saves the current state of this object to given file and name.

set_params(**params)

Set the parameters of this estimator.

trajectory_length(itraj[, stride, skip])

Returns the length of trajectory of the requested index.

trajectory_lengths([stride, skip])

Returns the length of each trajectory.

transform(X)

Maps the input data through the transformer to correspondingly shaped output data array/list.

write_to_csv([filename, extension, …])

write all data to csv with numpy.savetxt

write_to_hdf5(filename[, group, …])

writes all data of this Iterable to a given HDF5 file.

Attributes

property cumvar
property data_producer

The data producer for this data source object (can be another data source object).

describe()

Get a descriptive string representation of this class.

dimension()

output dimension

property eigenvalues
property eigenvectors
property feature_PC_correlation

Instantaneous correlation matrix between input features and PCs

Denoting the input features as \(X_i\) and the PCs as \(\theta_j\), the instantaneous, linear correlation between them can be written as

\[\mathbf{Corr}(X_i, \mathbf{\theta}_j) = \frac{1}{\sigma_{X_i}}\sum_l \sigma_{X_iX_l} \mathbf{U}_{lj}\]

The matrix \(\mathbf{U}\) contains, as column vectors, the eigenvectors of the input-feature covariance matrix.

Returns

feature_PC_correlation – correlation matrix between input features and PCs. There is a row for each feature and a column for each PC.

Return type

ndarray(n,m)
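The correlation formula above amounts to multiplying the feature covariance matrix by the eigenvector matrix and dividing each row by the standard deviation of the corresponding input feature. The following numpy sketch (an assumed equivalent, not the PyEMMA source) makes that explicit; the helper name `feature_pc_correlation` is illustrative.

```python
import numpy as np

def feature_pc_correlation(cov, eigenvectors):
    """Corr(X_i, theta_j) = (1 / sigma_{X_i}) * sum_l sigma_{X_i X_l} U_{lj}:
    one row per input feature, one column per principal component."""
    feature_sigma = np.sqrt(np.diag(cov))
    return (cov @ eigenvectors) / feature_sigma[:, None]
```

Since the eigenvectors diagonalize the covariance matrix, entry (i, j) reduces to the j-th eigenvalue times U_{ij}, scaled by 1/sigma_{X_i}.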

property mean
partial_fit(X)

References

[1] Pearson, K. 1901. On Lines and Planes of Closest Fit to Systems of Points in Space. Phil. Mag. 2, 559–572.

[2] Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Edu. Psych. 24, 417–441 and 498–520.