pyemma.coordinates.pca

pyemma.coordinates.pca(data=None, dim=2, stride=1)
Principal Component Analysis (PCA).
PCA is a linear transformation method that finds coordinates of maximal variance. A linear projection onto the principal components thus makes a minimal error in terms of variation in the data. Note, however, that this method is not optimal for Markov model construction because for that purpose the main objective is to preserve the slow processes which can sometimes be associated with small variance.
It estimates a PCA transformation from data. When input data is given as an argument, the estimation is carried out right away, and the resulting object can be used to obtain eigenvalues and eigenvectors or to project input data onto the principal components. If data is not given, this object is an empty estimator and can be put into a pipeline() in order to use PCA in streaming mode.

Parameters:
- data (ndarray (T, d) or list of ndarray (T_i, d) or a reader created by the source function) – data array or list of data arrays. T or T_i is the number of time steps in a trajectory. When data is given, the PCA is immediately parametrized by estimating the covariance matrix and computing its eigenvectors.
- dim (int) – the number of dimensions (principal components) to project onto. A call to the map function reduces the d-dimensional input to only dim dimensions such that the data preserves the maximum possible variance amongst dim-dimensional linear projections.
- stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this can make the calculation very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. Note that the stride option in the get_output() function of the returned object is independent, so you can parametrize at a long stride and still map all frames through the transformer.
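The strided-estimation pattern described for the stride parameter can be sketched with plain NumPy. This is a minimal stand-in, not the pyemma implementation; estimate_pca is a hypothetical helper that mimics the estimation step (mean-free covariance plus eigendecomposition):

```python
import numpy as np

def estimate_pca(X, dim):
    """Hypothetical stand-in for the estimation step: mean and
    top-`dim` principal components of (possibly strided) data."""
    mu = X.mean(axis=0)
    C = (X - mu).T @ (X - mu)
    # np.linalg.eigh returns eigenvalues in ascending order,
    # so pick the `dim` largest ones.
    w, V = np.linalg.eigh(C)
    order = np.argsort(w)[::-1][:dim]
    return mu, V[:, order]

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))

# Estimate on every 10th frame only (stride = 10) ...
mu, V = estimate_pca(X[::10], dim=2)

# ... but still map *all* frames through the transformer,
# analogous to calling get_output() with its own (independent) stride.
Y = (X - mu) @ V
print(Y.shape)  # (1000, 2)
```

Because short-lag frames of molecular dynamics data are correlated, the covariance estimated from the strided subset is usually close to the full-data estimate, while the projection remains available for every frame.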
Returns: pca – object for Principal Component Analysis (PCA). It contains the PCA eigenvalues and eigenvectors, and the projection of input data onto the dominant principal components.
Return type: a PCA transformation object

Notes
Given a sequence of multivariate data \(X_t\), computes the mean-free covariance matrix.
\[C = (X - \mu)^T (X - \mu)\]

and solves the eigenvalue problem
\[C r_i = \sigma_i r_i,\]where \(r_i\) are the principal components and \(\sigma_i\) are their respective variances.
When used as a dimension reduction method, the input data is projected onto the dominant principal components.
See the Wiki page for more theory and references.
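The computation in the notes can be sketched in NumPy. This is a minimal illustration of the formulas above, not pyemma's implementation; it verifies that each eigenvalue of the mean-free covariance matrix equals the (unnormalized) variance of the data along its principal component:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data with deliberately different variances per coordinate.
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.3])

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu)           # mean-free covariance (unnormalized)

sigma, R = np.linalg.eigh(C)        # solves C r_i = sigma_i r_i
sigma, R = sigma[::-1], R[:, ::-1]  # sort by decreasing variance

# Projecting onto the principal components diagonalizes the covariance:
# proj.T @ proj == R.T @ C @ R == diag(sigma).
proj = (X - mu) @ R
print(np.allclose(proj.T @ proj, np.diag(sigma)))  # True
```

For dimension reduction, one would keep only the first dim columns of R (the dominant principal components), exactly as the dim parameter of pca() does.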
References
[1] Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Edu. Psych. 24, 417-441 and 498-520.