pyemma.coordinates.pca
pyemma.coordinates.pca(data=None, dim=-1, var_cutoff=0.95, stride=1, mean=None, skip=0, chunksize=None, **kwargs)
Principal Component Analysis (PCA).
PCA is a linear transformation method that finds coordinates of maximal variance. A linear projection onto the principal components thus makes a minimal error in terms of variation in the data. Note, however, that this method is not optimal for Markov model construction: for that purpose the main objective is to preserve the slow processes, which can sometimes be associated with small variance.
This function estimates a PCA transformation from data. When input data is given as an argument, the estimation is carried out right away, and the resulting object can be used to obtain eigenvalues and eigenvectors or to project input data onto the principal components. If data is not given, the returned object is an empty estimator that can be put into a pipeline() in order to use PCA in streaming mode.
- Parameters
data (ndarray (T, d) or list of ndarray (T_i, d) or a reader created by source function) – Data array or list of data arrays. T or T_i are the number of time steps in a trajectory. When data is given, the PCA is immediately parametrized by estimating the covariance matrix and computing its eigenvectors.
dim (int, optional, default -1) – The number of dimensions (principal components) to project onto. A call to the map function reduces the d-dimensional input to only dim dimensions such that the data preserves the maximum possible variance amongst dim-dimensional linear projections. -1 means all numerically available dimensions will be used unless reduced by var_cutoff. Setting dim to a positive value is exclusive with var_cutoff.
var_cutoff (float in the range [0,1], optional, default 0.95) – Determines the number of output dimensions by including dimensions until their cumulative variance exceeds the fraction var_cutoff. var_cutoff=1.0 means all numerically available dimensions (see epsilon) will be used, unless set by dim. Setting var_cutoff smaller than 1.0 is exclusive with dim.
stride (int, optional, default = 1) – If set to 1, all input data will be used for estimation. Note that this could cause this calculation to be very slow for large data sets. Since molecular dynamics data is usually correlated at short timescales, it is often sufficient to estimate transformations at a longer stride. Note that the stride option in the get_output() function of the returned object is independent, so you can parametrize at a long stride, and still map all frames through the transformer.
mean (ndarray, optional, default None) – Optionally pass pre-calculated means to avoid their re-computation. The shape has to match the input dimension.
skip (int, default=0) – Skip the first skip frames of each trajectory.
chunksize (int, default=None) – Number of data frames to process at once. Choose a higher value to optimize thread usage and gain processing speed. If None is passed, the default value of the underlying reader/data source is used. Choose zero to disable chunking entirely.
- Returns
pca – Object for principal component analysis (PCA). It contains the PCA eigenvalues and eigenvectors, and the projection of the input data onto the dominant principal components.
- Return type
a PCA transformation object
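The call pattern described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the helper name `project_with_pyemma` and its arguments `n_components` and `estimation_stride` are hypothetical, and only `pca`, `get_output`, `eigenvalues` and `eigenvectors` are taken from this page. PyEMMA is imported inside the function so that the helper can be defined without PyEMMA installed.

```python
def project_with_pyemma(data, n_components=2, estimation_stride=10):
    """Estimate PCA on a strided subset of frames, then project all frames.

    `data` is an ndarray (T, d) or a list of ndarrays (T_i, d).
    The helper name and its arguments are illustrative, not part of PyEMMA.
    """
    # Lazy import: PyEMMA is only needed when the helper is actually called.
    import pyemma.coordinates as coor

    # Passing data triggers estimation right away. Setting dim to a positive
    # value means the default var_cutoff is not used to pick the dimension.
    pca_obj = coor.pca(data, dim=n_components, stride=estimation_stride)

    # The stride of get_output() is independent of the estimation stride,
    # so with its default all frames are mapped through the transformer.
    projected = pca_obj.get_output()

    return pca_obj.eigenvalues, pca_obj.eigenvectors, projected
```

A hypothetical call would be `variances, components, projected = project_with_pyemma([traj_a, traj_b])`, where `traj_a` and `traj_b` are arrays with the same number of columns.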
Notes
Given a sequence of multivariate data \(X_t\), computes the mean-free covariance matrix
\[C = (X - \mu)^T (X - \mu)\]
and solves the eigenvalue problem
\[C r_i = \sigma_i r_i,\]
where \(r_i\) are the principal components and \(\sigma_i\) are their respective variances.
When used as a dimension reduction method, the input data is projected onto the dominant principal components.
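The steps in these notes can be written out directly in NumPy. The sketch below is not PyEMMA's implementation. It assumes the covariance matrix is normalized by T - 1 so that the eigenvalues can be read as variances, and the function name `pca_by_eigendecomposition` is made up for illustration.

```python
import numpy as np

def pca_by_eigendecomposition(X, dim):
    """Return variances, principal components and the projection of X."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu                            # mean-free data
    # Assumption: normalize by T - 1 so eigenvalues are sample variances.
    C = Xc.T @ Xc / (X.shape[0] - 1)
    sigma, R = np.linalg.eigh(C)           # eigh returns ascending eigenvalues
    order = np.argsort(sigma)[::-1]        # reorder to descending variance
    sigma, R = sigma[order], R[:, order]
    Y = Xc @ R[:, :dim]                    # project onto the dominant components
    return sigma, R, Y

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])
sigma, R, Y = pca_by_eigendecomposition(X, dim=2)
```

The sample variance of each projected coordinate in `Y` equals the corresponding entry of `sigma`, which is what "their respective variances" refers to.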
See the Wiki page for more theory and references.
class pyemma.coordinates.transform.pca.PCA(*args, **kwargs)
Principal component analysis.
Methods
describe() – Get a descriptive string representation of this class.
dimension() – Output dimension.
estimate(X, **kwargs) – Estimates the model given the data X.
fit(X[, y]) – Estimates parameters, for compatibility with sklearn.
fit_transform(X[, y]) – Fit to data, then transform it.
get_output([dimensions, stride, skip, chunk]) – Maps all input data of this transformer and returns it as an array or list of arrays.
get_params([deep]) – Get parameters for this estimator.
iterator([stride, lag, chunk, …]) – Creates an iterator to stream over the (transformed) data.
load(file_name[, model_name]) – Loads a previously saved PyEMMA object from disk.
n_chunks(chunksize[, stride, skip]) – How many chunks an iterator of this source will output, starting (eg.
n_frames_total([stride, skip]) – Returns the total number of frames.
number_of_trajectories([stride]) – Returns the number of trajectories.
output_type() – By default transformers return single precision floats.
partial_fit(X)
save(file_name[, model_name, overwrite, …]) – Saves the current state of this object to the given file and name.
set_params(**params) – Set the parameters of this estimator.
trajectory_length(itraj[, stride, skip]) – Returns the length of the trajectory with the requested index.
trajectory_lengths([stride, skip]) – Returns the length of each trajectory.
transform(X) – Maps the input data through the transformer to a correspondingly shaped output data array/list.
write_to_csv([filename, extension, …]) – Write all data to csv with numpy.savetxt.
write_to_hdf5(filename[, group, …]) – Writes all data of this Iterable to a given HDF5 file.
Attributes
property cumvar
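To illustrate how a cumulative variance relates to the var_cutoff parameter, here is a small NumPy sketch. It is an assumption about the selection rule rather than a copy of PyEMMA's code: it picks the smallest number of components whose cumulative variance fraction reaches the cutoff. The function name `select_dimension` is hypothetical.

```python
import numpy as np

def select_dimension(eigenvalues, var_cutoff=0.95):
    """Return the cumulative variance fractions and the selected dimension."""
    # Sort variances in descending order, as for principal components.
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumvar = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first cumulative fraction >= var_cutoff, plus one.
    dim = int(np.searchsorted(cumvar, var_cutoff)) + 1
    # Guard against round-off when var_cutoff is 1.0.
    dim = min(dim, len(eigenvalues))
    return cumvar, dim

cumvar, dim = select_dimension([6.0, 3.0, 0.9, 0.1], var_cutoff=0.95)
```

With these example variances the first two components cover 90% of the total variance, so a cutoff of 0.95 requires a third component.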
property data_producer
The data producer for this data source object (can be another data source object).
- Returns
This data source’s data producer.
describe()
Get a descriptive string representation of this class.
dimension()
Output dimension.
property eigenvalues
property eigenvectors
property feature_PC_correlation
Instantaneous correlation matrix between input features and PCs.
Denoting the input features as \(X_i\) and the PCs as \(\theta_j\), the instantaneous, linear correlation between them can be written as
\[\mathbf{Corr}(X_i, \mathbf{\theta}_j) = \frac{1}{\sigma_{X_i}}\sum_l \sigma_{X_iX_l} \mathbf{U}_{lj}\]
The matrix \(\mathbf{U}\) contains, as column vectors, the eigenvectors of the input-feature covariance matrix.
- Returns
feature_PC_correlation – correlation matrix between input features and PCs. There is a row for each feature and a column for each PC.
- Return type
ndarray(n,m)
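The formula above can be evaluated with a few lines of NumPy. The sketch below follows the expression as printed, dividing only by the standard deviation of each input feature, and is not taken from PyEMMA's source. The function name `feature_pc_correlation` and the use of `np.cov` for the covariance estimate are assumptions.

```python
import numpy as np

def feature_pc_correlation(X, dim):
    """Return an (n_features, dim) matrix following the printed formula."""
    X = np.asarray(X, dtype=float)
    C = np.cov(X, rowvar=False)             # input-feature covariance matrix
    sigma, U = np.linalg.eigh(C)            # eigenvectors as columns of U
    U = U[:, np.argsort(sigma)[::-1]][:, :dim]  # keep the dim dominant PCs
    feature_std = np.sqrt(np.diag(C))       # sigma_{X_i}
    # Row i, column j: (1 / sigma_{X_i}) * sum_l C[i, l] * U[l, j]
    return (C @ U) / feature_std[:, np.newaxis]

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
corr = feature_pc_correlation(X, dim=2)
```

The result has one row per input feature and one column per principal component, matching the ndarray(n,m) return type.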
property mean
partial_fit(X)
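This page gives no description for partial_fit. As a general illustration of how a covariance matrix can be estimated from data that arrives in chunks, the sketch below accumulates running sums and forms the covariance at the end. The class `RunningCovariance` is hypothetical and makes no claim about how PyEMMA implements partial_fit internally.

```python
import numpy as np

class RunningCovariance:
    """Accumulate first and second moments chunk by chunk (illustrative only)."""

    def __init__(self, n_features):
        self.n = 0
        self.sum_x = np.zeros(n_features)
        self.sum_xx = np.zeros((n_features, n_features))

    def partial_fit(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        self.n += chunk.shape[0]
        self.sum_x += chunk.sum(axis=0)
        self.sum_xx += chunk.T @ chunk
        return self

    def covariance(self):
        mean = self.sum_x / self.n
        # Mean-free covariance from the accumulated sums, normalized by n - 1.
        return (self.sum_xx - self.n * np.outer(mean, mean)) / (self.n - 1)

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
acc = RunningCovariance(n_features=3)
for start in range(0, len(X), 100):        # feed the data in chunks of 100 frames
    acc.partial_fit(X[start:start + 100])
C_stream = acc.covariance()
```

The accumulated result agrees with the covariance computed from all frames at once, so the eigenvalue problem can be solved after the last chunk has been seen.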
References
[1] Pearson, K. 1901. On Lines and Planes of Closest Fit to Systems of Points in Space. Phil. Mag. 2, 559-572.
[2] Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Edu. Psych. 24, 417-441 and 498-520.