Principal Component Analysis (PCA) reduces the dimensionality of a data set that contains a large number of interrelated (that is, correlated) variables, while retaining as much as possible of the variation present in the data set. In mass spectrometry, the data set consists of the mass spectra of different compounds. Each mass spectrum is expressed as the intensities at individual m/z values, which serve as the variables.

PCA attempts to find a new coordinate system, expressed as linear combinations of the original variables (m/z), that describes the major trends in the data. Mathematically, PCA relies on eigenvalue/eigenvector decomposition of the covariance or the correlation matrix of the original variables. PCA decomposes the data matrix X into the product of two matrices: P (the matrix of new coordinates of the data points) and $T^{\mathsf{T}}$ (the transpose of T, the matrix of coefficients of the linear combinations of the original variables):

$$X = P\,T^{\mathsf{T}}$$
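The decomposition can be illustrated with a minimal NumPy sketch. It assumes that rows of X are spectra and columns are m/z variables, that the covariance matrix (not the correlation matrix) is used, and that X is mean-centered first, so the column means are added back when reconstructing. The function name pca_decompose and the random data are hypothetical and serve only as an illustration.

```python
import numpy as np

def pca_decompose(X):
    """Return (P, T, eigenvalues, mean) for X (rows = spectra, columns = m/z variables)."""
    mean = X.mean(axis=0)
    Xc = X - mean                            # mean-center each variable
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest eigenvalue comes first
    eigvals = eigvals[order]
    T = eigvecs[:, order]                    # columns hold the coefficients of each combination
    P = Xc @ T                               # new coordinates of the data points
    return P, T, eigvals, mean

# Synthetic stand-in for real data: 6 spectra, 4 m/z variables (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

P, T, eigvals, mean = pca_decompose(X)
reconstructed = P @ T.T + mean               # X = P T^T, plus the removed column means
```

When all components are kept, as here, the product of P and the transpose of T reproduces the centered data exactly (up to floating-point error); dropping trailing columns of P and T gives an approximation.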

Generally, the data can be adequately described using far fewer coordinates (principal components) than original variables. PCA therefore serves as both a data-reduction method and a visualization tool. When the data points are plotted in the new coordinate system, the relationships and clusters are often more apparent than when the data points are plotted with the original coordinates.
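As a rough illustration of data reduction, the sketch below counts how many principal components are needed to capture a chosen share of the total variance. The data are synthetic (50 m/z variables driven by two hidden factors plus a little noise), the 95% threshold is an arbitrary assumption, and the singular value decomposition is used as an equivalent route to the eigen decomposition of the covariance matrix.

```python
import numpy as np

# Synthetic data (assumption): 20 spectra, 50 m/z variables, 2 underlying factors.
rng = np.random.default_rng(1)
n_spectra, n_mz = 20, 50
latent = rng.normal(size=(n_spectra, 2))
mixing = rng.normal(size=(2, n_mz))
X = latent @ mixing + 0.05 * rng.normal(size=(n_spectra, n_mz))

Xc = X - X.mean(axis=0)                          # mean-center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)                  # fraction of variance per component
cumulative = np.cumsum(explained)

threshold = 0.95                                 # arbitrary cut-off for this sketch
k = int(np.searchsorted(cumulative, threshold) + 1)

scores = Xc @ Vt[:k].T                           # coordinates in the first k components
print(f"{k} of {n_mz} coordinates retain {cumulative[k - 1]:.1%} of the variance")
```

Plotting the first two columns of scores against each other gives the kind of low-dimensional view in which clusters tend to become visible.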

Geometrical interpretation of PCA: The axes of the new coordinate system, principal components p1 and p2, are created as linear combinations of the original axes. The new axes, the principal components (PCs), are orthogonal (perpendicular) to each other. In the two-variable case, there is greater variation in the direction of p1 than along either of the original variables, but very little variation in the direction of p2. For data sets with more than two variables, the first PC describes the direction of the greatest variation in the data set, the second PC describes the direction of the second greatest variation, and so on.
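The two-variable picture can be checked numerically. The sketch below builds two strongly correlated variables (synthetic data, chosen only for illustration), finds p1 and p2 from the covariance matrix, and compares the variance along each new axis with the variance of the original variables. The variable names are hypothetical.

```python
import numpy as np

# Two correlated variables sharing a common trend t (synthetic assumption).
rng = np.random.default_rng(2)
t = rng.normal(size=200)
x1 = t + 0.1 * rng.normal(size=200)
x2 = t + 0.1 * rng.normal(size=200)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)

# eigh returns eigenvalues in ascending order, so p1 is the last column.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
p1_axis, p2_axis = eigvecs[:, 1], eigvecs[:, 0]
var_p1, var_p2 = eigvals[1], eigvals[0]

var_original = Xc.var(axis=0, ddof=1)      # variance of x1 and of x2

print("variance of x1, x2:   ", var_original)
print("variance along p1, p2:", var_p1, var_p2)
print("p1 . p2 =", float(p1_axis @ p2_axis))
```

The total variance is unchanged by the rotation (var_p1 + var_p2 equals the sum of the original variances); PCA only redistributes it so that p1 carries as much as possible.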