This vignette is based upon LearnPCA version 0.3.4.

LearnPCA provides several vignettes. To access them from R, simply type `browseVignettes("LearnPCA")` to get a clickable list in a browser window. Vignettes are available in both pdf (on CRAN) and html formats (at Github).
We will work with three data sets here:
PCA is conducted on data sets composed of:
The purpose of PCA is data reduction, which we hope may lead to better insights into the data and to simpler models of that data. Data reduction refers to the goal of:
What does one get from PCA?
These plots will be explained further in the next section.^3^

Other things to know about PCA before going further:
This section is intended to illustrate the concepts of PCA, and how to interpret the plots that arise from PCA.
We’ll use a data set which reports chemical analyses for 13 elements on 180 archaeological glass artifacts from a study that hoped to determine the origin of the artifacts. The full data set consists of 180 rows and 13 columns; Table @ref(tab:dataTaste) shows five rows and eight of the columns.^4^
Na2O | MgO | Al2O3 | SiO2 | P2O5 | SO3 | Cl | K2O |
---|---|---|---|---|---|---|---|
13.904 | 2.244 | 1.312 | 67.752 | 0.884 | 0.052 | 0.936 | 3.044 |
14.194 | 2.184 | 1.310 | 67.076 | 0.938 | 0.024 | 0.966 | 3.396 |
14.668 | 3.034 | 1.362 | 63.254 | 0.988 | 0.064 | 0.886 | 2.828 |
14.800 | 2.455 | 1.385 | 63.790 | 1.200 | 0.115 | 0.988 | 2.878 |
14.078 | 2.480 | 1.072 | 68.768 | 0.682 | 0.070 | 0.966 | 2.402 |
We’ll perform PCA on the glass data set, show the three plots and then discuss them in turn. Figure @ref(fig:glassScree) shows the scree plot, Figure @ref(fig:glassScores) shows the scores plot and Figure @ref(fig:glassLoadings) shows the loadings plot for PC 1.
Scree plot from PCA on the glass data set.
Figure @ref(fig:glassScree), the scree plot, shows the amount of variance in the data set explained by, in this case, each of the first 10 principal components (PCs are along the x axis, from 1 to 10).^5^ Variance is a measure of the spread of points around the origin of whatever coordinate system is in use. Think of it as a measure of the scattering of the samples.^6^ To interpret this plot, we look for the point at which the height of the bars suddenly levels off. In this case, the first three PCs drop steadily downward, but from PC 4 onward there is little additional variance that can be explained. We would say that three PCs are enough to explain this data set. In other words, the original 13 variables have been reduced to three, which is a great simplification.
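The quantities behind a scree plot can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated (they are not the glass data), the variable names `v1` to `v6` are made up, and we assume centering without scaling, which may differ from the choices made for the figures in this vignette.

```r
# Simulated stand-in for a samples-by-variables table (NOT the glass data):
# 50 samples, 6 variables, with most of the spread along two directions
set.seed(1)
n <- 50
latent1 <- rnorm(n, sd = 3)
latent2 <- rnorm(n, sd = 1.5)
X <- cbind(
  v1 = latent1 + rnorm(n, sd = 0.2),
  v2 = latent1 + rnorm(n, sd = 0.2),
  v3 = -latent1 + rnorm(n, sd = 0.2),
  v4 = latent2 + rnorm(n, sd = 0.2),
  v5 = latent2 + rnorm(n, sd = 0.2),
  v6 = rnorm(n, sd = 0.2)
)

# Assumption: center the variables but do not scale them
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# The variance carried by each PC is the square of its standard deviation
pc_var <- pca$sdev^2
pct_var <- 100 * pc_var / sum(pc_var)

# Scree plot: one bar per PC, tallest first
barplot(pct_var, names.arg = seq_along(pct_var),
        xlab = "principal component", ylab = "percent variance explained")
```

Because the simulated variables were built from two underlying directions, one would expect the bars to level off after the second PC, in the same way the glass data level off after the third.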
Score plot from PCA on the glass data set.
In Figure @ref(fig:glassScores) one sees the scores for PC 1 plotted against the scores for PC 2. There are 180 points in this plot because there is one point per sample (put another way, every sample has a score value for PC 1 and for PC 2). This plot is interpreted by looking for clustering of samples, as well as for samples that are outliers, off by themselves. To our eyes there are 3 to 5 clusters here; none of the samples is an obvious outlier. Later we’ll discuss how we can explore this further.
We could also plot PC 1 against PC 3, or PC 2 against PC 3. These might show different clustering and separation of samples, but are not shown here. There wouldn’t be much point in plotting PC 4 or higher, as these are mostly noise, as established by the scree plot (Figure @ref(fig:glassScree)).
Loadings plot for PC 1 from PCA on the glass data set.
A loadings plot, Figure @ref(fig:glassLoadings), shows how much each measured variable contributes to one of the principal components and hence the separation of samples (in this case we show the loadings for PC 1). We see that three elements have large loadings, and the other elements contribute little to the separation. We would say separation along PC 1 is driven largely and collectively by the results for Na2O, SiO2 and CaO, which are the most abundant elements in most glasses.^7^ The first PC should be interpreted as a composite of these variables: they have been collapsed into one new variable, PC 1.
This ability to collapse correlated variables is a key part of PCA.
Table @ref(tab:elementCor) shows the correlations between these elements in the raw glass data set. We can see that the correlation between Na2O and SiO2 is positive, but the correlation between either of these elements and CaO is negative. The loading plot, Figure @ref(fig:glassLoadings), reflects this: Na2O and SiO2 contribute in the opposite direction to CaO.^8^
| Na2O | SiO2 | CaO |
---|---|---|---|
Na2O | 1.00 | 0.45 | -0.58 |
SiO2 | 0.45 | 1.00 | -0.89 |
CaO | -0.58 | -0.89 | 1.00 |
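The link between the signs of the correlations and the signs of the loadings can be checked with a small sketch. We take the three-by-three correlation table above and compute its leading eigenvector, which plays the role of the PC 1 loadings for a PCA on just these three scaled variables. This is an assumption-laden toy: the real analysis used all 13 variables and may not have scaled them, so the numbers will not match Figure @ref(fig:glassLoadings); only the sign pattern is the point.

```r
# Correlations copied from the table above
elements <- c("Na2O", "SiO2", "CaO")
R <- matrix(c( 1.00,  0.45, -0.58,
               0.45,  1.00, -0.89,
              -0.58, -0.89,  1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(elements, elements))

# Eigenvectors of a correlation matrix are the loadings of a PCA on the
# scaled variables; the first column belongs to PC 1
e <- eigen(R, symmetric = TRUE)
pc1 <- setNames(e$vectors[, 1], elements)

# The overall sign of an eigenvector is arbitrary, so orient it so that
# Na2O is positive before comparing the elements
pc1 <- pc1 * sign(pc1["Na2O"])
print(round(pc1, 2))
```

With this orientation Na2O and SiO2 come out positive and CaO negative, mirroring the pattern described for the loading plot.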
Rather than relying on a scree plot to determine the number of PCs that are important, we can present the same information in a table (Table @ref(tab:screeTable)). A general rule of thumb says to keep enough PCs to account for 95% of the variance. The table leads us to the same conclusion as the scree plot: keep three PCs, since the cumulative variance first passes 95% at PC 3 (98%).
component | variance (%) | cumulative (%) |
---|---|---|
PC 1 | 64 | 64 |
PC 2 | 27 | 91 |
PC 3 | 7 | 98 |
PC 4 | 1 | 99 |
PC 5 | 0 | 100 |
PC 6 | 0 | 100 |
PC 7 | 0 | 100 |
PC 8 | 0 | 100 |
PC 9 | 0 | 100 |
PC 10 | 0 | 100 |
PC 11 | 0 | 100 |
PC 12 | 0 | 100 |
PC 13 | 0 | 100 |
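The 95% rule of thumb is simple arithmetic on the variance column, sketched below. The percentages are copied from the table above; because they are rounded to whole numbers, the running total computed here can differ by a point from the cumulative column in the table. The names `threshold` and `n_keep` are ours, chosen for illustration.

```r
# Percent variance per PC, copied (rounded) from the table above
pct <- c(64, 27, 7, 1, rep(0, 9))
names(pct) <- paste("PC", seq_along(pct))

# Running total of the variance explained
cumulative <- cumsum(pct)

# Smallest number of PCs whose cumulative variance reaches the threshold
threshold <- 95
n_keep <- which(cumulative >= threshold)[1]
print(n_keep)
```

Changing `threshold` shows how sensitive the choice is: a 90% cutoff would keep only two PCs for these data, since PC 1 and PC 2 together account for 91%.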
The mathematics of PCA do not take into account anything about the samples other than the measured variables. However, the researcher may well know something about the samples; for instance, they may fall into groups based on their origin. If this is the case, the points on the score plot can be colored according to the group, which may aid significantly in the interpretation. Luckily, we can do this for the glass data set: the samples are known to come from four separate sites. We’ll re-do the score plot with colors corresponding to the known groups (Figure @ref(fig:glassScores2)).
Score plot from PCA on the glass data set, with groups color-coded.
With this figure, we can see that the large group in the lower left corner (in black), which to our eyes might have been two groups, is composed of related samples.
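As a sketch of how a color-coded score plot might be produced in base R, the example below runs PCA on simulated data and only afterwards uses the group labels to color the points. Everything here is hypothetical: the data are simulated rather than taken from the glass data set, and the site labels are invented.

```r
set.seed(2)
# Hypothetical data: 4 groups of 15 samples, 5 variables, each group
# shifted by its own offset (a stand-in, NOT the glass data)
group <- factor(rep(c("site A", "site B", "site C", "site D"), each = 15))
centers <- matrix(rnorm(4 * 5, sd = 3), nrow = 4)
X <- centers[as.integer(group), ] + matrix(rnorm(60 * 5, sd = 0.5), nrow = 60)

pca <- prcomp(X)
scores <- pca$x   # one row per sample, one column per PC

# PCA never saw `group`; it is used only to color the points afterwards
plot(scores[, 1], scores[, 2], col = as.integer(group), pch = 19,
     xlab = "PC 1 score", ylab = "PC 2 score")
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)
```

Swapping the column indices, for example `scores[, 1]` against `scores[, 3]`, gives the other pairings of PCs.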
The archaeological glass data set has the advantage of only having a few variables, the percentages of the 13 elements in the glass artifacts. If we move to a spectroscopic data set, the number of variables goes up dramatically. A UV-Vis data set typically would have a few hundred to a thousand wavelength variables, an IR data set perhaps a few thousand data points, and a 1D NMR data set would typically have 16K or more data points. As far as PCA is concerned, in these cases the scree plot and score plot do not change in appearance or interpretation.
However, the loading plot changes appearance dramatically. This is because with hundreds to thousands of variables, one would not create a loading plot based on a bar chart (Figure @ref(fig:glassLoadings) is a bar chart). Instead, the loading plot with many variables looks like a spectrum! While the appearance is different, the interpretation is the same as for when there are only a few variables.
Let’s illustrate with an IR data set. We’ll use a data set included with the ChemoSpec package. This is a set of IR spectra of plant oils, which are mixtures of triglycerides (also called triacylglycerols; these are esters of fatty acids) and free fatty acids. Figure @ref(fig:IRSpectrum) shows a typical spectrum from the data set.^9^
Spectrum 1 from the IR data set.
Next, we’ll carry out PCA as before, and show the scree plot (Figure @ref(fig:IRScree)) and the score plot (Figure @ref(fig:IRScores)). These appear much like the corresponding plots for the glass data set, and are interpreted in the same manner. In this case, however, PC 1 is pretty much all that is needed to understand the data set, a fact reflected in the scree plot and the comparatively small range of the scores along PC 2 in the score plot.
However, the loadings plot, Figure @ref(fig:IRLoadings), looks a lot like a spectrum, because it has 1868 data points with a meaningful order—an organized set of wavenumbers—and is plotted as a connected scatter plot and not as a bar chart (which would be very difficult to read).
Scree plot from PCA on the IR data set.
Score plot from PCA on the IR data set.
Loadings plot for PC 1 from PCA on the IR data set.
Let’s zoom in on the carbonyl region of the loadings plot. This region shows the contributions of various C=O (carbonyl) bonds in the structure. Figure @ref(fig:IRLoadings2) shows the original spectrum in red, for reference, and the loadings in black. One can see that the ester carbonyl peak around 1745 cm⁻¹ contributes positively to PC 1, while the carboxylic acid carbonyl peak at about 1705 cm⁻¹ contributes negatively.
Finally, to make the point that the loading plot for many variables is really the same as the loading plot for just a few variables, Figure @ref(fig:IRLoadings3) shows the carbonyl loadings as a bar plot with super narrow bars. If one connects the tips of the bars together, one gets the previous plot.^10^
Loadings plot for PC 1 from PCA on the IR data set, carbonyl region. Reference spectrum shown in red.
Loadings plot for PC 1 from PCA on the IR data set, carbonyl region, shown as a bar plot.
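The equivalence of the two loading plots can be sketched with simulated spectra. This is a toy under stated assumptions, not the ChemoSpec data: we invent 20 spectra on a wavenumber grid, each a mixture of a Gaussian "ester" band near 1745 and a Gaussian "acid" band near 1705, with the two fractions summing to one. The helper `band()` and all other names are ours.

```r
set.seed(3)
# Hypothetical spectra (NOT the ChemoSpec data): 20 samples on a wavenumber
# grid, each a mix of an "ester" band near 1745 and an "acid" band near 1705
wavenumber <- seq(1650, 1800, by = 1)
band <- function(center, width = 8) {
  exp(-(wavenumber - center)^2 / (2 * width^2))
}
frac_ester <- runif(20, 0.2, 0.8)   # the acid fraction is the remainder
spectra <- outer(frac_ester, band(1745)) +
  outer(1 - frac_ester, band(1705)) +
  matrix(rnorm(20 * length(wavenumber), sd = 0.005), nrow = 20)

pca <- prcomp(spectra)
loadings_pc1 <- pca$rotation[, 1]   # one loading per wavenumber

# Same numbers drawn two ways: a connected line, then very narrow bars
op <- par(mfrow = c(2, 1))
plot(wavenumber, loadings_pc1, type = "l",
     xlab = "wavenumber", ylab = "PC 1 loading")
plot(wavenumber, loadings_pc1, type = "h",
     xlab = "wavenumber", ylab = "PC 1 loading")
par(op)
```

Because the two simulated bands trade off against each other, their loadings on PC 1 have opposite signs. Which band is positive depends on the arbitrary sign of the PC.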
In addition to references and links in this document, please see the Works Consulted section of the Start Here vignette for general background.
1. Professor of Chemistry & Biochemistry, DePauw University, Greencastle IN USA.
2. Professor Emeritus of Chemistry & Biochemistry, DePauw University, Greencastle IN USA.
3. There is another plot, the “biplot”, which is sometimes encountered. This plot will be dealt with in a separate document.
4. This is the glass data set in package chemometrics. The elements analyzed were Na2O, MgO, Al2O3, SiO2, P2O5, SO3, Cl, K2O, CaO, MnO, Fe2O3, BaO, and PbO. With the exception of chlorine, the elements are reported as their oxides; all values are weight percents.
5. Because there are 13 variables, the most PCs one could have is 13. In theory, keeping all 13 PCs perfectly reproduces the original data set.
6. If the coordinate system is well-chosen, then the spread of points along an axis represents signal rather than noise.
7. If you knew this would be the result ahead of time, you probably would not have taken the time and expense to analyze the uninformative elements. However, we haven’t looked at PC 2 or PC 3 so this conclusion is premature.
8. If one were to look at the correlations between all the elements in the glass data set, one would find that other elements correlate positively with Na2O, not just SiO2. However, what PCA has done here is find the unique pattern of these three elements tracking each other in the new coordinate system, as seen in the loading plot.
9. Plots in this vignette are deliberately made rather plain to focus on the data and to be consistent for ease-of-comparison. If spectroscopy is your thing, package ChemoSpec makes much more polished plots.
10. One can also see here that the individual frequencies making up a peak are highly correlated, as they rise and fall together.