This document is intended to give an intuitive sense of how PCA works, and why certain pre-processing steps are necessary. Many implementations of PCA provide options to perform these pre-processing steps.
All analysis here is performed in R, with the help of several packages.
I will be using the mtcars dataset built-in to R. For our purposes, we will only be looking at 3 variables: weight (wt), horsepower (hp), and fuel efficiency (mpg). Here is what they look like.
Let’s assume that the goal of this is to create a model that predicts mpg based on weight and power.
Keeping the goal in mind, one approach would be to create a simple linear regression.
\(y = mx + b\)
Where:
- \(y\) is the variable we want to predict (mpg)
- \(x\) is the predictor (weight or power)

Since we have 2 predictors (weight and power), we could make 2 separate linear regressions and choose whichever one has a better fit to be our predictor.
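As a minimal sketch (not necessarily the exact code used to produce the fits shown here), these two regressions could be made with base R’s lm():

```r
# Two separate simple linear regressions, one per predictor
fit_wt <- lm(mpg ~ wt, data = mtcars)
fit_hp <- lm(mpg ~ hp, data = mtcars)

# Compare goodness of fit
summary(fit_wt)$r.squared
summary(fit_hp)$r.squared
```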
Between these two regressions, weight looks like it may be a better predictor than power based on \(r^2\).
Wouldn’t it be nice to have a model that creates a regression on both at the same time?
Multilinear regression (MLR) does exactly this. The formula in our case would be: \(y=m_1x_1 + m_2x_2 + b\)
Where:
- \(y\) is mpg
- \(x_1\) is weight
- \(m_1\) is the slope for weight
- \(x_2\) is power
- \(m_2\) is the slope for power
After performing this fit, here is the resulting MLR.
\(y = -3.59 \cdot weight - 0.03 \cdot hp + 36.59\)

\(r^2 = 0.8\)
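As a sketch of how such a fit could be performed (this may not reproduce the exact coefficients shown above, depending on any pre-processing applied):

```r
# Multilinear regression: predict mpg from both weight and horsepower
fit_mlr <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_mlr)  # coefficients and r squared
```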
We can see that this is a better model than either of the individual linear regressions.
But imagine if you have a very large number of predictors. There would have to be a term for each one. This is one of many reasons why we may want to perform dimension reduction. This will reduce the number of predictors so that the regression model is simplified.
Takeaways
Although our dataset already has very few predictors (2), it can be used to illustrate the logic of dimension reduction very well.
So, let’s go about reducing the number of predictors from 2 to 1.
One way that we can see the correlation between our two predictors is with least-squares regression. This is what we generally mean when we say “linear regression”, and is the method that was used in the previous section on MLR.
Least-squares regression finds the line that minimizes the magnitude of the (squared) errors. But it is important to note that it defines errors based only on the \(y\) values, i.e. the difference between the actual \(y\) values and the predicted \(y\) values. These errors can be visualized as vertical lines.
While the fact that least-squares regression is based on “vertical” errors may seem subtle, it can have certain implications, as we will see later.
Getting back to the objective: reducing our 2 predictors to 1.
This may not be an immediately obvious choice, but what if we converted our 2 predictors into 1 by taking the regression line to be a new axis?
Each point could then be represented by a single number: how far along the regression line that point lies.
To see how far along the regression line a point lies, we can draw a line segment that is perpendicular to the regression line, connecting the point to the regression line.
Before we try this with our dataset, let me illustrate the idea with some made up data.
In this case, we could ignore our original 2 variables and instead use a simplified variable: the distance along the regression line where each point projects, as indicated by the blue lines. Since the original variables (\(x\) and \(y\)) were correlated, this new variable still captures the information pretty well.
One thing to note though is that not all information is preserved. For instance the points at \((1.28, 1.95)\) and \((1.63, 1.54)\) project to nearly the same point on the regression line. So with just this 1 dimension, the difference between these points is greatly reduced.
With this ideal example in mind, let’s do the same to our car data. Let’s add perpendicular lines connecting the regression line and the points to illustrate the new axis.
You may think, “Those are not perpendicular to the regression line!”. They actually are, but because the scales of the 2 axes are so different, they don’t look like it.
We can visualize the same plot with both axes on the same scale, but then we can’t even see the lines. The magnitude of our vertical axis is just so much greater than that of our horizontal axis.
Looking at the last plot, we can see that if we use this technique, then when we project the points to the regression line, the variable with smaller magnitude doesn’t really seem to matter. Effectively, our new axis based on the regression line is pretty much the same as just using the power or \(y\) axis as our predictor.
The reality of this can be seen in this way: the regression line and the vertical axis are pretty much parallel. So, projecting to the regression line is about the same as projecting just to the vertical axis and ignoring the other variable (weight).
So if we were to use this regression line as our new axis for dimension reduction, we would pretty much just be taking our new variable to be the power and ignoring the weight variable.
While the above commentary may make it seem like this approach is doomed, it actually shows something very different. Yes, it is true that this method is sensitive to the scale of the predictors, but the scale is actually quite arbitrary.
Let’s look at the weight axis in the above plots. As noted, these are in units of 1000 lb. So a weight of 5 is actually 5000 lb, for example.
I can easily adjust my data so that the \(x\) axis is in units of lb, which is completely valid. But this certainly affects the scale of the axis, and also our perception of the projection to the regression line.
Now, instead of the lines that are perpendicular to the regression line looking horizontal, they look vertical. This is because the \(x\) axis now has greater magnitude.
Additionally, the regression line is more parallel to the \(x\) axis. So if we were to project onto this regression line, our new variable would mostly reflect the weight as a predictor.
Takeaways:
The reason that the projection of the fake data to the new axis (the regression line) worked so well is that I intentionally gave each variable a similar scale (0 to 5).
So, in order to make this proposed method of dimension reduction work, we need to ensure that each variable is equally scaled. We can rest assured that this is okay, since we have already demonstrated that scale is arbitrary (\(1000\ lb\) vs \(lb\)).
We could, for instance, pick a value for each variable by which we would divide each value. By choosing the right number for each, they could have the same scale. But how would you choose this value? Trial and error? Spray and pray?
There is actually a more robust method of rescaling. We can divide the values of each variable by that variable’s standard deviation. If the values of each variable are at least kind of normally distributed, they will have similar scale. Let’s take a look at the results of this.
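As a sketch of this rescaling (the names hp_norm and wt_norm match the covariance matrix shown later; the exact code here is my own):

```r
# Rescale each predictor by dividing by its standard deviation
hp_norm <- mtcars$hp / sd(mtcars$hp)
wt_norm <- mtcars$wt / sd(mtcars$wt)
```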
Now that the data are rescaled, the projections to the regression line are looking good!
Going back to the overall goal: we are trying to create a regression model to predict mpg based on a single variable that captures the information of the 2 original variables weight and power.
To do this, I will create a new variable which I will call reg, which is the distance along the regression line where each point projects. This can be seen as where the blue lines touch the regression line in the plot above.
For now, I am doing this based on geometry (thank you Pythagoras).
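Here is a rough sketch of that geometry in R, assuming (as in the plots) that wt_norm is on the horizontal axis and hp_norm on the vertical axis; the helper names are my own:

```r
# Ordinary least-squares regression of the rescaled variables
fit <- lm(hp_norm ~ wt_norm)
b <- coef(fit)[1]  # intercept
m <- coef(fit)[2]  # slope

# Foot of the perpendicular from each point (wt_norm, hp_norm) to the line y = m*x + b
x0 <- (wt_norm + m * (hp_norm - b)) / (1 + m^2)
y0 <- m * x0 + b

# Signed distance along the regression line from where it crosses x = 0 (Pythagoras)
reg <- sign(x0) * sqrt(x0^2 + (y0 - b)^2)
```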
So now, each point is represented by the single variable reg. We can plot this variable against mpg, since this is the variable we want to predict based on our new single predictor variable, reg.
We can now perform a linear regression on the above plot to see how well our new variable predicts fuel efficiency.
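As a sketch (using the reg variable computed above):

```r
# Regress fuel efficiency on the single projected variable
fit_reg <- lm(mtcars$mpg ~ reg)
summary(fit_reg)$r.squared
```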
It is actually impressive! We have reduced the number of predictor variables, yet the correlation is the same as or stronger than:
- power (\(r^2=0.625\))
- weight (\(r^2=0.703\))
- weight and hp (\(r^2=0.8\))

Of course, this is only a single observation, so we can’t be sure if the difference is significant. Still, it is interesting.
Takeaways
To review: our new variable reg is the projection of our data onto a linear regression line of the re-scaled weight and hp variables. This regression was performed using ordinary least-squares, which minimizes the errors in the \(y\) direction.
So far, we have only looked at the single new variable, represented by the projection. But even in this new coordinate system, we can look at the data as having 2 dimensions: the new \(x\) axis is the regression line itself (along which reg is measured), and the new \(y\) axis is perpendicular to it.
The new \(y\) axis coordinate can be seen as the distance a point is from the new \(x\) axis, the regression line. These distances are shown in blue here.
We can plot the same points in this new rotated coordinate system.
I will now refer to the regression axis as reg1 and the perpendicular axis as reg2.
It may (or may not) seem odd that these new variables are still correlated!
The strangeness may be more apparent when you think about why we took this approach in the first place. We realized our variables were correlated, so a projection to a regression line summarized both variables pretty well.
Since they’re still correlated, we could do this same process again. And again..
Clearly, though, these variables are less correlated than the original variables (weight and hp). We would expect this to be the case because the regression on the original variables should have had points distributed evenly above and below the regression line. After all, isn’t that the whole idea of regression?
So given that last fact, why are they still correlated?
It comes down to how we performed the regression. As mentioned, we used ordinary least-squares regression, which minimizes the vertical errors. By contrast, it did not minimize the real (perpendicular) distances from the regression line.
This is problematic in a few ways:
- The regression line depends on which variable is placed on which axis, so the resulting new axis is somewhat arbitrary.
- Because only the vertical errors are minimized, the projected variables (reg1 and reg2) remain correlated.
One way around this is to perform a different kind of linear regression. Orthogonal (a.k.a. total least-squares or Deming) regression minimizes the errors orthogonal (perpendicular) to the regression line.
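One way to obtain such a line, sketched here with base R (this uses the leading eigenvector of the covariance matrix, which anticipates the eigen-based approach discussed later; it is not necessarily the code used for the plots below):

```r
# Orthogonal (total least-squares) regression of hp_norm on wt_norm:
# the line follows the leading eigenvector of the covariance matrix
# and passes through the centroid of the data
v <- eigen(cov(cbind(wt_norm, hp_norm)))$vectors[, 1]
slope_ortho <- v[2] / v[1]
intercept_ortho <- mean(hp_norm) - slope_ortho * mean(wt_norm)
```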
Below, I have shown both types of regression. The errors which are minimized for each method are shown in blue. I have also performed the projection to the different regression lines, and plotted the data in the new coordinate system.
It is apparent that the regression lines are very different in each case.
We can see that the new coordinates based on orthogonal regression are uncorrelated as indicated by the horizontal regression line and \(r^2=0\). The variables based on ordinary regression are only weakly correlated, but the regression line is obviously not horizontal.
Now that we have been able to find a better regression model to project to, let’s see how this new variable predicts fuel efficiency. For the regressions shown below, ordinary least-squares regression is used at this point, since we are trying to predict fuel efficiency based on our new variable; ortho and ordinary in the titles indicate the regression method used for generating the independent variable.
The correlation seems to be about the same as the variable made from ordinary regression projection. While there is no obvious improvement, all the previous commentary on the benefit of orthogonal regression still holds.
Takeaways
So far we have explored 2 methods of finding a new axis to which we can project: ordinary least-squares regression and orthogonal regression.
We began with regressions since these intuitively follow directions of variance. However, a more efficient method is to find the eigenvectors of the covariance matrix between the variables. These eigenvectors will serve as the new axes for data projection, similar to how the regression lines were used.
This process is effectively the same as the orthogonal regression projection, but is much simpler to perform, since matrix operations are more concise than performing regression and projection based on geometries.
Since it is pretty much the same process, we will still be computing on the scaled data, as this method is still sensitive to variable magnitude.
Firstly, we compute the covariance matrix of our variables. For two variables, that is:
\[\begin{bmatrix}Cov(x,x) & Cov(x,y)\\ Cov(y,x) & Cov(y,y)\end{bmatrix}\]
The main diagonal contains the variances of the individual variables since \(Cov(a,a)=Var(a)\), while off-diagonal values are the covariances between variables. Notice that each individual variable has a variance of 1. This is because we have already divided by the standard deviation during rescaling.
        hp_norm wt_norm
hp_norm    1.00    0.66
wt_norm    0.66    1.00
Next, we calculate the eigen-decomposition of the covariance matrix. This gives two eigenvalues, each associated with an eigenvector. The number of eigenvalues and eigenvectors is always equal to the number of input variables.
The eigenvector associated with the largest eigenvalue points in the direction of maximum variance, which is analogous to our orthogonal regression line.
eigen() decomposition
$values
[1] 1.6647411 0.3352589

$vectors
          [,1]       [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068  0.7071068
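For reference, a sketch of how the covariance matrix and eigen-decomposition above might be computed (using the rescaled variables from earlier):

```r
# Covariance matrix of the rescaled predictors
cov_mat <- cov(cbind(hp_norm, wt_norm))
round(cov_mat, 2)

# Eigen-decomposition of the covariance matrix
eigen(cov_mat)
```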
Another great thing about this method over orthogonal regression projection is that we now have more information about our transformation. For instance, we can see the amount of variance explained by each of our new axes.
The total variance is given by the sum of the eigenvalues (\(\lambda\)): \[total\ variance\ = \sum_{i=1}^{N} \lambda_i\] So, the portion of the total variance “explained” by each of our new axes associated with each eigenvalue is given as \[portion\ of\ variance = \frac{\lambda}{total\ variance}\]
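For example, using the eigenvalues above, the first new axis explains \(\frac{1.665}{1.665 + 0.335} \approx 0.83\), or about 83% of the total variance, while the second explains the remaining 17%.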
Now that we have the eigen-decomposition, we can project to our new axes. This again is done with matrix operations as:
\[New\ data=Eigen\ vectors^T*Original\ data^T\]
I will call our new axes \(eig1\) and \(eig2\), not to be confused with the eigenvalues \(\lambda_1\) and \(\lambda_2\).
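A sketch of this projection in R (building on the objects defined above; the orientation of rows and columns is one reasonable choice, not necessarily the one used for the plots):

```r
# Project the rescaled data onto the eigenvector axes
scaled_data <- cbind(hp_norm, wt_norm)
eig <- eigen(cov(scaled_data))

# New data = Eigenvectors^T * Original data^T, transposed back so rows are cars
projected <- t(t(eig$vectors) %*% t(scaled_data))
colnames(projected) <- c("eig1", "eig2")
head(projected)
```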
Notice how our new feature space created through eigenvector projection is qualitatively identical to the orthogonal regression projection. While the center and scale of the axes are not identical, we can see they are perfectly correlated.
This shows that performing projection to an orthogonal regression line is essentially the same as projecting to the eigenvectors. So while eigenvectors and matrix operations may sound confusing, as long as the orthogonal regression projection makes sense, you can think of the eigenvector projection as just another mathematical way to do the same thing.
Now, for the big reveal you have all been waiting for: What we just did is … drumroll please … principal component analysis!
Orthogonal regression line = dominant eigenvector direction = first principal axis
Orthogonal residual direction = secondary eigenvector direction = second principal axis
So, behind the scenes of these fancy principal component functions, they are really just projecting to the directions given by the eigenvectors, which is pretty much the same as our orthogonal regression projections.
So, let’s take a look at the similarities between our eigenvector projections and the principal components found using the R function, prcomp().
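A sketch of the prcomp() side of that comparison (note that prcomp() centers the data by default, and scale. = TRUE additionally divides by the standard deviation; these defaults become important below):

```r
# Principal components of the two predictors via prcomp()
pca <- prcomp(mtcars[, c("hp", "wt")], scale. = TRUE)
head(pca$x)  # PC1 and PC2 scores for each car
```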
Uh-oh. Why are they not perfectly correlated?
So far, we have performed re-scaling as a pre-processing step, because we saw the undesirable effects that occur when our variables have different magnitudes.
Now we are seeing another effect that our variables are having on PCA.
To make the final steps work, we will explore how re-centering was not necessary for orthogonal regression projection, but is necessary for PCA.
Takeaways
Previously, our orthogonal regression line had a non-zero intercept. This meant that the regression performed well no matter “where” our rescaled data were. However, PCA builds the new axes through the origin by definition.
To illustrate the issue with this, we can build an orthogonal regression with the constraint that the intercept be zero (i.e. the regression line passes through the origin). This simulates how principal components are chosen.
Now, we can project to this new regression line. I will call our new axes that result from this projection ortho_no_1 and ortho_no_2.
Here is ortho_no_1 plotted against PC1.
Now we have perfectly recapitulated Principal Component Analysis by projecting to an orthogonal regression line.
However, by performing the orthogonal regression through the origin, we actually lose a lot of sensitivity to the data.
When we force orthogonal regression to pass through the origin, the line is fully defined by just two points: the origin \((0,0)\) and the centroid of the data \((\bar{x}, \bar{y})\). This is not a very good representation of the data due to these restrictions.
To illustrate this, I can show what the first principal component (or orthogonal regression with a fixed zero intercept) looks like on some fake data. As mentioned, the fit depends solely on the fixed intercept and the centroid, and ignores the actual slope of the data.
With this in mind, we have to consider our constraints. As I stated, principal axes (via matrix operations) always pass through the origin. So, we cannot get around that aspect when moving from orthogonal regression to principal component analysis.
But recall how we rescaled our data to improve the regression performance?
Now, we can do something similar. It can be shown that orthogonal regression, as well as principal components always pass through the centroid, which is the mean of each variable \((\bar{x}, \bar{y})\).
So what if we performed some operation on our data such that the centroid is at the origin? Then we no longer need to limit the fit of the regression just to ensure that the line passes through the origin, as it will naturally do so.
We can do that by subtracting the mean of each variable from each point. Recall that our rescaling operation was dividing each point by the standard deviation of each variable. Now if we both rescale (divide by \(\sigma\)) and recenter (subtract \(\bar{x}\)) the equation for transforming each variable is:
\(X=\frac{x-\bar{x}}{\sigma}\)
But keep in mind that \(x\) is referring to any variable, not just the variable on the \(x\) axis.
The above equation may look familiar. Indeed, it is the equation of a \(z\)-score, which converts data to a standard normal distribution. In the standard normal distribution, the mean is 0 and the standard deviation is 1.
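A minimal sketch of this standardization in R (scale() centers and rescales by default; the variable names are my own):

```r
# Standardize (z-score) each predictor: subtract the mean, then divide by the standard deviation
hp_std <- scale(mtcars$hp)  # equivalent to (hp - mean(hp)) / sd(hp)
wt_std <- scale(mtcars$wt)
```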
When the data are standardized, here is what they look like.
Using the standardized variables, we can compare (a) the projection to orthogonal regression and (b) PCA.
When the data are standardized, then orthogonal regression projection is identical to PCA.
Takeaways
We have taken a road trip to fundamentally understanding principal component analysis.
There are a variety of reasons for performing dimension reduction, including simplifying the overall model by reducing the number of variables.
Some variables may be correlated, so there is a redundancy in their information. This makes itself apparent in that if you know the value of one correlated variable, you can predict the value of the other variable. Converting these correlated variables into a single new variable seems like a good choice for reducing the number of variables while retaining information.
Since linear regressions follow the pattern of correlated variables, we could project to a linear regression as a single new variable representing the original correlated variables.
However, ordinary least-squares regression is dependent on variable order. You will get a different axis dependent on which variable is on each axis.
Orthogonal regression (a.k.a. Deming or Total Least Squares regression) is independent of variable order, and is a better choice.
Additionally, when projecting to ordinary least-squares regression, the resulting variables may still be correlated. The resultant variables from projection to orthogonal regression are not.
Since PCA is essentially the same as this approach, there is a catch. As mentioned, the new resulting axes are uncorrelated by definition. If all the original variables are correlated, then only the first principal axis has any meaning. For this reason, exploratory data analysis is crucial to see which original variables are correlated. If they all are, then PCA is not advisable beyond the first principal component.
We need to standardize our variables for PCA for a few reasons:
- Variables with larger magnitudes would otherwise dominate the new axes.
- The principal axes always pass through the origin, so the data must be centered there.
We can negate the above concerns by standardizing each variable by converting to a \(z\)-score.
\(z=\frac{x-\bar{x}}{\sigma}\).
It should be noted that this method works best for normally distributed variables. PCA performance can be improved by pre-processing data to become approximately normally distributed before standardization. For instance, if your data are log-normally distributed, you may want to first take the \(\log\) of your variable before calculating \(z\)-score.
Since standardization is so crucial, most functions that implement PCA in various programming languages (R, Python, etc.) have an option to perform centering and rescaling automatically. However, these operations are generally optimal for normally distributed data, and the functions do not check this assumption. For this reason, it is best to check for normality (e.g. with a Q-Q plot), and if the variables are not normal, perform your own pre-processing.
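As a quick sketch of such a check in base R (a hypothetical example, not part of the original analysis):

```r
# Visual normality check before relying on automatic centering/scaling
qqnorm(mtcars$hp)
qqline(mtcars$hp)

# If a variable looks log-normal, transform it before standardizing
hp_log_z <- scale(log(mtcars$hp))
```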
The orthogonal regression follows the direction of maximum variance of the variables. This direction can be calculated by eigen decomposition, which yields eigenvectors (the new axes) and eigenvalues (the variance along each new axis).
PCA is performed using eigen decomposition. This gives more information than simply projecting to the orthogonal regression lines (e.g. amount of variance represented by the new axes).
We started this journey by trying to predict vehicle fuel efficiency from weight and power. After doing some linear regressions on the variables independently, and then with multilinear regression, we decided to try dimension reduction to capture the effect of both weight and power in a single variable.
So to wrap it up, let’s take a look at predicting fuel efficiency based on the first principal component.
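A sketch of what that final comparison might look like (assuming prcomp() with scaling, as above):

```r
# Predict fuel efficiency from the first principal component
pca <- prcomp(mtcars[, c("hp", "wt")], scale. = TRUE)
fit_pc1 <- lm(mtcars$mpg ~ pca$x[, "PC1"])
summary(fit_pc1)$r.squared
```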
It seems that for this example case, multilinear regression is a better way of predicting fuel efficiency from the original two variables. And maybe that is the final lesson. This project showed us intuitively how PCA works. But in our case it was not necessary since we had so few variables, and it actually made the performance worse. Sometimes “simpler” models like MLR are better!
Hopefully this has shown how PCA works. Now you may better understand how and why we preprocess data, what the variance measures indicate, and when PCA may actually be detrimental to model performance.