6 min read

Principal components of explained variance

I have been spending most of my time on a very interesting technique that has, unfortunately, received little attention (I’ll come back later about some possible reasons). The purpose of this post is to introduce this method and give motivation for its use.

The setting is the following: say we have a set of variables \( Y_1, \ldots, Y_p \), and we are interested in the relationship between them and a covariate \( X \) (or possibly several covariates). If the \( Y \)s are independent, we can look at all pairs \( (Y_i, X) \) and study their relationship individually; the information contained in a pair is in no way relevant to another pair.

In general, the two following problems can arise:

  1. The variables \(Y_i \) are rarely truly independent. For example, they could represent several ways of measuring a common quantity (e.g. body fat) or simply distinct, but correlated measurements (e.g. forced expiratory volume measured every week for two months, or methylation values at several loci along a gene).

  2. The number of variables \( p \) may be very large, thus rendering the pairwise analyses both computationally long and inefficient.

Several strategies can be used to deal with these issues: we can filter the variables, in the hope that we can retain most of the information while removing the noise; we can perform variable selection (e.g. LASSO, SCAD, elastic net); or we can perform dimension reduction. Principal component analysis (PCA) is an example of the latter.

(The first two strategies look very similar, but the main difference lies in the dimensionality of the procedure: filtering is typically done one dimension at a time, while variable selection is performed on the \( p \)-dimensional space.)

Principal components analysis

The main goal of PCA is to find a linear combination of the variables \( \bar{Y} = a_1 Y_1 + \cdots + a_p Y_p \) that ‘‘best’’ represents them. This is achieved by solving an optimization problem: the best linear combination is such that \( \bar{Y} \) has the largest possible variance. In this way it is possible to reduce to dimension from \( p \) to just 1!

Of course, a few questions arise immediately:

  1. How much information are we losing in the process?

  2. Is a linear combination really the best way to represent a set of vectors?

While the first question can be assessed by looking at how much of the overall variance of the variables is retained by \( \bar{Y} \), the second question is a lot more subtle: it heavily depends on the problem at hand. (By the way, for a non-linear extension of PCA, you can have a look at kernel PCA.)

However, there is another issue to keep in mind: what guarantee do we have that \( \bar{Y} \) is at all associated with the covariate \( X \)? The answer: none whatsoever. This is where the Principal Components of Explained Variance (PCEV) can be seen as a useful alternative.

Principal Components of Explained Variance

PCEV was first introduced in the context of family-based linkage analyses of complex traits by Ott and Rabinowitz. Its purpose was to improve the power of analyses, in the context where researchers were usually performing PCA or using a univariate approach (analyzing one variable \( Y_i \) at a time).

For the remainder of this post, we will group the variables into a vector \( \mathbf{Y} = (Y_1, \ldots, Y_p) \), and assume a linear model for the relationship between our variables and the covariate:

\[ \mathbf{Y} = BX + E. \]

The relationship is therefore summarized by a vector of regression coefficients \( B \).

Using this model, the overall variance of \( \mathbf{Y} \) can be decomposed into a model and a residual component:

\[ \mathrm{Var}(\mathbf{Y}) = B\mathrm{Var}(X)B^T + \mathrm{Var}(E). \]

If we take a linear combination of the variables (which can be written in matrix notation as \( a^T \mathbf{Y} \) for a vector of coefficients \( a \)), we get a similar variance decomposition:

\[ \mathrm{Var}(a^T \mathbf{Y}) = a^T B\mathrm{Var}(X)B^T a + a^T\mathrm{Var}(E)a. \]

The main difference, however, is that this is a sum of scalar quantities, and therefore we can take their ratio:

\[ h^2(a) = \frac{a^T B \mathrm{Var}(X)B^T a}{a^T\mathrm{Var}(E)a}. \]

(In genetics, this ratio is known as the heritability.)

Whereas PCA seeks the linear combination \( \bar{Y} \) with the largest variance, PCEV looks for the linear combination which maximizes \( h^2(a) \).

Winner’s curse?

Unlike PCA, PCEV is likely to find a linear combination which is actually associated with our covariate of interest \(X\). Indeed, we are maximizing this association. Should this worry us? are we introducing some kind of bias?

Not necessarily. Doing linear regression between the linear combination \( \bar{Y} \) and \( X \) will most likely result in inflated estimates, but there are other ways to get correct p-values. As always, we can perform permutations and try to estimate the null distribution. As a special case, if there is only one covariate \( X \), we can actually derive an asymptotic distribution for the proportion of variance \( h^2(a) \). In other words, it is possible to perform correct inference with PCEV.

Is it any good?

I have spent a lot of time doing simulations and studying the overall behaviour of PCEV. I’m currently writing up the results and hope to submit them soon for publication (I’ll write a post about it when it’ll get published), but overall the results are very interesting:

  • The Type I error is well controlled, even in the presence of strong correlation between the variables \( Y_i \). The same can be said of PCA in general, but it is well known that a univariate analysis followed by a Bonferroni correction will tend to be too conservative, and therefore lead to smaller Type I error than desired.

  • The power of PCEV is generally much higher than that of PCA, and also higher than a univariate approach. I will share the details of these simulations at some other time.

Little attention

PCEV has, unfortunately, received little attention in the scientific community. I think there are mainly three reasons:

  1. It was originally introduced as Principal Components of Heritability (PCH), and therefore it had a strong genetic connotation. By changing the name, we hope we can make it visible to a broader audience.

  2. The original authors, as well as the people who extended their work, did not publish any software implementation of PCEV. Therefore, using it meant writing your own code from scratch. However, I am currently working on an R package. It is still in development, but it can already be used.

  3. People have been using other methods for a long time, and it’s hard to make them change their workflow. Hopefully, the paper can convince some of them of the advantages of PCEV.

I hope you find this method interesting! I will probably publish more post about it, since my doctoral work centers on an extension of this framework.