Introduction
PLS Regression uses latent vectors to simplify the regression model. The more latent vectors, the more the model captures the characteristics of the training data used to create the model. But additional latent vectors also increase the risk of overfitting the training data and add complexity to the model. Just like in Factor Analysis, we seek a happy medium.
We consider two criteria: (1) percentage of variance captured by the model and (2) cross-validation.
The terminology used on this webpage is from PLS Regression Basic Concepts. Our examples are based on the data in Example 1 of PLS Regression Example.
Note that on this webpage we use A′ to refer to the transpose of matrix A.
% Variance Explained
Example 1: What is the percentage of the variance for X and Y explained by the latent vectors in each of the three iterations from PLS Regression Example?
Since matrices in Figure 2 of PLS Regression Example are mean centered, the total variance for X is SSX = 16, as calculated by =SUMSQ(B13:E17). Similarly, the total variance for Y is SSY = 12, as calculated by =SUMSQ(H13:J17).
The X variance explained at each stage is given by p′p (or the sum of squares of the elements in vector p). So, at the first stage, 70.5% of the variance is explained as calculated by =SUMSQ(B125:B128) in cell K115 of Figure 1 and =K115/$L112 in cell K116. We also see that all the X variance is explained after the three stages, and, in fact, 98.3% of the variance is explained in the first two stages.
Figure 1 – % of variance explained
For Y, it turns out that d² is the amount of variance explained at each stage. So, in the first stage, 63.3% of the variance is explained as calculated by =F130^2 in cell K121 and =K121/$L118 in cell K122. We see that 95.8% of the variance is explained after the three stages and 85.4% of the variance is explained in the first two stages.
We conclude that even 1 or 2 latent vectors may be enough.
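The same percentages can be computed outside Excel with a short NIPALS loop. The sketch below is a minimal illustration, not the Real Statistics implementation: it assumes mean-centered data, normalizes each score vector t to unit length (as in Abdi 2003), so that p′p gives the X variance and d² the Y variance captured at each stage. The data here are random stand-ins, not the Example 1 values.

```python
import numpy as np

def pls_variance_explained(X, Y, n_components):
    """NIPALS-style PLS with unit-length score vectors t.
    Returns the fraction of X variance (p'p) and Y variance (d^2)
    captured by each latent vector, relative to total SSX and SSY."""
    X = X - X.mean(axis=0)                      # mean-center, as in the example
    Y = Y - Y.mean(axis=0)
    ss_x, ss_y = (X**2).sum(), (Y**2).sum()     # analogues of SSX = 16, SSY = 12
    x_var, y_var = [], []
    for _ in range(n_components):
        # w = dominant eigenvector of X'YY'X (first left singular vector of X'Y)
        w = np.linalg.svd(X.T @ Y)[0][:, 0]
        t = X @ w
        t = t / np.linalg.norm(t)               # unit-length score vector
        p = X.T @ t                             # X loadings; p'p = X SS explained
        c = Y.T @ t
        d = np.linalg.norm(c)                   # regression weight; d^2 = Y SS explained
        c = c / d
        X = X - np.outer(t, p)                  # deflate X
        Y = Y - d * np.outer(t, c)              # deflate Y
        x_var.append(p @ p)
        y_var.append(d**2)
    return np.array(x_var) / ss_x, np.array(y_var) / ss_y
```

When n_components equals the rank of the mean-centered X (three for the Example 1 data, four for a generic 5×4 matrix), the X percentages sum to 100%, matching what we observed after the three stages above.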
Approach using eigenvalues
We now offer another way to calculate the % of variance contributed by each latent vector stage. We summarize this approach for Example 1 in Figure 2. References are made to the spreadsheets shown in Figures 2 and 8 of PLS Regression Example.
For iteration 1, we insert the formulas
=MMULT(B13:E17,F119:F122) in range Q112:Q116
=MMULT(TRANSPOSE(B13:E17),B112:B116) in range V112:V115
For iteration 2, we insert the formulas
=MMULT(B43:E47,G119:G122) in range R112:R116
=MMULT(TRANSPOSE(B43:E47),C112:C116) in range W112:W115
And similarly for iteration 3.
Figure 2 – % of variance (version 2)
We next calculate the sum of squares for each column of Q112:S116 and V112:X115. We see that the resulting values in V116:X116 are the same as the X variance contributions in K115:M115.
If we divide the eigenvalue shown in cell P28 (or Q35) of Figure 1 of Eigenvectors in PLS Regression (repeated in cell Q118 above) by the sum of squares for Q112:Q116, we obtain the value shown in cell Q119. This is the same as the Y variance contributed by the first latent vector, as shown in cell K121.
If we repeat the calculations shown in Figure 1 of Eigenvectors in PLS Regression for the eigenvalues for the second and third latent vectors, we would obtain the values shown in R118 and S118. Dividing by the sum of squares shown in R117 and S117, we would obtain the variances attributed to the second and third latent vectors.
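The identity behind this approach can be checked numerically: the dominant eigenvalue λ of X′YY′X, divided by the sum of squares of the unnormalized scores Xw (column Q in Figure 2), equals d², the Y variance contributed by that latent vector. The sketch below verifies this on synthetic data (a stand-in, not the Example 1 matrices).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)); X -= X.mean(axis=0)   # mean-centered X
Y = rng.standard_normal((5, 3)); Y -= Y.mean(axis=0)   # mean-centered Y

# Dominant eigenvalue/eigenvector of the symmetric matrix X'YY'X
M = X.T @ Y @ Y.T @ X
evals, evecs = np.linalg.eigh(M)                       # ascending order
lam, w = evals[-1], evecs[:, -1]

t_raw = X @ w                 # unnormalized scores (column Q in Figure 2)
ss_t = t_raw @ t_raw          # their sum of squares (cell Q117)

# d^2 computed directly from the unit-length score vector
t = t_raw / np.sqrt(ss_t)
d2 = (Y.T @ t) @ (Y.T @ t)

# eigenvalue / SS of scores reproduces the Y variance contribution d^2
assert abs(lam / ss_t - d2) < 1e-10
```

The same check applies at each iteration once X and Y have been deflated, which is why repeating the eigenvalue calculation for the second and third latent vectors reproduces the values in R118:S118.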
Cross Validation
See Cross Validation for the definition of PRESS and how to calculate it for ordinary linear regression in Excel.
Example 2: Calculate the PRESS value for the PLS regression model based on Example 1 data with 3 latent vectors.
The result is shown in Figure 3. To streamline the analysis we use the Real Statistics PLSRegPred worksheet function to determine the predicted Y values for various PLS regression models and X data. See Real Statistics Support for PLS Regression for details.
Figure 3 – PRESS calculation
Referencing the X and Y data in Figure 1 of PLS Regression Example, we place the Real Statistics array formula
=PLSRegPred(B4:E4,DELROW(B$4:E$8,A183),DELROW(H$4:J$8,A183),3)
in range B183:D183, highlight range B183:D187, and press Ctrl-D.
We next obtain the residuals by placing the formula =B183-H4 in cell F183, highlighting range F183:H187, and pressing Ctrl-R and Ctrl-D. Finally, we arrive at the desired result shown in cell G189 by using the formula =SUMSQ(F183:H187).
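The leave-one-out PRESS loop can be mimicked outside Excel as follows. This is a minimal sketch, not the PLSRegPred function: it uses the standard NIPALS regression formulation (unnormalized scores, coefficients B = W(P′W)⁻¹Q′), a boolean mask as the analogue of DELROW, and synthetic data in place of the Example 1 values.

```python
import numpy as np

def pls_fit(X, Y, h):
    """Minimal NIPALS PLS2 fit with h latent vectors.
    Returns the column means and the coefficient matrix B."""
    xm, ym = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - xm, Y - ym
    W, P, Q = [], [], []
    for _ in range(h):
        w = np.linalg.svd(Xc.T @ Yc)[0][:, 0]   # weight vector
        t = Xc @ w
        tt = t @ t
        p = Xc.T @ t / tt                       # X loadings
        q = Yc.T @ t / tt                       # Y loadings
        Xc = Xc - np.outer(t, p)                # deflate
        Yc = Yc - np.outer(t, q)
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q).T
    B = W @ np.linalg.solve(P.T @ W, Q.T)       # regression coefficients
    return xm, ym, B

def pls_press(X, Y, h):
    """Leave-one-out PRESS: refit without row i, predict row i,
    and sum the squared residuals (the analogue of cell G189)."""
    press = 0.0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i           # analogue of DELROW
        xm, ym, B = pls_fit(X[keep], Y[keep], h)
        yhat = (X[i] - xm) @ B + ym
        press += ((Y[i] - yhat) ** 2).sum()
    return press
```

Calling pls_press for h = 1, 2, 3 produces the kind of PRESS-versus-h curve charted in Figure 4.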
If we perform the same operations for h = 1 and 2, we obtain the chart shown in Figure 4.
We were hoping for a scree-like curve (as in Factor Analysis), but with so little data and so few independent variables, this is not what happened. Even though there is no knee in this curve, it does look like h = 1 or h = 3 is a reasonable choice.
Examples Workbook
Click here to download the Excel workbook with various PLS Regression examples, including examples described on this webpage.
Click here for additional information about PRESS for PLS Regression.
References
Hervé Abdi (2003) Partial least squares (PLS) regression
https://www.utdallas.edu/~herve/Abdi-PLS-pretty.pdf
Sawatsky, M. L., Clyde, M., Meek, F. (2015) Partial least squares regression in the social sciences
https://www.tqmp.org/RegularArticles/vol11-2/p052/p052.pdf
Wise, B. M. (2019) Properties of partial least squares (PLS) regression, and differences between algorithms
https://eigenvector.com/Docs/Wise_pls_properties.pdf