
4.5.4.5 Comparison Between PLS and Multiple Regression
As implemented in SYBYL/QSAR, partial least squares (PLS) may be described as a major extension of the most widely used data-analysis technique in QSAR, multiple regression (MR). Applied to the same dataset with exactly the same explanatory columns, and with the number of PLS components equal to the number of explanatory columns, MR and SYBYL/QSAR PLS produce identical numerical values (coefficients, r², and s). However, PLS has several general advantages over MR:
1. The ability to produce useful, robust equations even when the number of columns vastly exceeds the number of rows, that is, when the number of values to be estimated exceeds the number of observations. The CoMFA technique illustrates this property (a minimal sketch follows this list).
2. Better overall predictions and more robust coefficient values, because only those components which actually improve predictive performance are retained.
3. Much lower sensitivity to the distributions of variable values, which for optimal performance in MR need to be individually normal and mutually orthogonal.
4. Considering more than one dependent variable at a time is straightforward and unambiguous in PLS. One case where this can be useful is the antibacterial potency of a compound across a spectrum of micro-organisms, which can be analyzed in a single PLS run. This facilitates understanding of common and competing trends among the target variables.
5. Much more rapid computation with large data matrices, achieved by limiting the number of components extracted; conventional matrix inversion is unnecessary. (However, crossvalidation re-derives a given model many times, so this advantage is not always realized in practice.)
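To make the first advantage concrete, the following sketch (ours, for illustration only; it is not SYBYL code and assumes Python with scikit-learn's PLSRegression) analyzes a synthetic table with far more columns than rows, of the strongly redundant kind typical of CoMFA field descriptors, and chooses the number of components by leave-one-out crossvalidation:

```python
# Illustrative sketch (not SYBYL code): a synthetic "field-like" table with
# far more columns than rows, but with the strong redundancy among columns
# that is typical of CoMFA descriptors.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n_rows, n_cols, n_latent = 20, 500, 3
scores = rng.normal(size=(n_rows, n_latent))            # three underlying factors
loadings = rng.normal(size=(n_latent, n_cols))
X = scores @ loadings + 0.05 * rng.normal(size=(n_rows, n_cols))
y = scores @ np.array([1.0, 2.0, -1.0]) + 0.2 * rng.normal(size=n_rows)

# Leave-one-out crossvalidation measures predictive performance (q2)
# for each candidate number of components.
for n_comp in (1, 2, 3, 4, 5):
    y_cv = cross_val_predict(PLSRegression(n_components=n_comp), X, y,
                             cv=LeaveOneOut()).ravel()
    q2 = 1.0 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"{n_comp} components: q2 = {q2:.2f}")

# Ordinary MR cannot be applied directly here: with 500 columns and only
# 20 rows the normal equations are singular and infinitely many exact
# fits exist, none of them trustworthy.
```

With the redundancy built into this example, a model with about as many components as underlying factors should show good crossvalidated performance, while adding further components should not help.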
Studies characterizing the frequency of correlation within tables of random numbers, using either stepwise regression or PLS with crossvalidation, show that each method carries a different risk. With stepwise regression, there is a high risk of accepting a chance correlation as correct and general. With PLS, there is the opposite risk of overlooking a correct and general correlation, if that correlation involves only a small subset of explanatory variables hidden within a large number of irrelevant candidates. However, most QSAR studies entail enough redundancy among the explanatory variables that the major risk is an unrecognized chance correlation misdirecting experimental work. Thus, the conservative behavior of PLS is generally preferable.
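The contrast between the two risks can be reproduced on a table of pure random numbers. The sketch below is illustrative only: it assumes Python with scikit-learn, and it uses a crude "pick the best-correlated columns" selection as a stand-in for a true stepwise regression procedure.

```python
# Illustrative sketch: both X and y are pure random numbers, so any apparent
# correlation is a chance correlation.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
n_rows, n_cols = 20, 50
X = rng.normal(size=(n_rows, n_cols))      # all columns are noise
y = rng.normal(size=n_rows)                # so is the "target"

# Crude stand-in for stepwise selection: keep the 5 columns that happen to
# correlate best with y, then report the conventional fitted r2.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_cols)])
picked = np.argsort(corr)[-5:]
r2_fit = LinearRegression().fit(X[:, picked], y).score(X[:, picked], y)

# Crossvalidated PLS on the full random table reports predictive q2 instead.
y_cv = cross_val_predict(PLSRegression(n_components=2), X, y,
                         cv=LeaveOneOut()).ravel()
q2 = 1.0 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"selected-column regression r2 = {r2_fit:.2f}")   # typically large
print(f"crossvalidated PLS q2         = {q2:.2f}")       # typically near or below zero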
The usual implementation of PLS in SYBYL does encounter difficulty with "experimental design" studies, in which compounds have been chosen to differ from one another as broadly as possible. A recent paper [Ref. 50] outlines an alternative crossvalidation procedure which is suitable in this situation.
A corollary of these results is that PLS works well when the explanatory variables are intercorrelated (non-orthogonal), whereas regression becomes untrustworthy under such collinearity. Again, in most QSAR studies the explanatory variables tend to be strongly intercorrelated, so PLS is the more useful technique.
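The effect of collinearity on the two methods can be seen with two nearly identical explanatory columns. The sketch below is illustrative only (assuming Python with scikit-learn); with such data, MR coefficients tend to be inflated and opposite in sign, while the one-component PLS model assigns two similar, moderate coefficients.

```python
# Illustrative sketch: two nearly identical explanatory columns.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)          # y depends on the shared signal

mr = LinearRegression().fit(X, y)
pls = PLSRegression(n_components=1).fit(X, y)

print("MR coefficients :", mr.coef_)            # typically inflated, opposite-signed
print("PLS coefficients:", pls.coef_.ravel())   # two similar, moderate values
```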
Before proceeding to a formal description of the PLS algorithm, we offer some ideas and metaphors which may help in understanding its behavior.
The notion of estimating many values from few examples often seems odd at first. Professor Wold (Umeå, Sweden), the pioneer of PLS, offers a thought which may help in shifting one's point of view: "Traditional methods of data analysis such as MR require the experimental scientist to limit the number of explanatory variables they measure or calculate. This is like saying, 'Too much knowledge about your problem is bad.' Does this make sense?"
Figure 24. Qualitative comparison of the multiple regression (MR) and PLS algorithms.
Another concern relates to the algorithmic mechanics of PLS. Why does MR consume one degree of freedom per explanatory column, while PLS generates a coefficient for every column yet consumes only one degree of freedom per component? Figure 24 provides some insight into the two processes. Both MR and PLS attempt to maximize overlap between target and explanatory properties. The difference is that MR maximizes the overlap of individual explanatory properties, one at a time, extracting a single coefficient for each; each new column therefore uses up a degree of freedom, or row. PLS, on the other hand, maximizes overlap with the entire matrix of explanatory properties in each step. Because the entire matrix is always involved, extraction of a single PLS component generates a non-zero coefficient for every explanatory variable. Each new component also uses up a degree of freedom, or row, and when all possible components have been extracted, the coefficients generated by PLS are identical to those generated by MR (see the sketch below). However, the model which is optimal with PLS often differs from that produced by MR, because the crossvalidation criterion often prefers a PLS model with fewer than the maximum number of components.
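The equivalence of the fully extracted PLS model and MR, and the difference made by truncating the number of components, can be checked directly. The sketch below is illustrative only and assumes Python with scikit-learn; it is not the SYBYL implementation.

```python
# Illustrative sketch: with as many components as explanatory columns,
# PLS reproduces the multiple-regression fit; with fewer, it does not.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n_rows, n_cols = 40, 5                      # more rows than columns, full rank
X = rng.normal(size=(n_rows, n_cols))
y = X @ np.array([0.5, -1.0, 0.0, 2.0, 1.5]) + rng.normal(scale=0.3, size=n_rows)

mr = LinearRegression().fit(X, y)
pls_full = PLSRegression(n_components=n_cols).fit(X, y)   # all 5 components
pls_few = PLSRegression(n_components=2).fit(X, y)         # a truncated model

print(np.allclose(mr.coef_, pls_full.coef_.ravel()))            # True: same coefficients
print(np.allclose(mr.predict(X), pls_full.predict(X).ravel()))  # True: same fitted values
print(np.allclose(mr.predict(X), pls_few.predict(X).ravel()))   # False: fewer components
```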
Another way of describing PLS is as a factor analysis of the explanatory properties in which the objective is to maximize alignment with the target property values rather than with the Cartesian (or other) axes. For this reason PLS is sometimes likened to Principal Components Regression (PCR), a technique in which the scores from a principal components analysis (PCA) are used as the explanatory variables in conventional MR. But, as detailed below, our view is that PCR is an inherently less efficient way of trying to accomplish the same thing as PLS.
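The following sketch illustrates why PCR can be less efficient: because PCA components are chosen to align with the directions of largest variance in the explanatory data rather than with the target, PCR may need more components than PLS when the relevant variation is not the dominant one. The example, dataset, and library (Python with scikit-learn) are illustrative assumptions, not part of SYBYL.

```python
# Illustrative sketch: the largest-variance direction of X is irrelevant to y,
# so PCR needs more components than PLS to capture the signal.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 100
big = 10.0 * rng.normal(size=n)            # high-variance factor, unrelated to y
small = rng.normal(size=n)                 # low-variance factor that drives y
X = np.column_stack([big + 0.1 * rng.normal(size=n),
                     big - 0.1 * rng.normal(size=n),
                     small])
y = small + 0.1 * rng.normal(size=n)

for k in (1, 2):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k),
                        LinearRegression()).fit(X, y)
    pls = PLSRegression(n_components=k).fit(X, y)
    print(f"{k} component(s): PCR r2 = {pcr.score(X, y):.2f}   "
          f"PLS r2 = {pls.score(X, y):.2f}")
```

With one component, PCR aligns with the high-variance but irrelevant factor and fits poorly, while PLS, which weights directions by their covariance with the target, fits well; with two components the difference disappears.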


Copyright © 1999, Tripos Inc. All rights reserved.