During the creation of regression models, collinearity can occur, causing invalid results for individual regressors
William James once said: “We must be careful not to confuse data with the abstractions we use to analyze them.” Collinearity is a problem that occurs during the creation of regression models: the presence of intercorrelation among the predictor variables. In other words, it occurs when a regressor is, or is nearly, a linear combination of one or more of the other regressors. Although a model with collinearity may still have good predictive ability, the results for an individual regressor may not be valid. I dealt with model building in a previous article (“How to Select a Useful Model,” Scientific Computing, February 2012); this article is a follow-up. I will divide the topic into indicators of collinearity, diagnostic tests for collinearity, and correction of collinearity.
Indicators of collinearity include: insignificant parameter tests for theoretically important parameters; insignificant parameter tests even though the test of the whole model is significant; large standard errors for regression coefficients; extreme variability in parameters across samples; large changes in parameters when the data change or when other variables are added or removed; unexpected signs for parameters; and decreases in regression standard errors when a variable is removed. These can be determined from the output tables of a standard analysis, such as from SAS Proc Reg (Figure 1).
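For readers who want to reproduce output such as Figure 1, a minimal SAS sketch follows; the data set name (work.sim) and the variable names Y and X1-X7 are placeholders rather than part of the original example.

/* Fit the full model; the analysis-of-variance table and the
   parameter estimates with their standard errors and t tests
   are printed by default. */
proc reg data=work.sim;
   model Y = X1 X2 X3 X4 X5 X6 X7;
run;
quit;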
A visual indication can be found using Leverage Plots in the JMP Fit Model platform. Collinearity appears as shrinkage of the points in the X direction (Figure 2). X3 is involved in collinearity, since its values shrink toward the center of the plot; X4, shown for comparison, is not involved, since its values are dispersed along the X axis.
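JMP builds these plots interactively. In SAS, one comparable display is the partial regression leverage plot, which can be requested with the PARTIAL option of Proc Reg; the data set and variable names below are again placeholders.

ods graphics on;
proc reg data=work.sim;
   /* PARTIAL requests a partial regression leverage plot for each regressor */
   model Y = X1 X2 X3 X4 X5 X6 X7 / partial;
run;
quit;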
Diagnostic tests for collinearity are: VIF (variance inflation factor), the correlation matrix of the variables, eigenvalues and eigenvectors, condition indices, and variance proportions. VIF is calculated as 1/(1 − Ri²), where Ri² is the coefficient of determination of the regression of the ith input variable on all of the other input variables. It quantifies how much collinearity inflates the instability (variance) of a coefficient estimate. It is available as an option under SAS Proc Reg (Figure 3) or in JMP. Using R² = 0.5411 from Figure 1, we can calculate a VIF of 1/(1 − 0.5411) = 2.1791. Variables with a VIF greater than this value (i.e. X2 and X3) are more closely associated with the other X (independent) variables than with the Y (dependent) variable.
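As a sketch, the VIF column can be added to the standard Proc Reg output with the VIF option, and the threshold quoted above can be checked by hand; data set and variable names remain placeholders.

/* Request variance inflation factors alongside the parameter estimates. */
proc reg data=work.sim;
   model Y = X1 X2 X3 X4 X5 X6 X7 / vif;
run;
quit;

/* Hand check of the threshold quoted in the text:
   1 / (1 - 0.5411) = 1 / 0.4589 = 2.1791 */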
The correlation matrix of the variables is also available as an option under SAS Proc Reg (Figure 4) or in JMP. High correlations between pairs of variables (e.g. −0.8749 for X2 and X3; −0.8806 for X6 and X7) are indicators of collinearity.
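Outside of Proc Reg, the same matrix can be produced with Proc Corr; here is a minimal sketch with placeholder names.

/* Pairwise correlations among the regressors; large absolute values,
   such as those quoted for X2-X3 and X6-X7, flag possible collinearity. */
proc corr data=work.sim;
   var X1 X2 X3 X4 X5 X6 X7;
run;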
Eigenvalues near zero from a principal component analysis are an indication of collinearity. These can be generated using JMP Principal Components Analysis on Correlations (see Figure 5, eigenvalue = 0.0810 for Principal Component 7). The eigenvectors may show which variables are involved if there are large loadings (values) for several variables on a principal component with a low eigenvalue (in Figure 5, Principal Component 7 has large loadings on X6 and X7).
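The same eigenvalue and eigenvector tables can be produced in SAS with Proc Princomp, which analyzes the correlation matrix by default; the sketch below again uses placeholder names.

/* Eigenvalues near zero, combined with large loadings on several
   variables in the corresponding eigenvector, point to the
   variables involved in collinearity. */
proc princomp data=work.sim;
   var X1 X2 X3 X4 X5 X6 X7;
run;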
Condition indices are the square roots of the ratios of the largest eigenvalue to each individual eigenvalue. Condition indices greater than 10 are an indication that the regression estimates may be affected. These can be generated using the SAS Proc Reg COLLIN option, which includes the intercept, or the COLLINOINT option, which adjusts the intercept out first (Figure 6 uses the COLLINOINT option). None of the condition indices indicates a collinearity problem.
Variance proportions are the proportions of the variance of each parameter estimate accounted for by each principal component. If a principal component with a high condition index contributes substantially to the variance of at least two variables, this is an indication of collinearity (Figure 6). None of the principal components indicates a collinearity problem.
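Both the condition indices and the variance proportions shown in Figure 6 come from the same Proc Reg options; a minimal sketch with placeholder names:

/* COLLIN includes the intercept in the diagnostics; COLLINOINT adjusts
   it out first. Both print condition indices and the proportion of each
   coefficient's variance associated with each principal component. */
proc reg data=work.sim;
   model Y = X1 X2 X3 X4 X5 X6 X7 / collin collinoint;
run;
quit;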
Correction of collinearity is more difficult than diagnosis. Methods for dealing with collinearity should begin with increasing the sample size, since this should decrease the standard errors. If this is not feasible, removal of intercorrelated variables can be approached using some of the methods I discussed in “How to Select a Useful Model” (Scientific Computing, February 2012), such as stepwise regression using SAS Proc Reg or JMP Fit Model. Ensure that interaction terms use centering, i.e. transforming each variable by subtracting its mean. Redefine the variables by using an alternative form, such as a percentage or a per capita value. If these more straightforward approaches don’t work, then more elaborate approaches may be needed, such as removing the variance associated with one of the intercorrelated variables by regressing the other variables on it and working with the residuals, or analyzing the common variance as a separate variable.
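As a sketch of two of the simpler remedies, the following centers two regressors before forming a hypothetical X2*X3 interaction and then runs stepwise selection; the data set, variable names and interaction term are illustrative assumptions, not taken from the article’s example.

/* Center X2 and X3 (subtract each variable's mean) before building
   an interaction term from them. */
proc standard data=work.sim mean=0 out=work.centered;
   var X2 X3;
run;

data work.centered;
   set work.centered;
   x2x3 = X2 * X3;   /* interaction formed from the centered variables */
run;

/* Stepwise selection as one way to drop intercorrelated regressors. */
proc reg data=work.centered;
   model Y = X1 X2 X3 X4 X5 X6 X7 x2x3 / selection=stepwise;
run;
quit;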
The variables can also be transformed to principal components, and those with small eigenvalues can be eliminated, but the larger question is whether the retained principal components are interpretable. Ridge regression can also be considered where other options don’t work. It introduces a small bias in exchange for a reduction in sampling variance.
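A ridge trace can be requested directly from Proc Reg; the grid of ridge parameters below is an arbitrary illustration, and the data set and variable names remain placeholders.

/* RIDGE= fits the model over a grid of ridge parameters; OUTEST= stores
   the coefficient estimates and OUTVIF adds the corresponding VIFs, so
   the trace can be examined for a value where the estimates stabilize. */
proc reg data=work.sim outest=work.ridge ridge=0 to 0.10 by 0.01 outvif;
   model Y = X1 X2 X3 X4 X5 X6 X7;
run;
quit;

proc print data=work.ridge;
run;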
During the creation of regression models, collinearity can occur, which causes invalid results for individual regressors, although the overall model can still have good predictive ability. There are several indicators of collinearity, but diagnostic tests, such as VIF, should be performed. Once collinearity is identified, a strategy for dealing with it should proceed from the simplest approach (increasing the sample size) to the more complex if needed (ridge regression).
Mark Anawis is a Principal R&D Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at [email protected].