Subset selection, regularization, and shrinkage
#1
Posted 2011-November-21, 14:31
http://blogs.mathworks.com/loren/
#2
Posted 2011-December-06, 11:27
(This piece introduces a newish regression algorithm called "Lasso" which offers some significant advantages compared to traditional linear regression)
http://blogs.mathwor...ization-part-2/
#3
Posted 2011-December-06, 12:11
You gotta luv Krugman. Humble country school teacher comes from nowhere to be the Rush Limbaugh of the left. Ain't America grand.
#4
Posted 2011-December-06, 12:32
jdeegan, on 2011-December-06, 12:11, said:
I have to confess that despite being a practicing economist I haven't come across ridge regression or lasso before (my thanks to hrothgar for posting these links). But I'm well aware of the dangers of "over-fitting" when estimating equations using standard regression techniques that then don't prove very useful for forecasting, for example including too many lags because they appear to be significant when looking at t-stats. Putting a premium on parsimony therefore seems an attractive idea. Doing it by "forcing" coefficients towards zero doesn't necessarily seem an intuitive way of doing it, but I can see advantages in this, too, e.g. in cases where traditional estimation produces a series of lags with opposite signs which, even though they can be significant in helping to fit the data over the estimation period, seem unlikely to help in predicting the future.
#5
Posted 2011-December-06, 12:38
Linear regression identifies a set of coefficients that minimizes the sum of the squared errors between predicted and actual values.
Lasso changes this minimization problem. We identify a set of coefficients that minimizes the sum of the squared errors plus a penalty proportional to the sum of the absolute values of the regression coefficients. (We're using an L1 norm.)
Ridge regression (aka Tikhonov regularization) is the same as lasso except we substitute an L2 norm for the L1 norm. This time around we identify a set of coefficients that minimizes the sum of the squared errors plus a penalty proportional to the sum of the squares of the coefficients. As usual, the math is a lot easier with an L2 norm, which is why Tikhonov solved this problem a long time before lasso was a twinkle in Tibshirani's eye...
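Written out, the three problems look like this. The notation is introduced here for illustration and is not from the linked articles: x_i is the i-th row of predictors, p is the number of coefficients, and λ ≥ 0 is the penalty weight that the description above leaves implicit.

```latex
\begin{aligned}
\hat\beta^{\text{OLS}}   &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 \\
\hat\beta^{\text{lasso}} &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\hat\beta^{\text{ridge}} &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\end{aligned}
```

The "easier math" for the L2 case shows up as a closed form, \hat\beta^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y, whereas the lasso has no general closed form and is solved iteratively.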
As for motivation:
1. The predictive accuracy of linear regression models suffers dramatically if you have relatively wide data sets with strong correlation between your independent variables.
2. Regularization techniques like ridge regression and lasso are often able to significantly improve predictive accuracy (at the cost of increasing your bias).
3. Lasso and ridge regression differ in the choice of the norm. The L1 norm will cause the lasso to quickly drive individual regression coefficients completely to zero, thereby acting as a feature selection technique. The L2 norm used by ridge will preserve larger numbers of independent variables within the model.
4. There is also something known as an elastic net which is a convex combination of a ridge regression and a lasso and offers many of the best properties of both.
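Point 3 is easiest to see in a special case. A minimal sketch, assuming the predictors are orthonormal (X'X = I): there, both estimators can be written directly in terms of the OLS coefficients. Lasso soft-thresholds them at lam/2 (for the "squared error plus lam times sum of absolute values" objective), while ridge divides them by 1 + lam. The coefficient values and lam below are made up for illustration.

```python
def soft_threshold(z, t):
    """Move z toward zero by t; anything inside [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    # Valid only under the orthonormal-design assumption stated above.
    return [soft_threshold(z, lam / 2.0) for z in ols_coefs]


def ridge_orthonormal(ols_coefs, lam):
    # Ridge shrinks every coefficient by the same factor; none hits zero.
    return [z / (1.0 + lam) for z in ols_coefs]


ols = [3.0, -1.5, 0.4, -0.2]  # hypothetical OLS coefficients
lam = 1.0                     # hypothetical penalty weight

lasso = lasso_orthonormal(ols, lam)
ridge = ridge_orthonormal(ols, lam)
print("lasso:", lasso)  # the two small coefficients are set to exactly zero
print("ridge:", ridge)  # all four coefficients survive, just smaller
```

With correlated predictors the formulas are no longer this simple, but the qualitative behaviour (lasso zeroing coefficients, ridge only shrinking them) carries over.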
#6
Posted 2011-December-06, 13:17
hrothgar, on 2011-December-06, 12:38, said:
So lasso and ridge overcome the problem of multicollinearity?
"If you're driving [the Honda S2000] with the top up, the storm outside had better have a name."
Simplify the complicated side; don't complify the simplicated side.
#7
Posted 2011-December-06, 13:22
S2000magic, on 2011-December-06, 13:17, said:
Much of the time, yes. However, you're decreasing variance by increasing bias.
#9
Posted 2011-December-06, 13:50
WellSpyder, on 2011-December-06, 12:32, said:
Here's an intuitive explanation that might help.
Assume that you have a linear model where Y = f(X1, X2, ... XN) + noise vector
Furthermore, let's assume that one of these variables is a linear function of another.
If you run your regression, the program will probably throw some warning about a rank deficient matrix, the reason being that you can't estimate unique values for these two coefficients. Any pair of coefficients that produces the same combined effect is equally valid.
Now perturb one of your observations by epsilon so that you no longer have this whole "rank deficiency" issue. Your regression is going to run perfectly fine. However, there's a catch... Relatively minor changes to your noise vector are going to cause enormous swings in your regression coefficients for the two correlated variables. One time they'll be sitting at (+500, +800), the next at (-15, -24), the time after that at (-2500, -4000). If you want to believe that these coefficients have some real world meaning, this behavior is really annoying.
Adding in the regularization term penalizes solutions that are far removed from zero and makes the entire process much more stable.
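Here is a toy version of that thought experiment, as a rough sketch rather than anything from the linked articles. The data are made up: x2 is exactly 2 * x1 except for one observation nudged by 0.001, and the two response vectors differ only in the noise on the last observation. It uses plain Python and a hand-rolled 2x2 solve (no intercept) so it runs without any libraries; lam = 1.0 is an arbitrary penalty weight.

```python
def fit(x1, x2, y, lam=0.0):
    """Solve (X'X + lam*I) b = X'y for two predictors via Cramer's rule.

    lam = 0 gives ordinary least squares, lam > 0 gives ridge.
    """
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * v for a, v in zip(x1, y))
    t2 = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.001]  # 2 * x1, last value perturbed by epsilon

# Same underlying signal (y = x1), two slightly different noise draws.
y_a = [1.1, 1.9, 3.1, 3.9, 5.1]
y_b = [1.1, 1.9, 3.1, 3.9, 4.9]

ols_a, ols_b = fit(x1, x2, y_a), fit(x1, x2, y_b)
ridge_a, ridge_b = fit(x1, x2, y_a, lam=1.0), fit(x1, x2, y_b, lam=1.0)

print("OLS:  ", ols_a, ols_b)      # roughly (-265.7, 133.3) vs (134.3, -66.7)
print("ridge:", ridge_a, ridge_b)  # roughly (0.200, 0.401) vs (0.197, 0.393)
```

A change of 0.2 in a single response value flips the signs of both OLS coefficients and moves them by hundreds, while the ridge coefficients barely move.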
#10
Posted 2011-December-06, 15:35
#11
Posted 2011-December-06, 15:47
jdeegan, on 2011-December-06, 15:35, said:
That's so old-school; you probably also bid suits you have.
#13
Posted 2011-December-06, 22:32
jdeegan, on 2011-December-06, 17:50, said:
OK, now you're scaring me.
#14
Posted 2011-December-07, 00:23
#16
Posted 2011-December-07, 08:04
jdeegan, on 2011-December-07, 00:23, said:
I drove a 911S for many years (20,000 miles when I bought it, 180,000 miles when I sold it), and I can tell you hands down the S2000 is more fun to drive than the Porsche; and the Porsche was a blast to drive!
There is something sweet about a 9,000 RPM redline.
#17
Posted 2011-December-07, 08:07
WellSpyder, on 2011-December-07, 04:07, said:
If they're sufficiently strongly correlated (positively or negatively), does it really matter which one(s) you drop?
#18
Posted 2011-December-07, 08:29
hrothgar, on 2011-December-06, 13:22, said:
LASSO is not good at dealing with correlated predictors. If there are two strongly correlated predictors it may simply be impossible to determine which of the two is the causal one and which one works only through confounding with the other. In that case, the most robust thing you can do is to give each of them approximately equal influence. This is what RIDGE does. Stepwise AIC has the same problem as LASSO.
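A small sketch of that point, using the extreme case of two identical predictors (made-up data, arbitrary penalty weight lam). With identical columns the fit depends only on b1 + b2, so for any fixed total the L1 penalty is the same however the total is split between same-signed coefficients, while the L2 penalty is smallest for the equal split.

```python
x = [1.0, 2.0, 3.0]  # the same column is used for both predictors
y = [2.0, 4.0, 6.0]
lam = 1.0            # hypothetical penalty weight


def sse(b1, b2):
    # With duplicated predictors, only the sum b1 + b2 affects the fit.
    return sum((yi - (b1 + b2) * xi) ** 2 for xi, yi in zip(x, y))


def lasso_objective(b1, b2):
    return sse(b1, b2) + lam * (abs(b1) + abs(b2))


def ridge_objective(b1, b2):
    return sse(b1, b2) + lam * (b1 ** 2 + b2 ** 2)


equal_split = (0.9, 0.9)
lopsided_split = (1.8, 0.0)

# Lasso cannot tell the two splits apart; ridge prefers the equal one.
print("lasso:", lasso_objective(*equal_split), lasso_objective(*lopsided_split))
print("ridge:", ridge_objective(*equal_split), ridge_objective(*lopsided_split))
```

In practice the predictors are rarely exact duplicates, so lasso does return an answer, but which of the two variables it keeps can hinge on small details of the data.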
So if your main concern is to deal correctly with correlated predictors, RIDGE is preferable to just about everything else, although I suppose the best thing to do would be to have a serious talk with the domain expert to try to get to a more advanced model that captures the domain knowledge better. For example, you might put an L2 (RIDGE) penalty on coefficients that belong to clusters of two or more correlated predictors, while putting an L1 (LASSO) penalty on the lonely riders. RIDGE and LASSO are somewhat ad hoc methods; they are the methods you will use when you have large data sets but shallow domain knowledge.
As for the bias: yes, but that is intentional. You apply biased estimators like RIDGE when the bias is a virtue. You have a prior belief that small coefficients are more plausible than large ones, so the posterior mean (or mode) must be smaller in magnitude than the unbiased estimate.
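The prior-belief reading can be made explicit for RIDGE. A sketch under assumed notation that is not from the thread: Gaussian noise with variance σ² and an independent zero-mean Gaussian prior with variance τ² on each coefficient.

```latex
\begin{aligned}
y \mid \beta &\sim \mathcal{N}(X\beta,\ \sigma^2 I), \qquad \beta \sim \mathcal{N}(0,\ \tau^2 I) \\
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2 + \frac{1}{2\tau^2}\,\lVert \beta \rVert_2^2 + \text{const} \\
\hat\beta^{\text{MAP}} &= \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}
```

A tighter prior (smaller τ²) means a larger λ and more shrinkage. Replacing the Gaussian prior with a Laplace (double-exponential) prior gives the LASSO penalty as the posterior mode in the same way.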
#19
Posted 2011-December-07, 08:33
#20
Posted 2011-December-07, 08:55
nige1, on 2011-December-07, 08:33, said:
Loren is, indeed, great.
However, I feel obliged to point out that those two articles (and all the code) were authored by moi...