Goals:

- leaps package for model selection

The problem with calculating AIC, BIC and Cp is that different people drop different constants from the formulas. Since they are constant for a particular dataset, dropping them doesn’t change the ranking of the models, but it does change the absolute values of the AIC, BIC and Cp statistics. So, let’s calculate the AIC, BIC and Cp following the formulas in Sleuth (and in the lecture slides).
As an example, let’s use the best model from the Sex discrimination study (which we’ll see on Friday).
AIC and BIC only depend on the sum of squared residuals (SSRes) for this model. Mallows’ Cp also requires σ² from the full model, so we’ll fit that too.
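Here is a sketch of what the fitting code might look like, assuming the data are case1202 from the Sleuth3 package; the particular terms in the “best” model below are just a placeholder, so substitute whichever model was chosen.

```r
library(Sleuth3)   # assumed source of the sex discrimination data (case1202)

# the chosen "best" model -- the terms here are only a placeholder
fit_best <- lm(log(Bsal) ~ Senior + Educ + Exper, data = case1202)

# the full model, with all of the explanatory variables under consideration
fit_full <- lm(log(Bsal) ~ Senior + Age + Educ + Exper, data = case1202)
```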
We can calculate the SSRes by summing the squared residuals,
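for example:

```r
# sum of squared residuals for the chosen model
ssres_best <- sum(residuals(fit_best)^2)
ssres_best
```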
For the full model we’ll pull σ from the summary output and square it,
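For instance:

```r
# sigma-hat from the full model's summary, then square it
sigma_full  <- summary(fit_full)$sigma
sigma2_full <- sigma_full^2
```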
Now we can plug directly into the formulas from Sleuth
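Here is a sketch of those calculations. The constants below follow one common way of writing the formulas; double-check them against Sleuth and the slides, since the constants are exactly where versions differ.

```r
n <- nrow(case1202)               # number of observations
p <- length(coef(fit_best)) - 1   # number of explanatory terms in the chosen model

# AIC and BIC based on SSRes (constants as in one common presentation)
aic <- n * log(ssres_best / n) + 2 * (p + 1)
bic <- n * log(ssres_best / n) + log(n) * (p + 1)

# Mallows' Cp uses sigma^2 from the full model
cp <- ssres_best / sigma2_full + 2 * (p + 1) - n

c(AIC = aic, BIC = bic, Cp = cp)
```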
For a single model, these statistics aren’t very meaningful. Their primary use is to compare models to each other. The leaps package is designed to search over possible models and calculate these statistics for them.
First you’ll need to install the leaps package.
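For instance, from the R console (installing only needs to be done once; load it with library() in each session):

```r
install.packages("leaps")
library(leaps)
```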
The function that does the work of finding the best models is regsubsets. As an example of its use, here’s how I found the best models in the SAT case study.
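A sketch of that call, assuming a data frame sat built from Sleuth3::case1201 with a log-transformed Takers column added; the data set and column names in your own code may differ.

```r
library(Sleuth3)
sat <- case1201
sat$log_Takers <- log(sat$Takers)   # assumed name for the log-transformed variable

best_subsets <- regsubsets(
  SAT ~ log_Takers + Income + Years + Public + Expend + Rank,
  data   = subset(sat, State != "Alaska"),   # exclude Alaska
  nbest  = 7,                                # best 7 models of each size
  method = "exhaustive"                      # consider all possible models
)
```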
The first argument specifies the largest model you are willing to consider: here I specify that my response is SAT and that the possible terms are all six explanatory variables. data should be pretty familiar by now; here I’m excluding Alaska. nbest specifies how many models should be returned for each size; I want the best seven models. method states the method for finding the best models: exhaustive considers all possible models, and the other options are forward and backward. Another useful argument is nvmax, the largest size of model you want to consider (by default it is set at 8, so here we don’t need to change it).
The object that is returned is a bit complicated. You can have a look at it with summary and plot
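For example:

```r
summary(best_subsets)   # which variables are in each of the candidate models
plot(best_subsets)      # one row per model, shaded by the included variables
```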
Both of these outputs tell us which parameters are included in each model, but not much else. I’ve written a little function to pull out some other useful information.
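This is not the actual helper from class, just a rough sketch of the same idea: it pulls the rss, bic, cp and model size out of the regsubsets summary, fills in the parameter estimates, and adds a short description of each model. The function and object names are mine.

```r
subsets_info <- function(object) {
  s <- summary(object)
  vars <- colnames(s$which)[-1]          # explanatory variable names (drop intercept)
  n_models <- nrow(s$which)

  # parameter estimates for each model, NA where a variable isn't included
  est <- matrix(NA_real_, nrow = n_models, ncol = length(vars),
                dimnames = list(NULL, vars))
  for (i in seq_len(n_models)) {
    ci <- coef(object, i)                # estimates for model i
    est[i, names(ci)[-1]] <- ci[-1]      # drop the intercept estimate
  }

  data.frame(
    est,
    size  = rowSums(s$which) - 1,        # number of explanatory terms
    rss   = s$rss,
    bic   = s$bic,
    cp    = s$cp,
    model = apply(s$which[, -1, drop = FALSE], 1,
                  function(z) paste(vars[z], collapse = " + ")),
    check.names = FALSE
  )
}

sat_models <- subsets_info(best_subsets)
head(sat_models)
```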
It adds in rss (SSRes), BIC and Cp (according to the leaps definitions) and the model size, as well as a short description of the model that can be used to fit it later. The columns that match our explanatory variables contain the parameter estimates for the model. We can use this to make a Cp plot.
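For instance, with base graphics (using the hypothetical sat_models data frame from the sketch above):

```r
# Cp plot: Cp against model size, with the reference line Cp = size + 1
plot(cp ~ size, data = sat_models)
abline(a = 1, b = 1)   # models near or below this line have little estimated bias
```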
Or a BIC plot
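For example:

```r
# BIC plot: BIC against model size; smaller BIC is better
plot(bic ~ size, data = sat_models)
```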
If we like one model, we can pick it out and fit it with lm. For example, the model with the lowest Cp includes Expend+log_Takers+Years+Rank, so we can fit that with lm.
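For instance (again using the hypothetical sat data frame from above):

```r
fit_cp <- lm(SAT ~ Expend + log_Takers + Years + Rank,
             data = subset(sat, State != "Alaska"))
summary(fit_cp)
```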
The code below will generate a response and 25 explanatory variables that are not at all related to the response.
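A sketch of one way to generate such data; the seed, sample size, and variable names here are placeholders.

```r
set.seed(1234)   # change this number later to get a different data set
n <- 100
fake <- data.frame(y = rnorm(n),                         # response: pure noise
                   matrix(rnorm(n * 25), nrow = n,       # 25 unrelated explanatories
                          dimnames = list(NULL, paste0("x", 1:25))))
```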
Then we will use model selection to find the “best” model of size 8 or smaller.
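In this sketch I use Cp to choose among the sizes; choosing by BIC, or simply taking the size-8 model, would work similarly.

```r
# exhaustive search for the best model of each size up to 8
fake_search  <- regsubsets(y ~ ., data = fake, nvmax = 8, method = "exhaustive")
fake_summary <- summary(fake_search)

# pick the size with the smallest Cp and pull out its variable names
best_size <- which.min(fake_summary$cp)
best_vars <- names(coef(fake_search, best_size))[-1]   # drop the intercept
best_vars
```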
Then let’s fit the best model and look at the p-values:
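Continuing the sketch (reformulate builds the formula from the selected variable names):

```r
# refit the selected model with lm and inspect the p-values
fit_fake <- lm(reformulate(best_vars, response = "y"), data = fake)
summary(fit_fake)
```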
Look how small they are!?! But none of these variables are actually related to the response. It is the process of model selection that biases our p-values down, and (if we aren’t wise to it) makes us incorrectly infer a relationship between the response and these explanatories. This highlights why you cannot trust the p-values from variables selected using model selection.
Try again: change the number inside set.seed() to generate a different data set. Run the code to find the “best” model and fit it using lm.