Stat 411/511
# Lab 7

##### Feb 18

Goals:

- Calculating AIC, BIC and Cp from a fitted model
- Using the
`leaps`

package for model selection

The problem with calculating AIC, BIC and Cp, is that different people drop different constants from the formulas. Since they are constant for a particular dataset dropping them doesn’t change the ranking of the models, but it does change the absolute values of the AIC, BIC and Cp statistics. So, let’s calculate the AIC, BIC and Cp following the formulas in Sleuth (and in the lecture slides).

Let’s use the example from the Sex discrimination study (we’ll see on Friday) of the best model,

AIC and BIC only depend on sum of squared residuals (SSRes) for this model. Mallow’s Cp also requires σ² from the full model, so we’ll fit that too.

We can calculate the SSRes by summing the squared the residuals,

For the full model we’ll pull σ from the summary output and square it,

Now we can plug directly into the formulas from Sleuth

For a single model, these statistics aren’t very meaningful. There primary use is to compare models to each other. The `leaps`

package is designed to search for possible models and calculate these statistics for them.

First you’ll need to install the `leaps`

package

The function that does the work finding the best models is `regsubsets`

, as an example of it’s use, here’s how I found the best models in the SAT case study

The first argument specifies the largest model you are willing to consider, here I specify my response is `SAT`

, and the possible terms are all six explanatory variables. `data`

should be pretty familiar by now, here I’m excluding Alaska. `nbest`

specifies how many models should be returned for each size, I want the best seven models. `method`

states the methods for finding the best models, `exhaustive`

considers all models, other options are `forward`

and `backward`

. Another useful argument is `nvmax`

the largest size model you want to consider (by default it is set at 8, so here we don’t need to change it).

The object that is returned is a bit complicated. You can have a look at it with summary and plot

Both of these outputs tell use which parameters are included in each model, but not much else. I’ve written a little function to pull out some other useful information

It adds in rss (SSRes), BIC and Cp (according to `leaps`

definitions) and size as well as a short description of the model that can be used to fit it later. The columns that match our explanatory variables contain the parameter estimates for the model. We can use this to make a Cp plot

Or a BIC plot

If we like one model we can pick it out and fit it with `lm`

. For example the model with the lowest Cp,

includes `Expend+log_Takers+Years+Rank`

, we can fit that with `lm`

by,

The code below will generate a response and 25 explanatory variables that are not at all related to the response.

Then we will use model selection to find the “best” model of size 8 or smaller.

Then let’s fit the best model and look at the p-values:

Look how small they are!?! But none of these variables are actually related to the response. It is the process of model selection that biases our p-values down, and (if we aren’t wise to it) makes us incorrectly infer a relationship between the response and these explanatories. This highlights why you cannot trust the p-values from variables selected using model selection.

**Try again, change the number inside set.seed() to generate a different data set. Run the code to find the “best” model and fit it using lm.**