Stat 411/511

Lab 7

Feb 18

Goals:

  • Calculating AIC, BIC and Cp from a fitted model
  • Using the leaps package for model selection

AIC, BIC and Cp

The problem with calculating AIC, BIC and Cp is that different people drop different constants from the formulas. Since the constants are the same for every model fit to a particular dataset, dropping them doesn’t change the ranking of the models, but it does change the absolute values of the AIC, BIC and Cp statistics. So, let’s calculate AIC, BIC and Cp following the formulas in Sleuth (and in the lecture slides).

Let’s use the best model from the Sex discrimination case study (which we’ll see on Friday),

library(Sleuth3)
library(ggplot2)
library(plyr)
# log transform the response
case1202$log_bsal <- log(case1202$Bsal)
# short names for the explanatory variables, plus
# centered quadratic terms and pairwise interactions
case1202 <- mutate(case1202,
  s = Senior, a = Age, e = Educ, x = Exper,
  t = (s - mean(s))^2, b = (a - mean(a))^2, 
  f = (e - mean(e))^2, y = (x - mean(x))^2,
  m = s*a, n = s*e, v = s*x, c = a*e,
  k = a*x, q = e*x)
fit1 <- lm(log_bsal ~ s + a + e + x + c + k, 
  data = case1202)

AIC and BIC depend only on the sum of squared residuals (SSRes) for this model. Mallows’ Cp also requires an estimate of σ² from the full model, so we’ll fit that too.

fit_full <- lm(log_bsal ~ s + a + e + x + c + k +
    t + b + f + y + m + n + v + q, 
  data = case1202)

We can calculate SSRes by summing the squared residuals (we also store the sample size n and the number of coefficients p, since the formulas need them),

n <- nrow(case1202)
p <- length(coef(fit1))
SSRes <- sum(residuals(fit1)^2)

For the full model we’ll pull σ from the summary output and square it,

sigma_sq_full <- summary(fit_full)$sigma^2  

Now we can plug directly into the formulas from Sleuth.
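
Written out (matching the code that follows), with p the number of regression coefficients in the candidate model, including the intercept:

$$C_p = \frac{\mathrm{SSRes}}{\hat\sigma^2_{\mathrm{full}}} - n + 2p$$
$$\mathrm{BIC} = n \log\left(\mathrm{SSRes}/n\right) + (p+1)\log n$$
$$\mathrm{AIC} = n \log\left(\mathrm{SSRes}/n\right) + 2(p+1)$$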

SSRes/sigma_sq_full - n + 2*p       # Cp
n * log(SSRes/n) + (p + 1) * log(n) # BIC
n * log(SSRes/n) + 2*(p + 1)        # AIC
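
For comparison, R’s built-in AIC() and BIC() functions keep the constants that Sleuth drops, so their absolute values won’t match ours, but both versions rank models identically:

AIC(fit1) # differs from our AIC by n*(log(2*pi) + 1), the same for every model
BIC(fit1) # differs from our BIC by that same constant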

For a single model, these statistics aren’t very meaningful. Their primary use is to compare models to each other. The leaps package is designed to search over possible models and calculate these statistics for them.
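
As a quick illustration of comparing models by hand, here is a minimal sketch that wraps the formulas above in a helper function (sleuth_stats is just a name I’ve made up, not part of any package) and compares fit1 to a smaller candidate:

# compute the three Sleuth-style statistics for any candidate model
sleuth_stats <- function(fit, sigma_sq_full) {
  n <- length(residuals(fit))
  p <- length(coef(fit))
  SSRes <- sum(residuals(fit)^2)
  c(Cp  = SSRes/sigma_sq_full - n + 2*p,
    BIC = n * log(SSRes/n) + (p + 1) * log(n),
    AIC = n * log(SSRes/n) + 2*(p + 1))
}
# compare our chosen model to a smaller candidate: lower is better
rbind(fit1    = sleuth_stats(fit1, sigma_sq_full),
      smaller = sleuth_stats(lm(log_bsal ~ s + a + e + x, data = case1202),
                             sigma_sq_full))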

Using the leaps package

First, you’ll need to install the leaps package:

install.packages("leaps")

The function that does the work of finding the best models is regsubsets. As an example of its use, here’s how I found the best models in the SAT case study:

library(leaps)
case1201$log_Takers <- log(case1201$Takers)
all <- regsubsets(SAT ~ Expend + log_Takers + Income + Years + Public + Rank, 
  data = subset(case1201, State != "Alaska"), 
  nbest = 7, method = "exhaustive")

The first argument specifies the largest model you are willing to consider: here my response is SAT, and the possible terms are all six explanatory variables. data should be pretty familiar by now; here I’m excluding Alaska. nbest specifies how many models should be returned for each size; I want the best seven models. method specifies the method for finding the best models: exhaustive considers all possible models, and the other options are forward and backward. Another useful argument is nvmax, the largest size of model you want to consider (by default it is set at 8, so here we don’t need to change it).
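
As an illustration of those arguments (the particular settings here are just for demonstration), a forward-selection search restricted to models with at most four terms would look like:

fwd <- regsubsets(SAT ~ Expend + log_Takers + Income + Years + Public + Rank, 
  data = subset(case1201, State != "Alaska"), 
  nbest = 1, nvmax = 4, method = "forward")
summary(fwd)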

The object that is returned is a bit complicated. You can have a look at it with summary and plot

summary(all)
plot(all)

Both of these outputs tell us which parameters are included in each model, but not much else. I’ve written a little function to pull out some other useful information:

source(url('http://stat512.cwick.co.nz/code/fortify_leaps.r'))
models <- fortify(all)
head(models)

It adds in rss (SSRes), bic and cp (following the leaps definitions), and size, as well as a short description of each model that can be used to fit it later. The columns that match our explanatory variables contain the parameter estimates for each model. We can use this to make a Cp plot, where geom_abline()’s default line (intercept 0, slope 1) marks Cp = size,

qplot(size, cp, data = models) + 
  geom_abline()
qplot(size, cp, data = models) + 
  geom_abline() +
  ylim(0, 15)
## Warning: Removed 15 rows containing missing values (geom_point).

Or a BIC plot

qplot(size, bic, data = models) 

If we like one model, we can pick it out and fit it with lm. For example, the model with the lowest Cp,

models[models$cp == min(models$cp), ]

includes Expend, log_Takers, Years and Rank. We can fit that model with lm (excluding Alaska, to match the data used in the regsubsets call):

fit_1 <- lm(SAT ~ Expend + log_Takers + Years + Rank, 
  data = subset(case1201, State != "Alaska"))

Why you can’t trust p-values after model selection

The code below will generate a response and 25 explanatory variables that are not at all related to the response.

set.seed(181910)
n <- 100
p <- 25
my_data <- data.frame(y = rnorm(n),
  matrix(rnorm(n * p), nrow = n, ncol = p))

Then we will use model selection to find the “best” model of size 8 or smaller.

all <- regsubsets(y ~ ., data = my_data, nbest = 1,  
  method = "exhaustive")
models <- fortify(all)
models[models$cp == min(models$cp), ]

Then let’s fit the best model and look at the p-values:

summary(lm(y ~ X5 + X13  + X19 + X22, data = my_data))

Look how small they are!?! But none of these variables are actually related to the response. It is the process of model selection that biases the p-values downward, and (if we aren’t wise to it) makes us incorrectly infer a relationship between the response and these explanatory variables. This is why you cannot trust the p-values on variables chosen by model selection.
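
If you want to convince yourself this isn’t a fluke of one seed, here is a minimal sketch that repeats the whole procedure on many simulated null datasets and records the smallest slope p-value in each selected model (sim_once is a name I’ve made up; it works from summary(regsubsets(...)) directly rather than the fortify helper):

# simulate null data, select the lowest-Cp model, refit with lm,
# and return the smallest slope p-value in the selected model
sim_once <- function(n = 100, p = 25) {
  dat <- data.frame(y = rnorm(n),
    matrix(rnorm(n * p), nrow = n, ncol = p))
  sel <- summary(regsubsets(y ~ ., data = dat, nbest = 1,
    method = "exhaustive"))
  best <- which.min(sel$cp)                      # row of the lowest-Cp model
  vars <- setdiff(colnames(sel$which)[sel$which[best, ]], "(Intercept)")
  fit <- lm(reformulate(vars, response = "y"), data = dat)
  min(coef(summary(fit))[-1, "Pr(>|t|)"])
}
pvals <- replicate(20, sim_once())
mean(pvals < 0.05) # far larger than the nominal 0.05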

Try again: change the number inside set.seed() to generate a different dataset, then run the code to find the “best” model and fit it using lm.