Stat 411/511

Lab 5

Feb 4

Goals:

Partial residuals
Case influence statistics

Partial residual plots

R can find partial residuals for you. If you fit a model, you can use the residuals function with the argument type = "partial" to find the partial residuals for each term in the model. That is the residual variation in the response after accounting for all of the other variables in the model. Consider the example from class,

library(Sleuth3)
library(ggplot2)
fit <- lm(log(Brain) ~ log(Body) + log(Gestation), data = case0902)
residuals(fit, type = "partial")

The result is a matrix with one row for every observation and one column for every term in our model. The numbers are the partial residual for the corresponding term in the model. Let’s add those column to our orginal data frame so we can plot them,

p_res <- residuals(fit, type = "partial") 
colnames(p_res) <- paste("res_", colnames(p_res), sep = "") # change column names to be more distinct
case0902_res <- cbind(case0902, p_res)
head(case0902_res)

Now we can plot them as usual,

qplot(log(Gestation), `res_log(Gestation)`, data = case0902_res)

Remember, this shows you the relationship between the response, log brain mass, and log gestation length, after accounting for the other variables in the model (log(body mass)).

You might want to compare it to the relationship we see in the raw data

qplot(log(Gestation), log(Brain), data = case0902_res)

Case influence statistics

In order to investigate the case influence statistics, we will first extract the relevant quantities from the fitted model and store them in a new data.frame, fit_diag

fit_diag <- fortify(fit, case0902)
head(fit_diag)

The columns .hat, .cooksd and stdresid contain the leverage, Cook’s Distance and Studentized residuals respectively. It’s often useful to add a column that just indexes the observations:

fit_diag$obs_num <- 1:nrow(fit_diag)

Let’s take a look:

# investigate influence
qplot(obs_num, .hat, data  = fit_diag) + ggtitle("Leverage")
qplot(obs_num, .cooksd, data  = fit_diag)  + ggtitle("Cook's distance")
qplot(obs_num, .stdresid, data  = fit_diag)  + ggtitle("Studentized residual")

Statisticians use rough cutoffs of 2p/n (p is the number of parameters in the model) for leverage, 1 for Cook’s Distance and above 2 or below -2 for standardized residuals. Observations falling outside these ranges warrant further attention.

** Do any points warrant further investigation?**

We might want to investigate certain points, for example, which observations have a studentized residual greater than 2, this is easily achevied with subsetting:

subset(fit_diag, .stdresid > 2)

Dolphins and human beings have log brain sizes that fall quite far above our line for the mean log brain size. We could also look for points with large negative studentized residuals:

subset(fit_diag, .stdresid < -2)

** Subset out the two observations with the highest leverage?**

You can get all three plots on one page with:

library(reshape)
qplot(obs_num, value, 
  data = melt(fit_diag[, c("obs_num",".hat",".cooksd",".stdresid")], 
    id.vars = "obs_num")) + 
  facet_grid(variable ~ ., scale = "free_y")

Take a look at this data set, where Y is the reponse and X1 and X2 two explanatory variables:

load(url("http://stat512.cwick.co.nz/data/case_inf.rda"))
qplot(X1, Y,  data = df, colour = obs_num == 101)
qplot(X2, Y, data = df, colour = obs_num == 101)
qplot(X1, X2, data = df, colour = obs_num == 101)

We fit the model:

fit <- lm(Y ~ X1 + X2, data = df)

Would you expect the point highlighted to have high leverage, Cook’s Distance and/or studentized residuals?

Check your reasoning by plotting this:

fit_diag <- fortify(fit, df)
qplot(obs_num, value, 
  data = melt(fit_diag[, c("obs_num",".hat",".cooksd",".stdresid")], 
    id.vars = "obs_num"), colour = obs_num == 101) + 
  facet_grid(variable ~ ., scale = "free_y")