Stat 411/511

Lab 3

Jan 21st

Compiling reminder

Your TAs are much happier when your homework submissions are nicely formatted Compiled Notebooks and they don’t have to search through pages to find your answers. Remember,

  • Use #' for lines that include your answers to questions, so it looks like text in Word rather than code.
  • Use # for plots that you looked at but don’t want to include
  • Start a line with #' # to add a big heading, or #' ## for a smaller heading. Use these for question numbers and part numbers to help your TA navigate.

Your TA will quickly highlight this in Joe’s Homework 1 solution code. In future homeworks you will lose 0.5 points for a document that is hard for the TA to find your answers in.

Goals

A shortish lab today to give your TAs time to go over question relating to HW#2 or the practice quiz. We have a couple of things to cover regarding fitted multiple regression models in R.

  • Confidence intervals on single parameters
  • Confidence intervals on linear combinations of parameters
  • Residual plots for multiple regression

Most of the commands will be familiar from simple linear regression.

Confidence intervals on parameters

From the case1002 example from class, our question of interest could be answered with a test that the coefficient on the echo-locating bat indicator variable was zero. Let’s fit the model in R first,

library(Sleuth3)
library(ggplot2)
fit_bats <- lm(log(Energy) ~ log(Mass) + Type, data = case1002)
summary(fit_bats)

Notice the Type rows. There is one for non-echo-locating birds and one for non-echolocating bats, R has made the echolocating bats the reference level (and therefore it doesn’t appear in the model). This isn’t how we represented the model in class and we can’t answer our question of interest directly with this output. To fix that, we to redefine the reference level to be non-echolocating bats. We’ll do that by making a new column:

case1002$Type2 <- relevel(case1002$Type, ref = "non-echolocating bats")

and use the new column in our regression model,

fit_2 <- lm(log(Energy) ~ log(Mass) + Type2,
    data = case1002)
summary(fit_2)
confint(fit_2)

Now we can read off the parameter estimate for the echo-locating bats, 0.07866, and the test for \(\beta_3=0\), p-value = 0.703. Confidence intervals for the parameters can be obtained with

confint(fit_2)
##                               2.5 %  97.5 %
## (Intercept)                 -2.1853 -0.9674
## log(Mass)                    0.7205  0.9094
## Type2echolocating bats      -0.3510  0.5083
## Type2non-echolocating birds -0.1398  0.3443

As a check, we can calculate the 95% confidence interval for the intercept as,

-1.57636  + c(-1, 1)  * qt(0.975, 16) * 0.28724
## [1] -2.1853 -0.9674

Check the intervals for the other two parameters.

The parameter that characterizes the offset between echolocating bats (the reference level) and non-echolocating birds can also be read off the regression fit.

If we are interested in the offset between non-echolocating birds and echolocating bats, we need to refer to our first model, where echolocating bats were the reference level:

summary(fit_bats)
confint(fit_bats)

Now the Typenon-echolocating birds parameter directly captures the offset between non-echolocating birds and echolocating bats, we estimate it to be 0.0236.

Try setting non-echolocating birds as the reference level. Which parameter captures the offset between non-echolocating birds and echolocating bats? How does it differ to the one above?

Residual plots in multiple regression

There really isn’t anything new. Checking assumptions of linearity, constant spread and normality are exactly the same as in simple linear regression. The only difference being you might want to check the residuals against each explanatory (now we have more than one).

Fit the model, plot with the model in the data argument to qplot. For the bats case study, we first might look at the residuals against the explantories,

qplot(`log(Mass)`, .resid, data = fit_2)
qplot(Type2, .resid, data = fit_2)

Notice that the data argument is the model fit, not the original data.frame. Also note the backticks, around log(Mass) this is because the model has a log(Mass) term in it which isn’t technically a valid R variable name, but the backticks let us use it. Looks like there might be some evidence that birds have a larger spread about the regression line than the bats, but it could also be an artifact of having many more bird measurements than bat measurements.

We should also check the residuals against fitted values

qplot(.fitted, .resid, data = fit_2) +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
qplot(sample =  .resid, data = fit_bats)

Any signs of non-constant spread or non-normality?