When one model is a special case of another (can be obtained by specifying values for some parameters) they can be compared using an extra Sum of Squares F-test. A common example is comparing the separate lines to parallel lines model (the parallel lines model is the separate lines model with the coefficients on the interaction terms are zero). The simpler model (the one with fewer parameters) is our null model. A small p-value in the extra sum of squares F-test, gives evidence against the simpler model. In R, you fit both models with lm, then use the
anova function to do the test,
In this case we have no evidence against the parallel lines model.
Another common use is to compare models where a continuous variable could be treated as a categorical variable (like a lack of fit F-test). For example, thinking back to the Meadowfoam example, we could compare a model with a set of indicator variables for Intensity to a model where Intensity enters as a continuous variable,
A p-value of 0.6848 give us no evidence against the simpler model, treating Intensity as continuous. Or in other words, we have no evidence against the mean number of flowers being a straight line function of Intensity.
Why can’t we put Mass in as an Indicator variable in the echolocating bat study?
We are going to use part 1 of homework 3 as our starting point. Let’s say I’ve decided on the following model:
na.action = na.exclude is new for you. R automatically ignores observations with missing values (
NAs) but this additionally tells R to return NA for residuals or predictions for these observation with missing values. There is no harm in always having it in your call to
One way to explore the structure of our model, is to create estimates of the mean response at various values for the explanatory variables. The first step is to set up a new data.frame that describes all the explanatory values we are interested in. Let’s estimate a mean log Income, for every combination of gender, with all the unique values of years of education, and intelligence score in steps of 10 from 0 to 100.
Take a look at new df. Each row specifies, values for the explanatories we want to estimate the mean at. The first for example, is a male, with a score of zero on the AFQT and 12 years of education.
** Why didn’t I want to estimate the mean for a AFQT score of 110?**
expand.grid creates a data.frame from all combinations of the supplied variables.
unique finds all possible values that occurred in the data, so for example in the
Gender column it returns,
female. If you are curious try running just the parts on the RHS of the equals, i.e.
We can get the estimated mean income using the
Remember, the estimated mean response is the same as the predicted response. This returns 330 numbers, one for each row of our
newdf data.frame, so let’s store them in a new column called
Now, it’s up to us to explore different ways of plotting the results to explore the model. Here’s one idea, mean income against education, with lines joining the intelligence levels, faceted by gender.
Here’s another, mean income against intelligence, just for 12 years of education, but a different line for males and females,
These are on the log scale, but we can backtransform our mean by taking the exponential of our estimated mean.
Fit a new model that adds a Educ:AFTQ interaction, and repeat the above plots. Describe the change in the modelled means.
Let’s add some confidence intervals on the mean response.
Compare the structure of these two commands (i.e. use
In the second line, the
interval = "confidence" argument asks for confidence intervals on the mean response to be returned as well as fitted values. The first command returns a vector of 16 numbers and that’s why it can be saved directly into a new column in
newdf above. The second command returns a list of data.frame with 3 columns. It doesn’t make sense to store three columns in one column so instead we join it to our newdf data.frame with
cbind (short for column bind)
Now we can use this
newdf to plot our confidence interval on the mean response, but to keep our plot simple for now, let’s do one education level for males (see the
subset in the data argument below),
I used different line types just to illustrate which is which.
Another option is to use a ribbon for the interval
Try plotting the estimated mean income for both mean and women with 12 years of Education with confidence intervals. Add
colour = Gender, to the
qplot function to make sure the two lines are drawn separately.
If you want to see the estimated median response on the original scale just back-transform the mean and interval endpoints
Prediction intervals can be produced by using the
interval = "prediction" argument in
If you try to add the results of
predict to newdf you’ll run into a problem where you have multiple columns called
upr. One way around this is to rename the columns before
cbinding them to newdf.
Then we can add them to our plot