When one model is a special case of another (that is, it can be obtained by fixing some of the other model's parameters at particular values), the two models can be compared with an extra sum of squares F-test. A common example is comparing the separate lines model to the parallel lines model (the parallel lines model is the separate lines model with the coefficients on the interaction terms set to zero). The simpler model (the one with fewer parameters) is our null model. A small p-value in the extra sum of squares F-test gives evidence against the simpler model. In R, you fit both models with `lm`, then use the `anova` function to do the test.
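A minimal sketch of the pattern, where `y`, `x`, `group` and `mydata` are placeholder names for your own response, continuous explanatory variable, grouping factor and data:

```r
# parallel lines (the null model) versus separate lines
fit_parallel <- lm(y ~ x + group, data = mydata)
fit_separate <- lm(y ~ x * group, data = mydata)
anova(fit_parallel, fit_separate)  # extra sum of squares F-test
```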
In this case we have no evidence against the parallel lines model.
Another common use is to compare models where a continuous variable could instead be treated as categorical (like a lack of fit F-test). For example, thinking back to the Meadowfoam example, we could compare a model with a set of indicator variables for Intensity to a model where Intensity enters as a continuous variable.
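A sketch of that comparison, assuming the Meadowfoam data are in a data frame called `case0901` with columns `Flowers`, `Intensity` and `Time` (check the names in your own copy of the data):

```r
# Intensity as a categorical variable versus Intensity as a continuous variable
fit_categorical <- lm(Flowers ~ factor(Intensity) + Time, data = case0901)
fit_continuous  <- lm(Flowers ~ Intensity + Time, data = case0901)  # the simpler (null) model
anova(fit_continuous, fit_categorical)
```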
A p-value of 0.6848 gives us no evidence against the simpler model, which treats Intensity as continuous. In other words, we have no evidence against the mean number of flowers being a straight-line function of Intensity.
Why can’t we put Mass in as an Indicator variable in the echolocating bat study?
We are going to use part 1 of homework 3 as our starting point. Let’s say I’ve decided on the following model:
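Something along these lines, where the data frame name `nlsy` and the column names `Income`, `Gender`, `Educ` and `AFQT` are assumptions; swap in the names from your homework data:

```r
# mean log income modelled on gender, years of education and AFQT score
fit <- lm(log(Income) ~ Gender + Educ + AFQT, data = nlsy,
          na.action = na.exclude)
```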
The argument `na.action = na.exclude` is new for you. R automatically ignores observations with missing values (`NA`s), but this additionally tells R to return `NA` for the residuals or predictions of those observations with missing values. There is no harm in always having it in your call to `lm`.
One way to explore the structure of our model is to create estimates of the mean response at various values of the explanatory variables. The first step is to set up a new data.frame that describes all the explanatory values we are interested in. Let's estimate the mean log income for every combination of gender, all the unique values of years of education, and intelligence score in steps of 10 from 0 to 100.
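A sketch of building that grid with `expand.grid` (again using the assumed `nlsy` names; the exact row order will depend on your data):

```r
# every combination of gender, observed years of education, and AFQT in steps of 10
newdf <- expand.grid(
  Gender = unique(nlsy$Gender),
  Educ   = unique(nlsy$Educ),
  AFQT   = seq(0, 100, by = 10)
)
```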
Take a look at `newdf`. Each row specifies values for the explanatory variables at which we want to estimate the mean. The first row, for example, is a male with a score of zero on the AFQT and 12 years of education.
**Why didn’t I want to estimate the mean for an AFQT score of 110?**
`expand.grid` creates a data.frame from all combinations of the supplied variables. `unique` finds all the distinct values that occurred in the data, so, for example, for the `Gender` column it returns `male` and `female`. If you are curious, try running just the parts on the right-hand side of the equals signs on their own.
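For instance, reusing the assumed `nlsy` names:

```r
unique(nlsy$Gender)   # the gender levels that occur in the data, e.g. male and female
unique(nlsy$Educ)     # every distinct years-of-education value observed
seq(0, 100, by = 10)  # 0, 10, 20, ..., 100
```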
We can get the estimated mean income using the `predict` function. Remember, the estimated mean response is the same as the predicted response. This returns 330 numbers, one for each row of our `newdf` data.frame, so let's store them in a new column called `est_mean` in `newdf`.
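Roughly, using the `fit` and `newdf` objects sketched above:

```r
# estimated mean log income for every row of newdf, stored as a new column
newdf$est_mean <- predict(fit, newdata = newdf)
```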
Now it's up to us to explore different ways of plotting the results to understand the model. Here's one idea: mean income against education, with lines joining the intelligence levels, faceted by gender.
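One way to draw that with `qplot` (a sketch under the assumed column names):

```r
library(ggplot2)
# mean log income against education, a line for each AFQT level, faceted by gender
qplot(Educ, est_mean, data = newdf, geom = "line",
      group = AFQT, colour = AFQT) +
  facet_wrap(~ Gender)
```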
Here's another: mean income against intelligence, just for 12 years of education, but with a different line for males and females.
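For example:

```r
# mean log income against AFQT at 12 years of education, one line per gender
qplot(AFQT, est_mean, data = subset(newdf, Educ == 12),
      geom = "line", colour = Gender)
```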
These estimates are on the log scale, but we can back-transform by taking the exponential of the estimated means.
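For instance (the new column name is just illustrative):

```r
# exponentiate to get back to the original income scale
# (this estimates the median income, because the model is for the mean of log income)
newdf$est_backtransformed <- exp(newdf$est_mean)
```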
Fit a new model that adds an `Educ:AFQT` interaction, and repeat the above plots. Describe the change in the modelled means.
Let’s add some confidence intervals on the mean response.
Compare the structure of these two commands (i.e. use `str`).
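A sketch of the two calls, reusing the `fit` and `newdf` from above:

```r
# compare the structure of the two kinds of output from predict()
str(predict(fit, newdata = newdf))
str(predict(fit, newdata = newdf, interval = "confidence"))
```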
In the second command, the `interval = "confidence"` argument asks for confidence intervals on the mean response to be returned as well as the fitted values. The first command returns a vector of numbers, one for each row of `newdf`, and that's why it could be saved directly into a new column in `newdf` above. The second command returns a matrix with three columns (`fit`, `lwr` and `upr`). It doesn't make sense to store three columns in one column, so instead we join it to our `newdf` data.frame with `cbind` (short for column bind).
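Something like:

```r
# bind the fit, lwr and upr columns from predict() onto newdf
newdf <- cbind(newdf, predict(fit, newdata = newdf, interval = "confidence"))
```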
Now we can use this `newdf` to plot our confidence interval on the mean response, but to keep our plot simple for now, let's do one education level for males (see the `subset` in the data argument below).
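A sketch of that plot; the level label `"male"` is an assumption, so check how `Gender` is coded in your data:

```r
# confidence interval on the mean for males with 12 years of education
qplot(AFQT, fit, data = subset(newdf, Educ == 12 & Gender == "male"),
      geom = "line") +
  geom_line(aes(y = lwr), linetype = "dashed") +
  geom_line(aes(y = upr), linetype = "dashed")
```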
I used different line types just to illustrate which is which.
Another option is to use a ribbon for the interval.
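For example:

```r
# the same plot with the interval drawn as a shaded ribbon
qplot(AFQT, fit, data = subset(newdf, Educ == 12 & Gender == "male"),
      geom = "line") +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)
```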
Try plotting the estimated mean income for both men and women with 12 years of education, with confidence intervals. Add `colour = Gender` to the `qplot` function to make sure the two lines are drawn separately.
If you want to see the estimated median response on the original scale, just back-transform the mean and interval endpoints.
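One possible version, continuing the sketch above:

```r
# exponentiate the estimate and the interval endpoints to get the estimated
# median income (and an interval for it) on the original scale
qplot(AFQT, exp(fit), data = subset(newdf, Educ == 12 & Gender == "male"),
      geom = "line") +
  geom_ribbon(aes(ymin = exp(lwr), ymax = exp(upr)), alpha = 0.3)
```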
Prediction intervals can be produced by using the `interval = "prediction"` argument in `predict`.
If you try to add the results of `predict` to `newdf`, you'll run into a problem: you would end up with multiple columns called `fit`, `lwr` and `upr`. One way around this is to rename the columns before `cbind`ing them to `newdf`.
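A sketch of that workaround (the `pred_` column names are just illustrative):

```r
# prediction intervals, with the columns renamed before binding them to newdf
pred <- predict(fit, newdata = newdf, interval = "prediction")
colnames(pred) <- c("pred_fit", "pred_lwr", "pred_upr")
newdf <- cbind(newdf, pred)
```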
Then we can add them to our plot.
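For instance, overlaying them on the confidence-interval ribbon from before:

```r
# add the (wider) prediction interval to the confidence-interval plot
qplot(AFQT, fit, data = subset(newdf, Educ == 12 & Gender == "male"),
      geom = "line") +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
  geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr), alpha = 0.15)
```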