Goals for today’s lab
Factors are a special type of variable in R used to hold categorical variables. We’ve already seen lots of them. In ST411/511 all of our grouping variables were stored in R as factors. Have a look at the meadowfoam case study:
The Time
variable is a factor. It has two levels, R lingo for two categories. Why are factors so useful? When we tell R to put a factor into a regression, it automatically generates the indicator variables required for us! In class I generated my own indicator variable for early
and then added it to the regression
But I could have just added the Time column (which is a factor) to the regression formula,
Check the output is exactly the same. The TimeEarly
row corresponds to the indicator that Time == "Early"
.
There are two useful things to know how to do with factors. One, is create a factor from a numeric variable, and the other is to reorder the levels of the factor.
Creating a factor is easy, simply use the factor
function. Let’s make a new intensity_f
variable that is a factor version of Intensity.
By default, the levels are put in alphanumerical order, sometimes this is what you want, sometimes it isn’t. You can completely specify the order, by using the levels argument when you create the variable or you can assign a new baseline level with the relevel
function.
lm
is the workhorse in R for fitting multiple regression models. We’ve already seen it for fitting simple linear regression models. Take the previous code as an example:
lm(Flowers ~ Intensity + Time, data = case0901)
The Flowers ~ Intensity + Time
part is called the formula. The nice thing about the Sleuth shorthand is that it closely corresponds to the formula in lm. The only trick is with indicator variables. Sleuth uses capital letters to indicate an collection of indicator variables. In R you need to either use a column that is already a factor or wrap the column name in factor
to guarantee R treats it as a factor.
In Sleuth shorthand, the model
\[
\mu\{\textit{Flowers} | \textit{INTENSITY, TIME}\} = INTENSITY + TIME
\]
corresponds to a model with indicator variables for intensity and time, in lm
we could fit this with
Examine the output. How many parameters are fit? Notice that there is no coefficient for factor(Intens)150
, we only need 5 indicator variables to represent all 6 levels of Intensity. By default lm
uses the first level as a baseline. If we wanted the 600 level to be the baseline, we would relevel
intensity, i.e.
Now there is no coefficient for 600. To keep the output tidy you might prefer to make the intensity factor first then include it in the model
There are a number of ways to add interactions. The simplest is to let R do all the work and use :
in the model formula to indicate an interaction. For example
adds the interaction between Intensity and Time (remember Time is a factor) so this is the separate lines model, \[ \mu\{\textit{Flowers} | \textit{intensity, TIME}\} = \textit{intensity} + \textit{TIME} + \textit{intensity}\times \textit{TIME} \\ \mu\{\textit{Flowers} | \textit{intensity, TIME}\} = \beta_0 + \beta_1 \textit{intensity} + \beta_2 \textit{early} + \beta_3 \textit{intensity}\times \textit{early} \]
You can add squared terms or any other calculated explanatories two ways. Either compute the new variable beforehand and add it to the regression, or compute them in the formula (but you’ll have to wrap them in I())
The ex0923
dataset in Sleuth3
contains observations on income, intelligence scores and years of educations.
Match up the code, shorthand and full model representations of the following multiple regression models relating logarithm of income to education and gender variables.
Code:
lm(log(Income2005) ~ Educ + Gender, data = ex0923)
lm(log(Income2005) ~ Educ + relevel(Gender, ref = "male"), data = ex0923)
lm(log(Income2005) ~ factor(Educ) + Gender, data = ex0923)
lm(log(Income2005) ~ Educ + Gender + Educ:Gender, data = ex0923)
lm(log(Income2005) ~ factor(Educ) + Gender + factor(Educ):Gender, data = ex0923)
Shorthand: \[ \mu\{\textit{log(Income)} | \textit{GENDER, EDUC}\} = \textit{EDUC} + \textit{GENDER} \]\[ \mu\{\textit{log(Income)} | \textit{GENDER, educ}\} = \textit{educ} + \textit{GENDER} + \textit{GENDER} \times \textit{educ} \]\[ \mu\{\textit{log(Income)} | \textit{GENDER, educ}\} = \textit{educ} + \textit{GENDER} \]\[ \mu\{\textit{log(Income)} | \textit{GENDER, EDUC}\} = \textit{EDUC} + \textit{GENDER} + \textit{GENDER} \times \textit{EDUC} \]
Full model:
\[ \mu\{\textit{log(Income)} | \textit{Gender, Education}\} = \beta_0 + \beta_1 \textit{educ7} + \beta_2 \textit{educ8} + \beta_3 \textit{educ9} + \beta_4 \textit{educ10} + … + \beta_{14} \textit{educ20} + \beta_{15}\textit{male} + \\ \beta_{16} \textit{educ7} \times \textit{male} + \beta_{17} \textit{educ8} \times \textit{male } + … + \beta_{29} \textit{educ20} \times \textit{male} \] \[ \mu\{\textit{log(Income)} | \textit{Gender, Education}\} = \beta_0 + \beta_1 \textit{educ7} + \beta_2 \textit{educ8} + \beta_3 \textit{educ9} + \beta_4 \textit{educ10} + \ldots + \beta_{14} \textit{educ20} + \beta_{15}\textit{male} \] \[ \mu\{\textit{log(Income)} | \textit{Gender, Education}\} = \beta_0 + \beta_1 \textit{educ} + \beta_2 \textit{male} \] \[ \mu\{\textit{log(Income)} | \textit{Gender, Education}\} = \beta_0 + \beta_1 \textit{educ} + \beta_2 \textit{male} + \beta_3 \textit{male} \times \textit{educ} \] \[ \mu\{\textit{log(Income)} | \textit{Gender, Education}\} = \beta_0 + \beta_1 \textit{educ} + \beta_2 \textit{female} \]
Try running the code if you get stuck.