Stat 411/511

Lab 6

Feb 11

Goals

  • Averaging over cells, rows and columns in two-way tables
  • Displaying modeled means

As we’ve seen in lecture, a two-way ANOVA is just a special case of multiple regression. There is nothing new in terms of fitting the saturated model, examining residuals and influence, testing it against the additive model with an F-test, and interpreting parameters. There are however some special operations that are particularly useful for data that fits into a two way table.

As an example we’ll use the Seaweed Grazing experiment from class.

library(Sleuth3)
library(ggplot2)
head(case1301)

Remember the raw response is the percentage of the rock covered in seaweed. We saw a transformation was appropriate, so let’s create new variable for the log regeneration ratio.

case1301$log_r_ratio <- with(case1301, log(Cover/(100-Cover)))

As a useful summary we might be interested in the average regeneration ratio at each combination of our two factors. If our data is balanced this also corresponds to our estimates of the mean response at each combination of factors in the saturated model. The aggregate function is very useful for this kind of operation. For example, run

aggregate(log_r_ratio ~ Block + Treat, data = case1301, FUN = mean)

It takes a formula just like lm, on the left hand side of the ~ is the variable contained in the cells of our table, on the right hand side the variables that we want to group by (in the two-way case, our rows and columns). We tell it where those variables live and the final argument, tells aggregate if there is more than one observation in the cell define by Block and Treat to aggregate them using the mean function. The result is a data.frame with the mean of the log regeneration ratio for each combination of Block and Treat. Sometimes it’s easier to display the results in a table. We’ll save the results of the averaging, then use the xtabs function to display it as a two way table, again the formula notation is used.

cell_avgs <- aggregate(log_r_ratio ~ Block + Treat, data = case1301, FUN = mean)
xtabs(log_r_ratio ~ Block + Treat, cell_avgs)

For display in a report I might round these numbers too

round(xtabs(log_r_ratio ~ Block + Treat, cell_avgs), 2)

This also gives us a convenient way to check if the data is balanced. Instead of aggregating over cells using mean we can aggregate by counting the number of observations using length

cell_counts <- aggregate(log_r_ratio ~ Block + Treat, data = case1301, FUN = length)
xtabs(log_r_ratio ~ Block + Treat, cell_counts)

Yup, two observations per cell, this data is balanced.

We can also use aggregate to find Treatment and Block averages

aggregate(log_r_ratio ~ Block, data = case1301, mean)
aggregate(log_r_ratio ~ Treat, data = case1301, mean)

Or since the data is balanced, find the same averages as row and column averages of our cell averages:

round(addmargins(xtabs(log_r_ratio ~ Block + Treat, cell_avgs), FUN = mean),2)

Compare the column means to the aggregating with just Blocks. They are identical. As an example of this not working let’s do the same thing with the Pygmalion data

head(case1302)
pyg_cell_avgs <- aggregate(Score ~ Company + Treat, data = case1302, FUN = mean)
aggregate(Score ~ Company, data = case1302, FUN = mean)
addmargins(xtabs(Score ~ Company + Treat, pyg_cell_avgs), FUN = mean)

Check that the Company averages are different to the average of the columns in the second table.

Going back to our table of averages for the grazing data

round(addmargins(xtabs(log_r_ratio ~ Block + Treat, cell_avgs), FUN = mean),2)

Since the data is balanced we can use this table to find the mean response for each cell using the saturated model. In the saturated model the estimated mean response is simply the average response in the cell, so we can read it right off. For example, the estimated mean response for Block 1 with the control treatment is -1.51. We’ll talk about finding the estimated mean using the additive model later.

The the other useful purpose for the cell averages is plotting. We can use our cell_avgs data.frame,

qplot(Block, log_r_ratio, data = cell_avgs, colour = Treat) +
ylab("Average log regeneratio ratio")

It might be nicer to reorder the block and join treatments with lines:

qplot(reorder(Block, log_r_ratio), log_r_ratio, data = cell_avgs, colour = Treat, geom = "line", group = Treat) +
ylab("Average log regeneratio ratio")

And we might want to add the original data back in to keep in mind the variation about these averages

qplot(reorder(Block, log_r_ratio), log_r_ratio, data = cell_avgs, colour = Treat, geom = "line", group = Treat) +
ylab("Average log regeneratio ratio") +
geom_point(data = case1301)

Choosing between the additive and non-additive model

Nothing new here, we fit both models and test with and extra sum of squares F-test

fit_sat <- lm(log_r_ratio ~ Block + Treat + Block:Treat, data = case1301)
fit_add <- lm(log_r_ratio ~ Block + Treat, data = case1301)
anova(fit_add, fit_sat)

“There is no evidence that the effect of treatment depends on the block (extra sum of squares F-test on interaction, p value = 0.12).”

Given the additive model is adequate, let’s find the estimated mean response for all the cells in our table. The easiest way is to build a new data.frame with all combinations of our levels and use predict

new_data <- expand.grid(
  Block = unique(case1301$Block),
  Treat = unique(case1301$Treat))
new_data$pred <- predict(fit_add, newdata =new_data)

We can present this as a table or plot them

round(xtabs(pred ~ Block + Treat, new_data), 2) # a table
qplot(reorder(Block, pred), pred, data = new_data, colour = Treat, geom = "line", group = Treat)

The above code works equally well, if the saturated model had been used:

new_data$pred_sat <- predict(fit_sat, newdata =new_data)
round(xtabs(pred_sat ~ Block + Treat, new_data), 2)
qplot(reorder(Block, pred_sat), pred_sat, data = new_data, colour = Treat, geom = "line", group = Treat)

Estimating particular contrasts of interest can be achieved in two ways: reparameterizing so that the contrasts of interest is captured by a single parameter, or using the multcomp package to specify the contrast and estimate it (possibly with multiple comparison adjustments).