Stat 411/511

Lab 8

Feb 24


  • Preforming the calculations needed to adjust for serial correlation

We will use a dataset that contains yearly measurements of the death rate due to firearms and motor vehicles in the US from 1968 to 1993:


In lab we will answer the question, “Is there a trend in firearm deaths?”.

Let’s start with a plot of the number of deaths against Year:

qplot(Year, FirearmDeaths, data = ex1516, geom = c("point","line"))
qplot(Year, FirearmDeaths, data = ex1516, geom = c("point","line")) +
  geom_smooth(method = "lm")

We see a pretty clear increasing trend. With the linear fit added to the plot, we see evidence of “runs”, a number consecutive observations that are above the mean and then below the mean. The grey band is the usual standard error on the estimated mean response. If there is positive serial correlation, the width of this band will be underestimated (i.e. the true uncertainty in the mean is probably bigger than we are admitting in this plot).

In order to examine the amount of serial correlation and the appropriateness of the autoregressive of order 1 (AR(1)) model, we fit the model we are interested in and examine the residuals

fit_lm <- lm(FirearmDeaths ~ Year, data = ex1516)
qplot(Year, .resid, data = fit_lm, geom = c("point","line"))

Our model summary says we estimate that for each year the death rate increases by 0.4561, with a standard error of 0.053. If there is positive serial correlation, that standard error is an underestimate.

The above plot shows the residuals after accounting for a linear trend over time. We still see the runs (although now it’s easier because we compare them to a flat line at zero). To decide whether the AR(1) model is appropriate we examine the partial autocorrelation function (pacf) of the residuals. When an AR(1) model is appropriate, we expect to see the first lag to have the value α, but all other lags zero. Due to sampling error we see the first lag around α and the others near zero. The dashed lines are 95% confidence intervals based on data with no serial correlation. So, if the pacf at a lag goes beyond the confidence interval that is evidence that the partial serial correlation is not zero at that lag.


Take a look at these other pacf plots. Which represent an AR(1) model?


Ok, so an AR(1) model is appropriate for our residual firearm deaths. Let’s estimate the first serial correlation coefficient, we can either take the 2nd value of the autocorrelation function, or the first of the partial autocorrelation function:


To adjust our standard error on the slope against year, we can either multiply by the adjustment factor, or filter both firearms deaths and years. Let’s do the filtering approach. Each explanatory variable has to be filtered in the same way as the response First, let’s create lagged variables, and save our estimate of the first serial correlation coefficient.

n <- nrow(ex1516)
ex1516$Fire_lag1 <- with(ex1516, c(NA, FirearmDeaths[-n]))
ex1516$Year_lag1 <- with(ex1516, c(NA, Year[-n]))
r <- pacf(residuals(fit_lm))$acf[1]

Now we can create our filtered variables, and run the regression with the filtered variables.

ex1516$Fire_filter <- with(ex1516, FirearmDeaths - r*Fire_lag1)
ex1516$Year_filter <- with(ex1516, Year - r*Year_lag1)
fit_filtered_lm <- lm(Fire_filter ~ Year_filter, data = ex1516)

Now our standard error is adjusted for serial correlation (and consequently the p-value too).

There is convincing evidence that there is a trend in firearm deaths (t-test on slope with adjustment for serial correltiaon, p-value = 0.0007). We estimate that each year is associated with an increase in mean number of firearm deaths of 0.41 deaths per 1000 people. With 95% confidence, each year is associated with an increase in mean number of firearm deaths between 0.19 and 0.63 deaths per 1000 people.