---
title: Inference in the balanced one-way random effect ANOVA
date: 2015-03-12
---

&lead

```{r, message=FALSE, echo=FALSE}
library(data.table)
library(magrittr)
library(dplyr)
library(knitr)
opts_chunk$set(collapse=TRUE, fig.path='assets/fig/AV1Rinference-')
assignInNamespace("cedta.override", c(data.table:::cedta.override, "poirot"), "data.table")
```

The one-way random effect ANOVA model is the simplest linear Gaussian model with random effects: one random factor, and no other factor or covariate. However, the methods of inference — to get estimates and confidence intervals for various parameters (variance components, coefficient of variation), prediction intervals, and tolerance intervals — are not so well known. This is what we provide in this article, in the simplest situation: the case of *balanced data* (at the end we give a possible way to handle the unbalanced situation).

Recall that, denoting by $y_{ij}$ the $j$-th response in group $i$, this model assumes that the responses $y_{ij}$ are generated in two steps. First, the group means $\mu_i$ are independently generated according to a Gaussian distribution ${\cal N}(\mu, \sigma^2_b)$, where $\mu$ is the overall mean and $\sigma^2_b$ is the so-called *between variance*. Second, the responses $y_{ij}$, $j=1,\ldots,J$, for each group $i$, are independently distributed according to a Gaussian distribution ${\cal N}(\mu_i, \sigma^2_w)$ with *within variance* $\sigma^2_w$ and mean $\mu_i$. In short, the model can be written:
$$\begin{cases} \mu_i \sim_{\text{iid}} {\cal N}(\mu, \sigma^2_b) & i=1,\ldots,I \\ (y_{ij} \mid \mu_i) \sim_{\text{iid}} {\cal N}(\mu_i, \sigma^2_w) & j=1,\ldots,J \end{cases}.$$

![](http://stla.github.io/stlapblog/posts/assets/fig/AV1random-anovarandommodel.png)

If you like the moment of inertia representation of the variance on this picture (I do), see [my article about it](http://stla.github.io/stlapblog/posts/Variance_inertia.html).

Throughout the article, we will use the following function to get the within sum-of-squares $SS_w$ and the between sum-of-squares $SS_b$:

```{r}
SOS <- function(y, group){
  require(dplyr)
  return(data.frame(y=y, group=group) %>%
           mutate(mean=mean(y)) %>%
           group_by(group) %>% mutate(groupmean=mean(y)) %>%
           ungroup %>%
           do(summarise(., SSw=crossprod(y-groupmean), SSb=crossprod(groupmean-mean))) %>%
           as.list)
}
```

## Bayesian posterior distributions

All Bayesian methods we provide are based on the posterior distributions generated by this code. I'm using a `data.table` for storing the simulations.

```{r}
ranova <- function(y, group, nsims=50000){
  require(data.table)
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group))==length(y))){ stop("balanced only!") }
  J <- length(y)/I
  SS <- SOS(y, group) # sums of squares SSb and SSw
  SSb <- SS$SSb; SSw <- SS$SSw
  omean <- mean(y) # overall mean
  k <- tot <- 0
  z <- rnorm(nsims)
  ETAb <- ETAw <- rep(NA, nsims)
  while(k < nsims){
    tot <- tot + 1
    etab <- SSb/rchisq(1, I-1)
    etaw <- SSw/rchisq(1, I*(J-1))
    if(etab > etaw){ # keep the draw only when etab > etaw
      k <- k+1
      ETAb[k] <- etab
      ETAw[k] <- etaw
    }
  }
  out <- data.table(mu=omean + sqrt(ETAb/I/J)*z, sigma2w=ETAw, sigma2b=(ETAb-ETAw)/J)
  reject <- 100*(tot-nsims)/tot
  cat("Rejection percentage: ", round(reject,1), "%")
  setattr(out, "reject", reject)
  return(out)
}
```

These posterior distributions can be found in [Krishnamoorthy & Mathew][KM]'s book. I believe they are derived in [Box & Tiao][BT]'s book.
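In formulas, one accepted draw of this sampler is
$$\eta_w \sim {\cal IG}\left(\frac{I(J-1)}{2},\frac{SS_w}{2}\right), \qquad \eta_b \sim {\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right), \qquad \text{kept only if } \eta_b > \eta_w,$$
$$\mu \mid \eta_b \sim {\cal N}\left(\bar y_{\bullet\bullet}, \frac{\eta_b}{IJ}\right), \qquad \sigma^2_w = \eta_w, \qquad \sigma^2_b = \frac{\eta_b-\eta_w}{J},$$
where $\bar y_{\bullet\bullet}$ denotes the overall mean of the data. This is nothing more than the code above written in mathematical notation.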
As we can see in the body of the function, the algorithm first generates simulations of a couple $(\eta_w, \eta_b)$ whose joint distribution is given by the two inverse-Gamma distributions ${\cal IG}\left(\frac{I(J-1)}{2},\frac{SS_w}{2}\right)$ and ${\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right)$ conditioned on $\eta_b > \eta_w$, and then sets $\sigma^2_w=\eta_w$ and $\sigma^2_b=\frac{\eta_b-\eta_w}{J}$. The `ranova` function also prints the percentage of rejected simulations (those for which `etab <= etaw` was drawn). Given the data, the theoretical probability of rejecting a draw is
$$R(SS_w,SS_b) = \Pr\left(B \leq \frac{SS_w}{SS_w+SS_b}\right), \qquad B \sim \text{Beta}\left(\frac{I(J-1)}{2}, \frac{I-1}{2}\right).$$
As an exercise, you can check why the probability of rejection is always $50\%$ for $\rho=1$.

## Simulated data

Throughout the article, we will illustrate the methods on two datasets, `dat1` and `dat2`. I load the `magrittr` package to use the pipe operator `%>%`.

```{r}
library(magrittr)
set.seed(314)
I <- 3
J <- 6
mu <- 10
sigmab <- 2
sigmaw <- 1
dat1 <- data.frame(sapply(rnorm(I,mu,sigmab), function(mu) rnorm(J,mu,sigmaw))) %>%
  setNames(LETTERS[1:I]) %>% stack %>% setNames(c("y", "group"))
sigmab <- 1
dat2 <- data.frame(sapply(rnorm(I,mu,sigmab), function(mu) rnorm(J,mu,sigmaw))) %>%
  setNames(LETTERS[1:I]) %>% stack %>% setNames(c("y", "group"))
# stacked datasets
dat <- rbind(cbind(data="dat1", dat1), cbind(data="dat2", dat2))
```

The between-variance $\sigma_b^2$ is the only difference between the parameters used to generate our two datasets ($4$ vs $1$). We can already store the posterior simulations of the Bayesian method:

```{r sims, message=FALSE, cache=TRUE}
sims1 <- with(dat1, ranova(y, group, nsims=1e6))
sims2 <- with(dat2, ranova(y, group, nsims=1e6))
# stacked simulations in a data.table
sims <- rbind(sims1[, "data":="data1"], sims2[, "data":="data2"])
setkey(sims, data)
```

Note that the theoretical rejection percentages could be calculated with our above formula of $R(SS_w,SS_b)$:

```{r}
by(dat, dat$data, FUN=function(data){
  with(data, with(SOS(y,group),
                  round(100*pbeta(SSw/(SSw+SSb), I*(J-1)/2, (I-1)/2), 1)
  ))
})
```

Because there is no rejection for the first dataset, the posterior distribution of $\eta_b$ is ${\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right)$, as confirmed by the density plot below (see this [stats.stackexchange discussion](http://stats.stackexchange.com/questions/65866/good-methods-for-density-plots-of-non-negative-variables-in-r/) if you wonder why I use the `logspline` package):

```{r, fig.width=6, fig.height=4, cache=TRUE}
with(dat1, {
  ssb <- J*crossprod(aggregate(y~group, FUN=mean)$y-mean(y))
  curve(dgamma(x, (I-1)/2, ssb/2), to=0.4, lwd=2, xlab=NA, ylab=NA, axes=FALSE)
})
axis(1)
library(logspline)
x <- 1/with(sims1[sample(nrow(sims1),1e4)], J*sigma2b+sigma2w)
plot(logspline(x), add = TRUE, col = "red", lwd = 2)
```

But not for the second dataset:

```{r, fig.width=6, fig.height=4}
with(dat2, {
  ssb <- J*crossprod(aggregate(y~group, FUN=mean)$y-mean(y))
  curve(dgamma(x, (I-1)/2, ssb/2), to=1, ylim=c(0,2.6), lwd=2, xlab=NA, ylab=NA, axes=FALSE)
})
axis(1)
lines(density(1/with(sims2, J*sigma2b+sigma2w)), col="red", lwd=2)
```

## Confidence interval about the overall mean

### Frequentist method

There's a [straightforward way to get a confidence interval](http://stla.github.io/stlapblog/posts/ModelReduction.html) about the overall mean $\mu$: just take the group means and draw a usual Gaussian confidence interval about their mean:

```{r}
by(dat, dat$data, function(data){
  aggregate(y~group, data=data, FUN=mean) %>% lm(y~1, .) %>% confint
})
```
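This works because, under the model, the group means $\bar y_{i\bullet}$ are iid ${\cal N}\left(\mu, \sigma^2_b + \sigma^2_w/J\right)$. As a quick check — a minimal sketch, nothing more — the same intervals can be obtained with a plain one-sample `t.test` on the $I$ group means:

```{r}
# same intervals via a one-sample t-test on the group means
by(dat, dat$data, function(data){
  groupmeans <- aggregate(y~group, data=data, FUN=mean)$y
  t.test(groupmeans)$conf.int
})
```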
### Bayesian method

Throughout the article we always use equi-tailed credible intervals. For the overall mean $\mu$, we simply take the quantiles of its posterior simulations:

```{r}
quantile(sims["data1"]$mu, c(2.5,97.5)/100)
quantile(sims["data2"]$mu, c(2.5,97.5)/100)
```

For the first dataset, the Bayesian interval is close to the frequentist interval. We can see why: when $R(SS_w,SS_b)=0$, the posterior distribution of $\eta_b$ is $\eta_b \sim {\cal IG}\left(\frac{I-1}{2},\frac{SS_b}{2}\right)$, and then the posterior distribution of $\mu$ is a non-standardized Student distribution with $I-1$ degrees of freedom, location parameter $\bar y_{\bullet\bullet}$, and scale $\sqrt{\frac{SS_b}{IJ(I-1)}}$. Therefore the Bayesian interval *exactly* coincides with the frequentist interval when $R(SS_w,SS_b)=0$, and this is the case for the first dataset. Otherwise, when $R(SS_w,SS_b)>0$, the Bayesian interval is wider. *Yes, probabilists, I know $R(SS_w,SS_b)$ is never exactly $0$!*

## Confidence interval about variance components

### Frequentist method

The $\chi^2$ distribution of $SS_w$ makes it easy to derive a confidence interval about the within-variance $\sigma^2_w$; by the way, it is the same interval as in the fixed effects model. Things are more complicated for the between-variance $\sigma^2_b$ and the total variance $\sigma^2_w+\sigma^2_b$. The function `confintVC` below provides confidence intervals following the method given by [Burdick & al][BBM], based on [Graybill & Wang][GW]'s confidence intervals for linear combinations of $\chi^2$ distributions.

```{r}
confintVC <- function(y, group, alpha=5/100){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group))==length(y))){ stop("balanced only!") }
  J <- length(y)/I
  DFb <- I-1     # between degrees of freedom
  DFw <- I*(J-1) # within degrees of freedom
  SS <- SOS(y, group) # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  MSSb <- SSb/DFb; MSSw <- SSw/DFw # mean sums of squares
  # Confidence intervals
  a <- alpha/2
  ## Within variance confidence interval
  sigma2w <- MSSw # estimate
  withinLCB <- sigma2w/qf(1-a,DFw,Inf) # within lwr
  withinUCB <- sigma2w/qf(a,DFw,Inf)   # within upr
  ## Between variance confidence interval
  sigma2b <- (MSSb-MSSw)/J # estimate
  G1 <- 1-(1/qf(1-a,DFb,Inf))
  G2 <- 1-(1/qf(1-a,DFw,Inf))
  H1 <- (1/qf(a,DFb,Inf))-1
  H2 <- (1/qf(a,DFw,Inf))-1
  G12 <- ((qf(1-a,DFb,DFw)-1)^2-(G1^2*qf(1-a,DFb,DFw)^2)-(H2^2))/qf(1-a,DFb,DFw)
  H12 <- ((1-qf(a,DFb,DFw))^2-H1^2*qf(a,DFb,DFw)^2-G2^2)/qf(a,DFb,DFw)
  Vu <- H1^2*MSSb^2+G2^2*MSSw^2+H12*MSSb*MSSw
  Vl <- G1^2*MSSb^2+H2^2*MSSw^2+G12*MSSw*MSSb
  betweenLCB <- (MSSb-MSSw-sqrt(Vl))/J # between lwr
  betweenUCB <- (MSSb-MSSw+sqrt(Vu))/J # between upr
  ## Total variance confidence interval
  sigma2tot <- sigma2w+sigma2b # estimate
  totalLCB <- sigma2tot - (sqrt(G1^2*MSSb^2+G2^2*(J-1)^2*MSSw^2)/J) # total lwr
  totalUCB <- sigma2tot + (sqrt(H1^2*MSSb^2+H2^2*(J-1)^2*MSSw^2)/J) # total upr
  # Output
  out <- data.frame(
    estimate=c(sigma2w, sigma2b, sigma2tot),
    lwr=c(withinLCB, betweenLCB, totalLCB),
    upr=c(withinUCB, betweenUCB, totalUCB))
  rownames(out) <- c("within", "between", "total")
  return(out)
}
```

```{r, prompt=FALSE}
with(dat1, confintVC(y, group))
with(dat2, confintVC(y, group))
```

The same estimates are provided by `lmer(y ~ (1|group))` with the `lme4` package for mixed models.
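If you want to check this, here is a minimal sketch (not evaluated here; it assumes the `lme4` package is installed): fit the random-intercept model and square the standard deviations reported by `VarCorr`. For balanced data, the REML estimates coincide with the ANOVA estimates above as long as the between mean square is larger than the within mean square.

```{r, eval=FALSE}
# sketch: variance components from lme4, to compare with confintVC
library(lme4)
fit1 <- lmer(y ~ (1|group), data=dat1)
VarCorr(fit1) # standard deviations: square them to get sigma2b and sigma2w
fit2 <- lmer(y ~ (1|group), data=dat2)
VarCorr(fit2)
```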
### Bayesian method

First, let us construct the simulations of the total variance:

```{r, results='hide'}
sims[, "sigma2tot":=sigma2w+sigma2b]
```

Let's see the estimates before looking at the credible intervals:

```{r}
sapply(sims["data1", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) c(mean=mean(x), median=median(x)))
```

```{r}
sapply(sims["data2", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) c(mean=mean(x), median=median(x)))
```

Clearly, the median is a better estimate than the mean. Now let's have a look at the credible intervals. For the first dataset, the credible interval about the within-variance is the same as the frequentist confidence interval, as expected, and it is interesting to see that the credible intervals are close to the frequentist confidence intervals for the other variances:

```{r}
sapply(sims["data1", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) quantile(x, c(2.5,97.5)/100))
```

For the second dataset, we do not expect results close to the frequentist ones. We get a shorter interval for the within-variance, and wider intervals for the other variances:

```{r}
sapply(sims["data2", c("sigma2w", "sigma2b", "sigma2tot"), with=FALSE],
       function(x) quantile(x, c(2.5,97.5)/100))
```

## Confidence interval about the coefficient of total variation

### Frequentist method

A confidence interval about the coefficient of total variation $\dfrac{\sqrt{\sigma^2_w+\sigma^2_b}}{\mu}$ is provided by the function `ranovaCV` below. The method is based on the generalized $p$-value approach, using a generalized pivotal quantity. The procedure performed by this function is an original one: I devised it myself, without any theoretical justification, by naively following a procedure given in [Krishnamoorthy & Mathew][KM]'s book for another parameter of interest.

```{r}
ranovaCV <- function(y, group, level=0.95, nsims=100000){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group))==length(y))){ stop("balanced only!") }
  J <- length(y)/I
  SS <- SOS(y, group) # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  Z <- rnorm(nsims)
  Ub <- rchisq(nsims, I-1)
  Ue <- rchisq(nsims, I*(J-1)) # note: J*(I-1) in the book
  mu.gpq <- mean(y)-Z/sqrt(Ub)*sqrt(SSb/I/J)
  sigmatot.gpq <- 1/sqrt(J)*sqrt(SSb/Ub+(J-1)*SSw/Ue)
  G <- sigmatot.gpq/mu.gpq
  low <- as.numeric(quantile(G,(1-level)/2))      # lower bound
  up <- as.numeric(quantile(G,level+(1-level)/2)) # upper bound
  return(list(lwr=low, upr=up))
}
```

```{r}
with(dat1, ranovaCV(y, group)) # true value: sqrt(5)/10 ≃ 0.224
with(dat2, ranovaCV(y, group)) # true value: sqrt(2)/10 ≃ 0.1414
```

```{r coverageCV, cache=TRUE, eval=FALSE, include=FALSE}
mu <- 10
sigmaw <- 1
sigmab <- 1
CV <- sqrt(sigmaw^2+sigmab^2)/mu
nsims <- 3000
test <- rep(NA, nsims)
for(i in 1:nsims){
  bounds <- data.frame(sapply(rnorm(I,mu,sigmab), function(m) rnorm(J,m,sigmaw))) %>%
    stack %>% setNames(c("y", "group")) %>% with(ranovaCV(y, group))
  test[i] <- CV > bounds$lwr & CV < bounds$upr
}
mean(test)
```

### Bayesian method

As for any other parameter of interest, one first constructs the simulations of the coefficient of total variation:

```{r, results='hide'}
sims[, "CV":=sqrt(sigma2tot)/mu]
```

and then one takes the quantiles:

```{r}
sims[, list(lwr=quantile(CV, 2.5/100), upr=quantile(CV, 97.5/100)), by="data"]
```

## Prediction interval

### Frequentist method

This method is given in [Lin & Liao][LL]. Denote by $y^{\text{new}}$ a new observation.
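In what follows, $\tau^2$ denotes the variance of a group mean:
$$\tau^2 = \sigma^2_b + \frac{\sigma^2_w}{J};$$
this is the definition under which the fact $SS_b \sim J\tau^2\chi^2_{I-1}$, used below, holds, since the group means $\bar y_{i\bullet}$ are iid ${\cal N}(\mu, \tau^2)$ and $SS_b = J\sum_i (\bar y_{i\bullet}-\bar y_{\bullet\bullet})^2$.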
Note that
$$ y^{\text{new}} - \bar{y}_{\bullet\bullet} \sim {\cal N}\left(0, \sigma^2_{\text{tot}} + \frac{\tau^2}{I}\right),$$
and the variance of $(y^{\text{new}} - \bar{y}_{\bullet\bullet})$ can be written
$$ \left(1+\frac{1}{I}\right)\tau^2 + \left(1-\frac{1}{J}\right)\sigma^2_w. $$
Then, knowing that $SS_w \sim \sigma^2_w\chi^2_{I(J-1)}$ and $SS_b \sim J\tau^2\chi^2_{I-1}$, an unbiased estimate of the variance of $(y^{\text{new}} - \bar{y}_{\bullet\bullet})$ is
$$ \left(1+\frac{1}{I}\right)\frac{SS_b}{J(I-1)} + \left(1-\frac{1}{J}\right)\frac{SS_w}{I(J-1)} = \frac{I+1}{I(I-1)}\frac{SS_b}{J} + \frac{SS_w}{IJ}. $$
This estimate does not follow a (scaled) Chi-square distribution, but we treat it as approximately Chi-square with the Satterthwaite degrees of freedom $\nu$ (computed in the code below). In addition, since we know (or [as shown here](http://stla.github.io/stlapblog/posts/Anova1random.html)) that $SS_b$, $SS_w$ and $(y^{\text{new}} - \bar{y}_{\bullet\bullet})$ are independent,
$$ \dfrac{y^{\text{new}} - \bar{y}_{\bullet\bullet}}{\sqrt{\dfrac{I+1}{I(I-1)}\dfrac{SS_b}{J} + \dfrac{SS_w}{IJ}}} \approx \mathrm{t}_\nu, $$
from which we get a prediction interval about $y^{\text{new}}$.

```{r}
ranovapred <- function(y, group, conf=0.95){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group))==length(y))){ stop("balanced only!") }
  J <- length(y)/I
  SS <- SOS(y, group) # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  a <- (1/J*(1+1/I))/(I-1)
  b <- 1/I/J
  v <- a*SSb+b*SSw # estimates the variance of (Ynew-Ybar)
  nu <- v^2/((a*SSb)^2/(I-1)+(b*SSw)^2/I/(J-1)) # Satterthwaite degrees of freedom
  alpha <- 1-conf
  bounds <- mean(y) + c(-1,1)*sqrt(v)*qt(1-alpha/2,nu)
  return(list(bounds=bounds, std.error=sqrt(v)))
}
```

```{r}
by(dat, dat$data, function(data){
  with(data, ranovapred(y, group)$bounds)
})
```

### Bayesian method

The Bayesian prediction interval is obtained by taking the quantiles of the posterior predictive distribution. One first simulates this distribution:

```{r, results='hide'}
sims[, ynew:=rnorm(nrow(sims), mu, sqrt(sigma2tot))]
```

One takes the quantiles:

```{r}
sims[, list(lwr=quantile(ynew, 2.5/100), upr=quantile(ynew, 97.5/100)), by="data"]
```

For both datasets, the Bayesian prediction interval is wider than the frequentist one.

## One-sided tolerance interval

We provide methods to get lower and upper tolerance limits. Recall that a one-sided tolerance limit is nothing but a lower/upper confidence bound about a quantile. We take $90\%$ content as an example: the lower tolerance limit is a lower confidence bound about the $10\%$ quantile, and the upper tolerance limit is an upper confidence bound about the $90\%$ quantile. The theoretical values of these quantiles are:

```{r}
# first dataset
qnorm(c(10,90)/100, 10, sqrt(5))
# second dataset
qnorm(c(10,90)/100, 10, sqrt(2))
```

### Frequentist method

The `ranovatol` function computes the tolerance limits given in [Krishnamoorthy & Mathew][KM]'s book.

```{r}
ranovatol <- function(y, group, p=0.9, conf=0.95, nsims=50000){
  group <- factor(group)
  I <- length(levels(group))
  if(!all(I*as.numeric(table(group))==length(y))){ stop("balanced only!") }
  J <- length(y)/I
  SS <- SOS(y, group) # sums of squares
  SSb <- SS$SSb; SSw <- SS$SSw
  Z <- rnorm(nsims)
  Ub <- rchisq(nsims, I-1)
  Ue <- rchisq(nsims, I*(J-1)) # note: J*(I-1) in K & M's book
  G <- mean(y)-Z/sqrt(Ub)*sqrt(SSb/I/J)-qnorm(p)/sqrt(J)*sqrt(SSb/Ub+(J-1)*SSw/Ue)
  low <- as.numeric(quantile(G,1-conf))
  G <- mean(y)+Z/sqrt(Ub)*sqrt(SSb/I/J)+qnorm(p)/sqrt(J)*sqrt(SSb/Ub+(J-1)*SSw/Ue)
  up <- as.numeric(quantile(G,conf))
  return(c(lower=low, upper=up))
}
```

```{r}
by(dat, dat$data, function(data){
  with(data, ranovatol(y, group))
})
```

### Bayesian method

We proceed as for any other parameter of interest. First:

```{r, results='hide'}
sims[, c("tol_lower","tol_upper"):=list(qnorm(.1, mu, sqrt(sigma2tot)),
                                        qnorm(.9, mu, sqrt(sigma2tot)))]
```

then:

```{r}
sims[, list(lower=quantile(tol_lower, 5/100), upper=quantile(tol_upper, 95/100)), by="data"]
```

## What about the unbalanced situation?

I have never tried it, but replacing the sums of squares $SS_w$ and $SS_b$ everywhere with the *unweighted sums of squares* should give good results in the unbalanced case. This is known to work well for variance components (see [Burdick & al][BBM]).

[KM]: / "Kalimuthu Krishnamoorthy, Thomas Mathew. *Statistical Tolerance Regions: Theory, Applications, and Computation*. Wiley, 2009."
[BT]: / "George E.P. Box, George C. Tiao. *Bayesian Inference in Statistical Analysis*. Wiley, 1992."
[LL]: / "T.-Y. Lin, C.-T. Liao. *Prediction intervals for general balanced linear random models*. Journal of Statistical Planning and Inference 138, 3164-3175, 2008."
[BBM]: / "R.K. Burdick, C.M. Borror, D.C. Montgomery. *Design and Analysis of Gauge R&R Studies*. ASA-SIAM Series on Statistics and Applied Probability, 2005."
[GW]: / "F.A. Graybill, C.-M. Wang. *Confidence intervals on nonnegative linear combinations of variances*. Journal of the American Statistical Association 75:372, 869–873, 1980."
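For what it's worth, here is an untested sketch of what this could look like. It is only my reading of the *unweighted sums of squares* idea — the group means are averaged without weights, and the harmonic mean of the group sizes plays the role of $J$ — chosen so that it reduces to the `SOS` function above in the balanced case. Whether this is exactly the recipe of [Burdick & al][BBM] would have to be checked.

```{r, eval=FALSE}
# untested sketch: "unweighted" sums of squares for unbalanced data
uSOS <- function(y, group){
  group <- factor(group)
  ni <- as.numeric(table(group))        # group sizes
  groupmeans <- tapply(y, group, mean)  # group means
  I <- length(ni)
  Jh <- I/sum(1/ni)                     # harmonic mean of the group sizes (plays the role of J)
  SSw <- sum((y - groupmeans[group])^2)            # within sum of squares, as before
  SSb <- Jh*sum((groupmeans - mean(groupmeans))^2) # unweighted between sum of squares
  list(SSw=SSw, SSb=SSb, J=Jh)
}
```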