Chapter 3 Batch effect adjustment

3.1 Accounting for batch effects

Methods that account for batch effects estimate unknown batch effects through matrix decomposition and / or assign a known or estimated batch as a covariate with linear models.

3.1.1 Linear model and linear mixed model

LM and LMM are suitable for known batch effects, and can consider batch x treatment interaction and deal with unbalanced batch x treatment design. But they are univariate and rely on a Gaussian likelihood assumption, which may not apply to zero-inflated microbiome data despite CLR transformation.

We fit a linear model with effect of interest and batch effect for each OTU in both sponge and AD data. The P-value for the regression coefficient associated with the effect of interest in a linear model is then extracted and multiple testing corrected with “FDR”.

# Sponge data
sponge.trt_p <- apply(sponge.tss.clr, 2, FUN = function(x){
  res.lm <- lm(x ~ sponge.trt + sponge.batch)
  summary.res <- summary(res.lm)
  p <- summary.res$coefficients[2,4]
})

sponge.trt_adjp <- p.adjust(sponge.trt_p, method = 'fdr')

# AD data
ad.trt_p <- apply(ad.clr, 2, FUN = function(x){
  res.lm <- lm(x ~ ad.trt + ad.batch)
  summary.res <- summary(res.lm)
  p <- summary.res$coefficients[2,4]
})

ad.trt_adjp <- p.adjust(ad.trt_p, method = 'fdr')

As the batch x treatment design of HD data is unbalaced, we fit a linear mixed model considering batch (cage) as random effects.

# HD data
hd.trt_p <- apply(hd.clr, 2, FUN = function(x){
  res.lmm <- lmer(x ~ hd.trt + (1|hd.batch))
  summary.res <- summary(res.lmm)
  p <- summary.res$coefficients[2,5]
})

hd.trt_adjp <- p.adjust(hd.trt_p, method = 'fdr')

3.1.2 SVA

SVA can account for unknown batch effects, but is a univariate approach that relies on a Gaussian likelihood assumption and implicitly introduces a correlation between treatment and batch.

The sva function performs two different steps. First it identifies the number of latent factors that need to be estimated. The number of factors can be estimated using the function num.sv.

We first fit a full model with effect of interest and a null model with no effect, and use the full model in the function num.sv to estimate the number of latent factors.

# sponge data
sponge.mod <- model.matrix( ~ sponge.trt) # full model
sponge.mod0 <- model.matrix( ~ 1,data = sponge.trt) # null model
sponge.sva.n <- num.sv(dat = t(sponge.tss.clr), mod = sponge.mod)

Next we apply the sva function to estimate the surrogate variables with both full and null models:

sponge.sva <- sva(dat = t(sponge.tss.clr), mod = sponge.mod, 
                 mod0 = sponge.mod0, n.sv = sponge.sva.n)
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

We then include the estimated surrogate variables in both the null and full models. The reason is that we want to adjust for the surrogate variables, so we treat them as adjustment variables that must be included in both models. The f.pvalue function is then used to calculate parametric F-test P-values and Q-values (adjusted P-values) for each OTU of sponge data.

sponge.mod.bat <- cbind(sponge.mod, sponge.sva$sv)
sponge.mod0.bat <- cbind(sponge.mod0, sponge.sva$sv)

sponge.sva.trt_p <- f.pvalue(t(sponge.tss.clr), sponge.mod.bat, sponge.mod0.bat)
sponge.sva.trt_adjp <- p.adjust(sponge.sva.trt_p, method='fdr')

Now these P-values and Q-values are accounting for surrogate variables (estimated batch effects).

We also apply SVA on both AD and HD data.

# ad data
ad.mod <- model.matrix( ~ ad.trt)
ad.mod0 <- model.matrix( ~ 1, data = ad.trt)
ad.sva.n <- num.sv(dat = t(ad.clr), mod = ad.mod)
ad.sva <- sva(t(ad.clr), ad.mod, ad.mod0, n.sv = ad.sva.n)
## Number of significant surrogate variables is:  6 
## Iteration (out of 5 ):1  2  3  4  5
ad.mod.bat <- cbind(ad.mod, ad.sva$sv)
ad.mod0.bat <- cbind(ad.mod0, ad.sva$sv)
ad.sva.trt_p <- f.pvalue(t(ad.clr), ad.mod.bat, ad.mod0.bat)
ad.sva.trt_adjp <- p.adjust(ad.sva.trt_p, method = 'fdr')

# hd data
hd.mod <- model.matrix( ~ hd.trt)
hd.mod0 <- model.matrix( ~ 1,data = hd.trt)
hd.sva.n <- num.sv(dat = t(hd.clr), mod = hd.mod)
hd.sva <- sva(t(hd.clr), hd.mod, hd.mod0, n.sv = hd.sva.n)
## Number of significant surrogate variables is:  3 
## Iteration (out of 5 ):1  2  3  4  5
hd.mod.bat <- cbind(hd.mod, hd.sva$sv)
hd.mod0.bat <- cbind(hd.mod0, hd.sva$sv)
hd.sva.trt_p <- f.pvalue(t(hd.clr), hd.mod.bat, hd.mod0.bat)
hd.sva.trt_adjp <- p.adjust(hd.sva.trt_p, method = 'fdr')

3.1.3 RUV2

RUV2 estimates and accounts for unknown batch effects, but the method requires negative control variables that are affected by batch effects but not treatment effects.

Practically, negative control variables are chosen prior during the experimental design so that they are not affected by treatments (and assume they are affected by batch effects). RUV2 can only account for the difference captured by these controls.

Since our three datasets do not have negative control variables, we use a linear model (or linear mixed model) to identify OTUs that are not affected by treatment, those will be considered as pseudo negative control variables to fit the assumptions of RUV2. The P-values of treatment effects calculated in section ‘linear model and linear mixed model’ are therefore used here.

Note: This analysis is not optimal as we use pseudo negative control variables, but we wish to provide this as a practical example.

# sponge data
sponge.nc <- sponge.trt_adjp > 0.05

# AD data
ad.nc <- ad.trt_adjp > 0.05

# HD data
hd.nc <- hd.trt_p > 0.05

Note: In HD data, there is no OTU having statistically siginificant difference between genotypes. We use P-values instead of adjusted P-values to set the negative controls.

We then apply the RUV2 function. This function needs to specify the number of unwanted factors (components of negative controls), which we arbitrarily set to k = 3. We then extract the P-values of treatment effects considering unwanted batch effects after FDR adjustment.

# sponge data
sponge.ruv2 <- RUV2(Y = sponge.tss.clr, X = sponge.trt, 
                    ctl = sponge.nc, k = 3) # k is subjective
sponge.ruv2.trt_p <- sponge.ruv2$p
sponge.ruv2.trt_adjp <- p.adjust(sponge.ruv2.trt_p, method="fdr")

# AD data
ad.ruv2 <- RUV2(Y = ad.clr, X = ad.trt, 
                ctl = ad.nc, k = 3) # k is subjective
ad.ruv2.trt_p <- ad.ruv2$p
ad.ruv2.trt_adjp <- p.adjust(ad.ruv2.trt_p, method="fdr")

# HD data
hd.ruv2 <- RUV2(Y = hd.clr, X = hd.trt, 
                ctl = hd.nc, k = 3) # k is subjective
hd.ruv2.trt_p <- hd.ruv2$p
hd.ruv2.trt_adjp <- p.adjust(hd.ruv2.trt_p, method="fdr")

3.1.4 RUV4

RUV4 is an updated version of RUV2 that uses negative control variables and the residual matrix that has no treatment effect to estimate unwanted batch effects.

RUV4 also requires to specify the number of unwanted factors as RUV2, which we estimate with the function ‘getK’ (from the ruv package). This function is only for RUV4 and the estimated k is not always suitable. If the estimated k is 0, k can be set to 1 instead to account for unwanted variation captured from negative controls.

# sponge data
sponge.k.obj <- getK(Y = sponge.tss.clr, X = sponge.trt, ctl = sponge.nc)
sponge.k <- sponge.k.obj$k
sponge.k <- ifelse(sponge.k !=0, sponge.k, 1)
sponge.ruv4 <- RUV4(Y = sponge.tss.clr, X = sponge.trt, ctl = sponge.nc, k = sponge.k) 
sponge.ruv4.trt_p <- sponge.ruv4$p
sponge.ruv4.trt_adjp <- p.adjust(sponge.ruv4.trt_p, method="fdr")

# AD data
ad.k.obj <- getK(Y = ad.clr, X = ad.trt, ctl = ad.nc)
ad.k <- ad.k.obj$k
ad.k <- ifelse(ad.k !=0, ad.k, 1)
ad.ruv4 <- RUV4(Y = ad.clr, X = ad.trt, ctl = ad.nc, k = ad.k) 
ad.ruv4.trt_p <- ad.ruv4$p
ad.ruv4.trt_adjp <- p.adjust(ad.ruv4.trt_p, method="fdr")

# HD data
hd.k.obj <- getK(Y = hd.clr, X = hd.trt, ctl = hd.nc)
hd.k <- hd.k.obj$k
hd.k <- ifelse(hd.k !=0, hd.k, 1)
hd.ruv4 <- RUV4(Y = hd.clr, X = hd.trt, ctl = hd.nc, k = hd.k) 
hd.ruv4.trt_p <- hd.ruv4$p
hd.ruv4.trt_adjp <- p.adjust(hd.ruv4.trt_p, method="fdr")

3.2 Correcting for batch effects

An alternative approach to manage batch effects is to remove batch effects from the original microbiome data, then use the corrected data in any subsequent data analysis. Compared with methods accounting for batch effects, batch effect correction methods are practical and enable broader application in a variety of analyses. However, the methods do not consider the correlation (linear) and interaction (non-linear) between batch and treatment effects and may result in over-adjusted batch effects, statistical power loss, and an inability to detect variables associated with the treatment effect.

3.2.1 BMC (batch mean centering)

We center data within a batch across all variables. As a result, each batch mean is standardised to zero. BMC has several disadvantages: it is a univariate method and it is not optimal for non-Gaussian distributed microbiome data.

# Sponge data
sponge.b1 <- scale(sponge.tss.clr[sponge.batch == 1, ], center = TRUE, scale = FALSE)
sponge.b2 <- scale(sponge.tss.clr[sponge.batch == 2, ], center = TRUE, scale = FALSE)
sponge.bmc <- rbind(sponge.b1, sponge.b2)
sponge.bmc <- sponge.bmc[rownames(sponge.tss.clr), ]
# AD data
ad.b1 <- scale(ad.clr[ad.batch == '09/04/2015', ], center = TRUE, scale = FALSE)
ad.b2 <- scale(ad.clr[ad.batch == '14/04/2016', ], center = TRUE, scale = FALSE)
ad.b3 <- scale(ad.clr[ad.batch == '14/11/2016', ], center = TRUE, scale = FALSE)
ad.b4 <- scale(ad.clr[ad.batch == '01/07/2016', ], center = TRUE, scale = FALSE)
ad.b5 <- scale(ad.clr[ad.batch == '21/09/2017', ], center = TRUE, scale = FALSE)
ad.bmc <- rbind(ad.b1, ad.b2, ad.b3, ad.b4, ad.b5)
ad.bmc <- ad.bmc[rownames(ad.clr), ]

3.2.2 ComBat

ComBat requires known and systematic batch effects. If the treatment information is known, the option mod can be fitted with a full model having treatment informationn to efficiently maintain enough treatment variation (see examples below for sponge and AD data). The mod can also be NULL if the treatment information is unknown.

# Sponge data
sponge.combat <- t(ComBat(t(sponge.tss.clr), batch = sponge.batch, 
                          mod = sponge.mod, par.prior = F, prior.plots = F))
## Found2batches
## Adjusting for1covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding nonparametric adjustments
## Adjusting the Data
# AD data
ad.combat <- t(ComBat(t(ad.clr), batch = ad.batch, 
                      mod = ad.mod, par.prior = F, prior.plots = F))
## Found5batches
## Adjusting for1covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding nonparametric adjustments
## Adjusting the Data

3.2.3 removeBatchEffect

removeBatchEffect is a function implemented in the LIMMA package that fits a linear model for each variable given a series of conditions as explanatory variables, including the batch effect and treatment effect. Contrary to a standard linear or linear mixed models (see section ‘linear model and linear mixed model’) that simultaneously estimate treatment and batch effects, removeBatchEffect subtracts the batch effect from the original data, resulting in a residual matrix that contains any and only treatment effect. The option design is the same as option mod in ComBat.

# Sponge data
sponge.limma <- t(removeBatchEffect(t(sponge.tss.clr), batch = sponge.batch, 
                                    design = sponge.mod))
ad.limma <- t(removeBatchEffect(t(ad.clr), batch = ad.batch, 
                                design = ad.mod))

3.2.4 FAbatch

FAbatch is a combination of location-scale adjustment and factor analysis. We fit y with the treatment information and batch with batch information. Since this method only accepts numeric variables, the levels of variables are changed to be numeric (e.g. ‘1’, ‘2’ and so on). However, on our example, FAbatch did not converge, potentially leading to suboptimal results. We therefore did not compare the results from FAbatch with those from other methods.

# sponge data
sponge.fabatch.obj <- fabatch(x = sponge.tss.clr, 
                             y = as.factor(as.numeric(sponge.trt)), 
                             batch = sponge.batch)
sponge.fabatch <- sponge.fabatch.obj$xadj

# ad data
ad.fabatch.obj <- fabatch(x = ad.clr, 
                         y = as.factor(as.numeric(ad.trt)), 
                         batch = as.factor(as.numeric(ad.batch)))
ad.fabatch <- ad.fabatch.obj$xadj

3.2.5 Percentile normalisation

Percentile normalisation (PN) was developed for microbiome data integration. For each batch, it transforms the relative abundance of control samples to their own percentiles while converting the relative abundance of case samples into the percentiles of their corresponding control distribution. The method thus only applies to case-control studies, and may lose a large amount of information as it uses percentiles rather than the original values.

Note: PN applies on TSS scaled data (Gibbons, Duvallet, and Alm 2018). As we added an offset on sponge TSS scaled data, we re-applied the TSS.

# sponge data
sponge.tss <- t(apply(sponge.tss, 1, function(x){x/sum(x)}))
sponge.percentile <- percentile_norm(data = sponge.tss, batch = sponge.batch, 
                                    trt = sponge.trt)


# ad data
# TSS
ad.tss <- t(apply(ad.count.keep, 1, function(x){x/sum(x)}))
ad.percentile <- percentile_norm(data = ad.tss, batch = ad.batch, 
                                trt = ad.trt)

3.2.6 SVD-based method

We center and scale the data before SVD. After SVD, we deflate the first component, which has the highest variation and is assumed to be related to batch effects.

# sponge data
sponge.sd <- apply(sponge.tss.clr, 2, sd) # calculate standard deviation
sponge.mean <- apply(sponge.tss.clr, 2, mean) # calculate mean
sponge.X <- scale(sponge.tss.clr, center = T, scale = T) # center and scale

sponge.m <- crossprod(sponge.X) # generate a square matrix
sponge.m.svd <- svd(sponge.m) # SVD 
# barplot(sponge.m.svd$d)

# extract 1st singular vectors
sponge.a1 <- sponge.m.svd$u[ ,1] 
sponge.b1 <- sponge.m.svd$v[ ,1]

# deflate component 1 from the data
sponge.t1 <- sponge.X %*% sponge.a1 / drop(sqrt(crossprod(sponge.a1)))
sponge.c1 <- crossprod(sponge.X, sponge.t1) / drop(crossprod(sponge.t1))
sponge.svd.defl.matrix1  <- sponge.X - sponge.t1 %*% t(sponge.c1)

# add back mean and standard deviation
sponge.svd <- sponge.svd.defl.matrix1
sponge.svd[1:nrow(sponge.svd), 1:ncol(sponge.svd)] = NA
for(i in 1:ncol(sponge.svd.defl.matrix1)){
  for(j in 1:nrow(sponge.svd.defl.matrix1)){
    sponge.svd[j,i] = sponge.svd.defl.matrix1[j,i]*sponge.sd[i] + sponge.mean[i]
  }
}
# ad data
ad.sd <- apply(ad.clr, 2, sd) # calculate standard deviation
ad.mean <- apply(ad.clr, 2, mean) # calculate mean
ad.X <- scale(ad.clr,center = T, scale = T) # center and scale

ad.m <- crossprod(ad.X) # generate a square matrix
ad.m.svd <- svd(ad.m) # SVD 
# barplot(ad.m.svd$d)

# extract 1st singular vectors
ad.a1 <- ad.m.svd$u[ ,1]
ad.b1 <- ad.m.svd$v[ ,1]

# deflate component 1 from the data
ad.t1 <- ad.X %*% ad.a1 / drop(sqrt(crossprod(ad.a1)))
ad.c1 <- crossprod(ad.X, ad.t1) / drop(crossprod(ad.t1))
ad.svd.defl.matrix1  <- ad.X - ad.t1 %*% t(ad.c1)

# add back mean and standard deviation
ad.svd <- ad.svd.defl.matrix1
ad.svd[1:nrow(ad.svd), 1:ncol(ad.svd)] = NA
for(i in 1:ncol(ad.svd.defl.matrix1)){
  for(j in 1:nrow(ad.svd.defl.matrix1)){
    ad.svd[j,i] = ad.svd.defl.matrix1[j,i]*ad.sd[i] + ad.mean[i]
  }
}

3.2.7 RUVIII

RUVIII needs not only negative control variables as RUV2 and RUV4, but also technical sample replicates. As only AD data include sample replicates, we illustrate RUVIII on these data only.

In contrast to RUV2 and RUV4, RUVIII is a multivariate method that accounts for the dependency between microbial variables, but it is currently limited to the correction of technical and computational sources of batch effects.

Note: RUVIII requires the number of negative control variables to be larger than the sample size in order to fully use the sample information.

# ad data only
ad.replicates <- ad.metadata$sample_name.data.extraction
ad.replicates.matrix <- replicate.matrix(ad.replicates)

ad.ruvIII <- RUVIII(Y = ad.clr, M = ad.replicates.matrix, ctl = ad.nc)
rownames(ad.ruvIII) <- rownames(ad.clr)

Bibliography

Gibbons, Sean M, Claire Duvallet, and Eric J Alm. 2018. “Correcting for Batch Effects in Case-Control Microbiome Studies.” PLoS Computational Biology 14 (4). Public Library of Science: e1006102.