Chapter 1 Examples of microbiome studies with batch effects

This vignette provides all the analyses performed in the paper ‘Managing Batch Effects in Microbiome Data’ by Yiwen Wang and Kim-Anh Lê Cao.

Packages installation and loading

First, you will need to install then load the following packages from the CRAN and Bioconductor:

# cran.packages <- c('knitr', 'mixOmics', 'xtable', 'ggplot2', 'vegan', 'cluster',
#                   'gridExtra', 'pheatmap', 'ruv', 'lmerTest', 'bapred')
# install.packages(cran.packages)
# bioconductor.packages <- c('sva', 'limma', 'AgiMicroRna', 
#                           'variancePartition', 'pvca')
# if (!requireNamespace('BiocManager', quietly = TRUE))
#     install.packages('BiocManager')
# BiocManager::install(bioconductor.packages, version = '3.8')

library(knitr)
library(xtable) # table
library(mixOmics)
library(sva) # ComBat
library(ggplot2) # PCA sample plot with density
library(gridExtra) # PCA sample plot with density
library(limma) # removeBatchEffect (LIMMA)
library(vegan) # RDA
library(AgiMicroRna) # RLE plot
library(cluster) # silhouette coefficient
library(variancePartition) # variance calculation
library(pvca) # PVCA
library(pheatmap) # heatmap
library(ruv) # RUVIII
library(lmerTest) # lmer
library(bapred) # FAbatch

1.1 Study description

1.1.1 Sponge Aplysina aerophoba study

Sacristán-Soriano et al. studied the potential involvement of bacterial communities from the sponge species A. aerophoba in the biosynthesis of brominated alkaloids (BAs) (Sacristán-Soriano et al. 2011). They compared the microbial composition and BA concentration in two different tissues (ectosome and choanosome) to investigate the relationship between bacterial composition and BA concentration. The authors concluded that differences in bacterial profiles were not only due to tissue variation (the main effect of interest), but also because the samples were run on two separate denaturing gradient gels during processing. Gel thus acted as a technical batch effect as described in Table 1 below.

Table 1. Overview of exemplar datasets with batch effects. We considered microbiome studies from sponge Aplysina aerophoba; organic matter in anaerobic digestion (AD) and mice models with Huntington’s disease (HD).

1.1.2 Anaerobic digestion study

Anaerobic Digestion (AD) is a microbiological process of organic matter degradation that produces a biogas used in electrical and thermal energy production. However, the AD bioprocess undergoes inhibition during its developmental stage that is not well characterised: Chapleur et al. explored microbial indicators that could improve the AD bioprocess’s efficacy and prevent its failure (Chapleur et al. 2016). They profiled the microbiota of 75 AD samples in various conditions. Here we consider two different ranges of phenol concentration as treatments. The experiment was conducted at different dates (5), which constitutes a technical source of unwanted variation (Table 1).

1.1.3 Huntington’s disease study

In their study, Kong et al. reported differences in microbial composition between Huntington’s disease (HD) and wild-type (WT) mice (Kong et al. 2018). However, the establishment of microbial communities was also driven by biological batch effects: the cage environment and sex. Here we consider only female mice to illustrate a special case of a batch \(\times\) treatment unbalanced design. The HD data include 13 faecal mice samples hosted across 4 cages (Table 1).

We load the data and functions that are provided outside the packages.

# load the data
load(file = './datasets/microbiome_datasets.RData')

# load the extra functions
source(file = './Functions.R')
dim(sponge.tss)
## [1] 32 24
dim(ad.count)
## [1]  75 567
dim(hd.count)
## [1]  13 368

Note: the AD data and HD data loaded are raw counts, while sponge data are total sum scaling (TSS) scaled data calculated on raw counts, with no offset.

1.2 Data processing

Here are the processing steps for the raw count microbiome data:

  1. Prefilter the count data to remove features with excess zeroes across all samples
  2. Add an offset of 1 to the whole data matrix — note that this is not ideal but provides a pracitical way to handle zero counts.
  3. Log-ratio transformation with Centered Log Ratio (CLR)

1.2.1 Prefiltering

We use a prefiltering step to remove OTUs for which the sum of counts are below a set threshold (0.01%) compared to the total sum of all counts (Arumugam et al. 2011).

# ad data
ad.index.keep <- which(colSums(ad.count)*100/(sum(colSums(ad.count))) > 0.01)
ad.count.keep <- ad.count[, ad.index.keep]
dim(ad.count.keep)
## [1]  75 231
# hd data
hd.count.keep <- hd.count
dim(hd.count.keep)
## [1]  13 368

Note: The HD data only include 13 samples, which are a small part of a big dataset that has already been prefiltered. We retained all the OTUs in the data and did not redo the prefiltering again.

1.2.2 Adding offset

We need to add an offset of 1 to all count data to handle zeroes for the CLR transformation. As the sponge data were TSS scaled, a small offset is added in this specific case. According to scale invariance principle (Aitchison 1986), it returns the same results with CLR transformation on raw counts or TSS data. However, we recommend starting from the raw counts (not TSS) for those analyses.

# sponge data
sponge.tss <- sponge.tss + 0.01

# ad data
ad.count.keep <- ad.count.keep + 1

# hd data
hd.count.keep <- hd.count.keep + 1

1.2.3 Centered log-ratio transformation

Microbiome data are compostional and with different library sizes. Using standard statistical methods on such data may lead to spurious results and therefore the data must be further transformed. The CLR is the transformation of choice.

# sponge data
sponge.tss.clr <- logratio.transfo(sponge.tss, logratio = 'CLR')
class(sponge.tss.clr) <- 'matrix' 

# ad data
ad.clr <- logratio.transfo(ad.count.keep, logratio = 'CLR')
class(ad.clr) <- 'matrix' 

# hd data
hd.clr <- logratio.transfo(hd.count.keep, logratio = 'CLR')
class(hd.clr) <- 'matrix'

The final CLR data of sponge study, AD study and HD study contain 32 samples and 24 OTUs, 75 samples and 231 OTUs, 13 samples and 368 OTUs, respectively as described in Table 1.

Bibliography

Sacristán-Soriano, Oriol, Bernard Banaigs, Emilio O Casamayor, and Mikel A Becerro. 2011. “Exploring the Links Between Natural Products and Bacterial Assemblages in the Sponge Aplysina Aerophoba.” Applied and Environmental Microbiology 77 (3). Am Soc Microbiol: 862–70.

Chapleur, Olivier, Céline Madigou, Raphaël Civade, Yohan Rodolphe, Laurent Mazéas, and Théodore Bouchez. 2016. “Increasing Concentrations of Phenol Progressively Affect Anaerobic Digestion of Cellulose and Associated Microbial Communities.” Biodegradation 27 (1). Springer: 15–27.

Kong, Geraldine, Kim-Anh Lê Cao, Louise M Judd, ShanShan Li, Thibault Renoir, and Anthony J Hannan. 2018. “Microbiome Profiling Reveals Gut Dysbiosis in a Transgenic Mouse Model of Huntington’s Disease.” Neurobiology of Disease. Elsevier.

Arumugam, Manimozhiyan, Jeroen Raes, Eric Pelletier, Denis Le Paslier, Takuji Yamada, Daniel R Mende, Gabriel R Fernandes, et al. 2011. “Enterotypes of the Human Gut Microbiome.” Nature 473 (7346). Nature Publishing Group: 174.

Aitchison, John. 1986. The Statistical Analysis of Compositional Data. Chapman; Hall London.