Alpha diversity analysis: Measuring within-sample microbial diversity

Author

Ana Schaan (aschaan@ikmb.uni-kiel.de)

Preparation

This section focuses on understanding and analyzing alpha diversity, which reflects the diversity within individual microbial communities.
We will learn how to calculate different alpha diversity metrics using the phyloseq package and visualize the results using the ggplot2 package. We will also explore statistical approaches to compare alpha diversity across different groups of samples, giving you practical insights into microbiome data interpretation.

Loading essential libraries

Code
# 'phyloseq' -- An R package for reproducible interactive analysis and graphics of microbiome census data
library(phyloseq)
# 'vegan' -- Ordination methods, diversity analysis and other functions for community and vegetation ecologists
library(vegan) 
# 'ggplot2' -- Used for creating graphics and visualizations.
library(ggplot2) 
# 'ggpubr' -- Also used for creating graphics and visualizations.
library(ggpubr)
# 'cowplot' -- Also used for creating graphics and visualizations.
library(cowplot)
# 'dplyr' -- Used for data manipulation; contains dozens of very useful fuctions
library(dplyr)

Set working directory and import the files we will use for downstream analyses

Code
setwd("~/kmc_workshop")
path <- paste0("~/kmc_workshop/inputs/")

# Import our feature table, establishing the first row as the rownames, and converting it to a matrix.
# This table contains our abundance data for each taxa across samples
featureTable <- read.table(paste0(path,"feature-table.tsv"), header=T, check.names=F, row.names=1)
featureTable <- as.matrix(featureTable)

# Import our taxonomy table
# This table links the ASV names/codes to its corresponding taxonomic classification
taxTable <- read.csv(paste0(path,"taxonomy_form.csv"), header=TRUE, row.names=1)
taxTable <- as.matrix(taxTable)

# Import our rooted tree (read_tree is a function from the `phyloseq` package)
# This newick object represents the phylogenetic relationships between ASVs
Tree <- read_tree(paste0(path,"tree.nwk"))

# Import our metadata table
# This table contains sample-related information (such as lifestyle, age, sex etc)
metadataTable <- read.table(paste0(path,"metadata.tsv"), header=TRUE,sep="\t", row.names=1)

Subset samples

We will subset the dataset to make our workshop more dynamic and individualized! Let’s list the countries with the urbanism information using the count function in R. Choose any two countries from the list below that would like to work with during the next analysis steps. Don’t worry, you can always come back and change your selection if you’d like.

Code
metadataTable %>% count(country, urbanism)
Code
# Choose the two countries you'd like to work with and subset the metadata table

subset_metadata <- subset(metadataTable, country == "Ghana" | country == "Rwanda")

Import and integrate data into the phyloseq package

We will combine our datatypes into a single phyloseq object, which allows us to store the microbiome data and access all of the package’s functions

Code
# Component 1: the OTU/ASV table 
features <- otu_table(featureTable, taxa_are_rows = TRUE)

# Component 2: taxonomy containing Kingdom, Phylum, Class, Order, Family, Genus, Species
taxonomy <- tax_table(taxTable)

# Component 3: phylogenetic tree
tree <- Tree

# Component 4: metadata
metadata <- sample_data(subset_metadata)

# Combine all into a single phyloseq object
ps <- phyloseq(features, taxonomy, metadata, tree)

# Check the class of the created object to confirm it's a phyloseq object
class(ps)

# Print a summary of the phyloseq object
ps

# Let's save this object so we can use it in another script later
save(ps, file = "~/kmc_workshop/R_objects/subset_phyloseq_object.RData")

# To use this object again, just run: load("phyloseq_object.RData")

Data exploration

Explore the phyloseq object

Code
nsamples(ps) # Number of samples
ntaxa(ps) # Number of ASVs
rank_names(ps) # Taxonomy levels
sample_variables(ps) # What metadata
sample_sums(ps) # How many reads per sample

Distribution of Reads per ASV

Let’s take a look at the structure of our data using ggplot2

Code
# Create a summary table with total read count per ASV
dataTable <- data.frame(tax_table(ps), TotalCounts = taxa_sums(ps), ASV = taxa_names(ps))

# Use `ggplot2` to visualize the distribution of these reads with a histogram
ggplot(dataTable, aes(TotalCounts)) + 
  geom_histogram() + ggtitle("Distribution of Reads per ASVs") + 
  labs(x = "Total Reads per ASV", y = "Number of ASVs")

# Save the plot in our working directory for future reference
ggsave("Reads_per_ASV.pdf") 

We can observe that most of the ASVs in our dataset are rare. In microbiome studies, it is common to find that the majority of the microbial taxa have low abundance, while only a small number of ASVs dominate the community.

  • Rare Taxa: The high frequency of ASVs with low read counts suggests that a large portion of the microbial community consists of rare taxa.

  • Dominant Taxa: The smaller number of ASVs with high read counts represent the core microbial community.

  • Implications for Analysis: The distribution of reads per ASV is important to interpret diversity metrics and when considering normalization or filtering. For example, we might need to decide whether to include or exclude rare taxa in downstream analyses, as they can influence the results of alpha and beta diversity calculations.

Alpha Diversity Analysis

Manual Calculation of Alpha Diversity Metrics

Next, let’s manually calculate a few key alpha diversity metrics for a single sample. This will give you a sense of what each metric represents. By calculating the metrics manually, you will get a closer look at what is taken into account in each formula, which will later be compared to the automatically generated values using the phyloseq package.

Code
# Randomly select a sample from your subsetted Feature table data
subset_samples <- featureTable[, colnames(featureTable) %in% rownames(subset_metadata)]

random_sample <- sample(colnames(subset_samples), size=1)
random_sample

# Calculate the total number of reads in the random sample to use as the denominator for proportions.
total_reads <- sum(subset_samples[, random_sample])

# Calculate the proportion of reads assigned to each ASV in the random sample
proportions <- subset_samples[, random_sample] / total_reads

Number of Observed ASV (or Observed Richness)

This metric is as straightforward as it sounds: simply the count of unique ASVs present in your sample. We calculate it by summing the ASVs that have a read count greater than zero.

Code
observed_richness_manual <- sum(subset_samples[, random_sample] > 0)
observed_richness_manual

Shannon Diversity

Shannon diversity accounts for both the abundance and evenness of the ASVs in the sample. We calculate it by summing the proportion of reads * log(proportion) for each ASV.

Code
shannon_manual <- -sum(proportions[proportions > 0] * log(proportions[proportions > 0]))
shannon_manual

Chao1

Chao1 accounts for unseen species by considering singletons (ASVs with only one read) and doubletons (ASVs with two reads). It considers an estimate of the number of species we were not able to detect in our sequencing efforts. If we observe species with very few reads, it is likely that other rare species are present but have not been detected. It does not account for evenness.

Code
singleton <- sum(subset_samples[, random_sample] == 1) # Number of singletons
doubleton <- sum(subset_samples[, random_sample] == 2) # Number of doubletons

chao1_manual <- observed_richness_manual + (singleton^2 / (2 * max(doubleton, 1)))
chao1_manual

Alpha Diversity Calculations using phyloseq

Code
# Estimate richness metrics available in `phyloseq` package using our ps object
alpha_diversity <- estimate_richness(ps)

# Quickly ensuring readability:
rownames(alpha_diversity) <- gsub("^X", "", rownames(alpha_diversity))
alpha_diversity$sample <- rownames(alpha_diversity)

# Let's see our alpha diversity values
head(alpha_diversity)

# Let's see how our random sample's results compare to our manually calculated results
subset(alpha_diversity, sample == random_sample)

Visualizations using ggplot2

In the following steps, we will implement the ggplot2 package to create and explore various visualization options. We will integrate our metadata to our alpha diversity results to provide context to our results and explore how diversity metrics are distributed across sample groups.

Code
# Convert our sample_data to a data frame
metadata <- sample_data(ps)

# Combine alpha diversity metrics and sample data
alpha_div_meta <- cbind(metadata, alpha_diversity)

# Let's check what our data frame looks like
head(alpha_div_meta)

First, we will plot how Shannon diversity is distributed across Lifestyle groups. Using boxplots is the most common visualization for comparing alpha diversity across categorical variables. Feel free to explore the other metrics and categorical sample groups to obtain further insights.

Code
# In this plot, Lifestyle categories will be plotted on the X axis and Shannon diversity index values on the Y axis.

p_Shannon_urbanism <- ggplot(alpha_div_meta, aes(x = urbanism, y = Shannon, fill = urbanism)) +
  geom_boxplot() + # This command specifies we will plot using boxplots
  theme_minimal() +
  ggtitle("Shannon Diversity Across Urbanism Groups") +
  labs(x = "Urbanism", y = "Shannon Diversity")

p_Shannon_urbanism

# Save the plot for future reference
ggsave("~/kmc_workshop/results/alphadiv_shannon_urbanism_boxplot.pdf")

Another option is using violin plots, which show the distribution shape in a different way.

Code
p_Shannon_country <- ggplot(alpha_div_meta, aes(x = locality, y = Shannon, fill = country)) +
  geom_violin(trim = FALSE) +
  theme_minimal() + geom_jitter(width=0.1) +
  ggtitle("Shannon Diversity Across Countries") +
  labs(x = "Country", y = "Shannon Diversity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_Shannon_country

# Save the plot for future reference
ggsave("~/kmc_workshop/results/alphadiv_shannon_country_violin.pdf")

We can also explore relationships between alpha diversity and continuous metadata variables, such as age, and latitude. Let’s see this using a scatter plot.

Code
# Scatter plot of Shannon diversity vs. age

p_Shannon_country <- ggplot(alpha_div_meta, aes(x = age, y = Shannon)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +  # Add a linear regression line
  theme_minimal() + stat_cor() +
  ggtitle("Shannon Diversity vs. Age") +
  labs(x = "Age", y = "Shannon Diversity")

p_Shannon_country

# Save the plot for future reference
ggsave("~/kmc_workshop/results/alphadiv_shannon_age_scatterplot.pdf")

In ggplot2 we also have the option of using facets to compare diversity across a combination of variables, such as urbanism and sex.

Code
# Faceted boxplots of Shannon diversity vs Age and Urbanism

p_Shannon_scatter <- ggplot(alpha_div_meta, aes(x = urbanism, y = Shannon, fill = sex)) +
  geom_boxplot() + geom_jitter(width=0.1) +
  facet_wrap(~ sex) + # Here we establish the facets
  theme_minimal() +
  ggtitle("Shannon Diversity by Urbanism and Sex") +
  labs(x = "Urbanism", y = "Shannon Diversity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_Shannon_scatter

# Save the plot for future reference
ggsave("~/kmc_workshop/results/alphadiv_shannon_ageUrbanism_facets.pdf")

ASK: Try with other variables!

Now that you know how to create plots, apply similar steps to create a boxplot for the Simpson index across countries.

Code
p_Simpson_lifestyle <- ggplot(alpha_div_meta, aes(x = country, y = Simpson, fill = country)) +
  geom_boxplot() + geom_jitter(width=0.1) +
  theme_minimal() +
  ggtitle("Simpson Diversity Across Countries") +
  labs(x = "Country", y = "Simpson Diversity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Display the plot
p_Simpson_lifestyle
Code
# Run your own code here:

Rarefaction in alpha diversity analysis

In our exploration of alpha diversity, it is essential to consider whether we should rarefy our data. Rarefaction is a technique used to ensure that diversity estimates are comparable across samples with different levels of sampling depth (number of reads).

It is likely that diversity increases according to the number of reads each sample has. Let’s take a look below.

Code
# We will plot the number of ASVs per sample vs Shannon diversity indexes
ggplot(alpha_div_meta, aes(x = sample_sums(ps), y = Shannon)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  ggtitle("Shannon Diversity vs. Sequencing Depth") +
  labs(x = "Sequencing Depth", y = "Shannon Diversity")

# Next, let's plot number of ASVs per sample vs Number of Observed ASVs

ggplot(alpha_div_meta, aes(x = sample_sums(ps), y = Observed)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  ggtitle("Number of Observed ASVs vs. Sequencing Depth") +
  labs(x = "Sequencing Depth", y = "Number of Observed ASVs")

The plot demonstrates that deeper sequencing results in higher diversity estimates, which can be a source of bias. To ensure that diversity comparisons between samples are not influenced by differences in sequencing depth, we can apply rarefaction. This standardizes the number of reads across samples, allowing for more accurate and comparable diversity estimates.

To decide how we will standardize, we can plot a rarefaction curve using the vegan package. A rarefaction curve plots the number of observed ASVs against the number of reads sampled. It shows how species richness accumulates as more reads are added.

Code
# Using the `rarecurve` function, we will plot a rarefaction curve.
rarecurve(t(subset_samples), step = 200, cex = 0.6)

# Let's zoom in on the X axis identify where we can establish a cut off point
rarecurve(t(subset_samples), step = 200, cex = 0.6, xlim=c(0,20000))

Choosing the Sample Depth Cutoff

When rarefying, select a cutoff that balances keeping as many samples as possible while standardizing sequencing depth across all samples. Today we’ll use a cutoff of 8000 sequences per sample, ensuring enough samples are retained for downstream analysis.

Code
# Use the rarefy_even_depth function from the phyloseq package

ps.rare = rarefy_even_depth(ps, rngseed=1, sample.size=8000, replace=F)

# Let's save this phyloseq object in our environment in case we need it later:
save(ps.rare, file = "R_objects/subset_rare_phyloseq_object.RData")

# Let's take a look at the rarefied phyloseq object
ps.rare

Using our rarefied phyloseq object, let’s recalculate and re-plot the alpha diversity measurements.

Code
# Estimate richness metrics available in `phyloseq` package using our ps object
alpha_diversity_rare <- estimate_richness(ps.rare)

# Quickly ensuring readability:
rownames(alpha_diversity_rare) <- gsub("^X", "", rownames(alpha_diversity_rare))
alpha_diversity_rare$sample <- rownames(alpha_diversity_rare)

# Let's see our alpha diversity values
head(alpha_diversity_rare)

# Combine alpha diversity metrics and sample data
alpha_div_rare_meta <- cbind(metadata, alpha_diversity)

Let’s visualize our results and compare with non-rarefied data. We will plot both figures side by side using the cowplot package.

Code
# In this plot, Urbanism categories will be plotted on the X axis and Shannon diversity index values on the Y axis. We will use

p_Shannon_urbanism_rare <- ggplot(alpha_div_rare_meta, aes(x = urbanism, y = Shannon, fill = urbanism)) +
  geom_boxplot() + # This command specifies we will plot using boxplots
  theme_minimal() +
  ggtitle("Rarefied to 8000 reads") +
  labs(x = "Urbanism", y = "Shannon Diversity") + theme(legend.position = "none")

plot_grid(p_Shannon_urbanism, p_Shannon_urbanism_rare, nrow=1)

# Save the plot for future reference
ggsave("~/kmc_workshop/results/alphadiv_shannon_lifestyle_boxplot_rare.pdf")

TASK: Try rarefying your data to a lower depth. How does that affect the amount of OTUs you have to work with?

Code
# Use the rarefy_even_depth function from the phyloseq package

ps.rare2 = rarefy_even_depth(ps, rngseed=1, sample.size=4000, replace=F)

# Let's take a look at the rarefied phyloseq object
ps.rare2
Code
# Run your own code here:

Statistical Analysis of Alpha Diversity

Now that we have visualized the alpha diversity metrics using both rarefied and non-rarefied data, it’s time to move on to statistical analyses. This step will help us determine whether the differences in alpha diversity between different groups are statistically significant.

We will: 1. Test for Normality: Assess the distribution of alpha diversity metrics to determine whether parametric or non-parametric tests are appropriate. 2. Apply Statistical Tests: Based on the normality test results, we will apply either parametric tests (t-tests, ANOVA) or non-parametric tests (Wilcoxon, Kruskal-Wallis) to compare alpha diversity across groups. 3. Visualize Group Comparisons: Generate plots with statistical annotations to visually interpret the differences between groups.

Code
# Let's assess whether the alpha diversity data we will plot follows a normal distribution (which will influence whether to use a parametric or non-parametric test)

# Test normality of Shannon diversity using Shapiro-Wilk test:
shapiro.test(alpha_diversity$Shannon)

If data is normally distributed (p-value > 0.05)

Code
# For 2 groups, use a t-test.
t.test(Shannon ~ urbanism, data = alpha_div_meta)

# For multiple groups, a One-way ANOVA:
anova_result <- aov(Shannon ~ country, data = alpha_div_meta)
summary(anova_result)

If data is NOT normally distributed (p-value < 0.05)

Code
# For 2 groups, use a Wilcoxon rank-sum test (also known as the Mann-Whitney U test)
wilcox.test(Shannon ~ urbanism, data = alpha_div_meta)

# For multiple groups, use Kruskal-Wallis test
kruskal.test(Shannon ~ country, data = alpha_div_meta)

Post-hoc tests

If ANOVA or Kruskal-Wallis tests show significant differences, we can use post-hoc tests to determine which groups differ from each other.

Code
# For ANOVA, use Tukey’s Honest Significant Differences (Tukey HSD) for post-hoc analysis.

# Post-hoc test for ANOVA
TukeyHSD(anova_result)

# For Kruskal-Wallis, use pairwise comparisons with Wilcoxon tests (adjusting for multiple comparisons).

# Post-hoc pairwise comparisons using Wilcoxon test
pairwise.wilcox.test(alpha_div_meta$Shannon, alpha_div_meta$country, p.adjust.method = "BH")

Visualization

Using the stat_compare_means function of the ggpubr package, we can add p values to our plots

Code
# Using the method and the label arguments, we can select the tests to be performed and how they will be displayed

ggplot(alpha_div_rare_meta, aes(x = urbanism, y = Shannon, fill = urbanism)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Shannon Diversity by Urbanism",
       x = "Urbanism",
       y = "Shannon Diversity") + 
       stat_compare_means(method = "wilcox.test", label = "p.format", # Also try changing the "p.format" to "p.signif" to see a different type of annotation
                       label.y = 6)  # Adjusting the label position
       
ggsave("~/kmc_workshop/results/shannon_urbanism_rare_pvalue.pdf")

TASK: If you have time to spare, how about trying to plot a violin plot that compares Observed Richness between urbanism levels with countries as facets?

Code
ggplot(alpha_div_rare_meta, aes(x = urbanism, y = Observed, fill = urbanism)) +
  geom_violin(trim = FALSE) + geom_jitter(width=0.1) + 
  theme_minimal() +
  labs(title = "Observed Richness by Urbanism across Countries",
       x = "Urbanism",
       y = "Observed Richness") +
  facet_wrap(~ country) +  # Facet by country
  stat_compare_means(method = "wilcox.test", label = "p.format", label.y=600) +  # You can also try "p.signif"
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Adjust x-axis text for better readability
Code
# Run your own code here: