Human gut microbiome research focuses on populations living in high-income countries and to a lesser extent, non-urban agriculturalist and hunter-gatherer societies. The scarcity of research between these extremes limits our understanding of how the gut microbiota relates to health and disease in the majority of the world’s population. Here, we evaluate gut microbiome composition in transitioning South African populations using short- and long-read sequencing. We analyze stool from adult females living in rural Bushbuckridge (n = 118) or urban Soweto (n = 51) and find that these microbiomes are taxonomically intermediate between those of individuals living in high-income countries and traditional communities. We demonstrate that reference collections are incomplete for characterizing microbiomes of individuals living outside high-income countries, yielding artificially low beta diversity measurements, and generate complete genomes of undescribed taxa, including Treponema, Lentisphaerae, and Succinatimonas. Our results suggest that the gut microbiome of South Africans does not conform to a simple “western-nonwestern” axis and contains undescribed microbial diversity.
Stool samples were collected from women aged 40–72 years in Soweto, South Africa and Bushbuckridge Municipality, South Africa. Participants were recruited on the basis of participation in AWI-Gen56, a previous study in which genotype and extensive health and lifestyle survey data were collected. Human subjects research approval was obtained (Stanford IRB 43069, University of the Witwatersrand Human Research Ethics Committee M160121, Mpumalanga Provincial Health Research Committee MP_2017RP22_851) and informed consent was obtained from participants for all samples collected. Participants were not compensated for participation. Stool samples were collected and preserved in OmniGene Gut OMR-200 collection kits (DNA Genotek). Samples were frozen within 60 days of collection as per manufacturer’s instructions, followed by long-term storage at −80 °C. As the enrollment criteria for our study included previous participation in a larger human genomics project56, we had access to self-reported ethnicity for each participant (BaPedi, Ndebele, Sotho, Tsonga, Tswana, Venda, Xhosa, Zulu, Other, or Unknown). Samples from participants who tested HIV-positive or who did not consent to an HIV test were not analyzed. DNA was extracted from stool samples using the QIAamp PowerFecal DNA Kit (QIAGEN Cat. No. 12830) according to the manufacturer’s instructions except for the lysis step, in which samples were lysed using the TissueLyser LT (QIAGEN Cat. No. 85600) (30 s oscillations/3 min at 30 Hz). DNA concentration of all DNA samples was measured using Qubit Fluorometric Quantitation (DS DNA High-Sensitivity Kit, ThermoFisher Cat. No. {“type”:”entrez-protein”,”attrs”:{“text”:”Q32851″,”term_id”:”75280859″,”term_text”:”Q32851″}}Q32851). DNA sequencing libraries were prepared using the Nextera XT DNA Library Prep Kit (Illumina Cat. No. FC-131-1096). Final library concentration was measured using Qubit Fluorometric Quantitation and library size distributions were analyzed with the Bioanalyzer 2100 (Agilent G2939BA). Libraries were multiplexed and 150 bp paired-end reads were generated on the HiSeq 4000 platform (Illumina). Samples with greater than ~300 ng remaining mass and a peak fragment length of greater than 19,000 bp (with minimal mass under 4000 bp) as determined by a TapeStation 2200 (Agilent G2964AA) were selected for nanopore sequencing. Nanopore sequencing libraries were prepared using the 1D Genomic DNA by Ligation protocol (Oxford Nanopore Technologies SQK-LSK109) following standard instructions. Each library was sequenced with a full FLO-MIN106D R9 Version Rev D flow cell on a MinION sequencer for at least 60 h. Literature review criteria based on Brewster et al.4 were employed: PubMed, EMBASE, SCOPUS, and Web of Science were queried for observational and interventional research involving the human gut microbiome through January 2021. Terms including “gut microbiome” and “gut microbiota” and names of each of the 54 African countries were included in the search. Primary reports on the gut microbiome in African children and/or adults, utilizing either 16S rRNA or shotgun metagenomic sequencing and written in English, were included. Abstracts, secondary reports, poster presentations, reviews or editorials, and in vivo and in vitro studies were excluded. The list of relevant articles yielded by this search strategy was manually reviewed. Stool metagenomic sequencing reads were trimmed using TrimGalore v0.6.583 with a minimum quality score of 30 for trimming (–q 30) and minimum read length of 60 (–length 60). Trimmed reads were deduplicated to remove PCR and optical duplicates using htstream SuperDeduper v1.2.0 with default parameters. Reads aligning to the human genome (hg19) were removed using BWA v0.7.17-r118884. Taxonomy profiles were created with Kraken v2.0.9-beta with default parameters85 and (1) a comprehensive custom reference database containing all bacterial and archaeal genomes in GenBank assembled to “complete genome,” “chromosome,” or “scaffold” quality as of January 2020, and (2) the pre-built Struo50 GTDB release 95 database containing one genome per species. Bracken v2.2.0 was then used to re-estimate abundance at each taxonomic rank86. MetaPhlAn352 taxonomy profiles were also generated. Published data from additional adult populations were downloaded from the NCBI Sequence Read Archive or European Nucleotide Archive (Supplementary Table 4) and preprocessed and taxonomically classified as described above. The study by Backhed et al. sampled both mothers and infants: only the maternal samples were retained in this study. For datasets containing longitudinal samples from the same individual, one unique sample per individual was chosen (the first sample from each individual was chosen from the United States Human Microbiome Project cohort). K-mer sketches were computed using sourmash v2.0.060. Low abundance k-mers were trimmed using the “trim-low-abund.py” script from the khmer package v3.0.087 with a k-mer abundance cutoff of 3 (-C 3) and trimming coverage of 18 (-Z 18). Signatures were computed for each sample using the command “sourmash compute” with a compression ratio of 1000 (–scaled 1000) and k-mer lengths of 21, 31, and 51 (-k 21,31,51). Two signatures were computed for each sample: one signature tracking k-mer abundance (–track-abundance flag) for angular distance comparisons, and one without this flag for Jaccard distance comparisons. Signatures at each length of k were compared using “sourmash compare” with default parameters and the correct length of k specified with the -k flag. Unassembled metagenomic reads were functionally profiled using ShortBRED88 v0.9.3 with a pre-built antibiotic resistance database based on the Comprehensive Antibiotic Resistance Database89. Features were pre-filtered for >10% prevalence and statistical analysis was performed using MaAsLin v290 using the compound Poisson linear model (CPLM) and total sum scaling normalization with “site” as a fixed effect. Pangenomes were calculated with PanPhlAn v3.152 using parameters for increased sensitivity recommended by the authors of the tool: “–min_coverage 1–left_max 1.70–right_min 0.30”. MetaCyc pathways were profiled with HUMAnN v3.0.052 with default parameters, using the mpa_v30_CHOCOPhlAn_201901 database. Forward and reverse reads were concatenated into one file per sample prior to processing. Pathway abundances were normalized to copies per million and statistical analysis was performed using MaAsLin v2 using the CPLM and total sum scaling normalization with “site” as a fixed effect. Short-read metagenomic data were assembled with SPAdes v3.1591 and binned into draft genomes using a publicly available workflow (https://github.com/bhattlab/bhattlab_workflows/blob/master/binning/bin_das_tool_manysamp.snakefile, commit version bbe6511 as of Apr 20, 2021). Briefly, short reads were aligned to assembled contigs with BWA v0.7.1784 and contigs were subsequently binned into draft genomes with MetaBAT v2.1592, CONCOCT v1.1.093, and MaxBin v2.2.794. Default parameters were used for each binner, with the following exceptions: For the jgi_summarize_bam_contig_depths step of MetaBAT, minimum contig length was set at 1000 bp (–minContigLength 1000), minimum contig depth of coverage of 1 (–minContigDepth 1), and a minimum end-to-end percent identity of reads of 50 (–percentIdentity 50). Bins were aggregated and refined with DASTool v1.1.195. Bins were evaluated for size, contiguity, completeness, and contamination with QUAST v5.0.296, CheckM v1.0.1397, Prokka v1.14.698, Aragorn v1.2.3899, and Barrnap v0.9 (https://github.com/tseemann/barrnap/). We referred to published guidelines to designate genome quality66. Individual contigs from all assemblies were assigned taxonomic classifications with Kraken v2.0.966,85. To create de-replicated genome collections, genomes with completeness greater than 75% and contamination less than 10% (as evaluated by CheckM) were de-replicated using dRep v3.2.0100 with ANI threshold to form secondary clusters (-sa) at 0.99 (strain-level) or 0.95 (species-level). For comparison to UHGG species representatives, secondary ANI was set to 0.95. dRep chooses the genome with the highest score as the cluster representative according to the following formula: dRep score = A × Completeness − B × Contamination + C × (Contamination × (Strain heterogeneity/100)) + D × log(N50) + E × log(size) + F × (centrality−secondary ani). A through F are values which can be tuned by the user to change the relative importance of each parameter in choosing representative genomes. Default parameters (A = 1, B = 5, C = 1, D = 0.5, E = 0, F = 1) were used herein. Long-read data were assembled with Lathe v165. Briefly, Lathe implements basecalling with Guppy v2.3.5, assembly with Flye v2.4.2101, and short-read polishing with Pilon v1.23102. Contigs greater than 1000 bp were subsequently binned into draft genomes with MetaBAT v2.13 using minimum contig depth coverage of 1, minimum end-to-end percent identity of reads of 50, and otherwise using default parameters, then classified, and de-replicated as described above. Additional long-read polishing was performed using four iterations of polishing with Racon v1.4.10103 and long-read alignment using minimap2 v2.17-r941104, followed by one round of polishing with Medaka v0.11.5 (https://github.com/nanoporetech/medaka). Single-contig genomes were analyzed for GC skew using SkewIT v1105. Genomes of interest were plotted with the DNAPlotter GUI v18.1.0106. Draft genomes were additionally classified with GTDBtk v1.4.1 (classify_wf)107 using release 95 reference data. Direct comparisons between nMAGs and corresponding MAGs were performed by de-replicating high- and medium-quality nMAGs with MAGs assembled from the same sample. MAGs sharing at least 99% ANI with an nMAG were aligned to the nMAG regions using nucmer v3.1 and uncovered regions of the nMAG were annotated with prokka 1.14.6, VIBRANT v1.2.1108, and ResFams v1.2109. Phylogenetic trees for all de-replicated short- and long-read MAGs were constructed with GTDBtk v1.4.1 and visualized with iTOL v6110. To construct phylogenetic trees for taxa of interest, reference 16S rRNA sequences were downloaded from the Ribosomal Database Project (Release 11, update 5, September 30, 2016)111 and 16S rRNA sequences were identified from nanopore genome assemblies using Barrnap v0.9 (https://github.com/tseemann/barrnap/). Sequences were aligned with MUSCLE v3.8.1551112 with default parameters. Maximum-likelihood phylogenetic trees were constructed from the alignments with FastTree v2.1.10112,113 with default settings (Jukes-Cantor + CAT model). Support values for branch splits were calculated using the Shimodaira-Hasegawa test with 1000 resamples (default). Trees were visualized with FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). Statistical analyses were performed using R v4.0.2114 with packages MASS v7.3-53115, stats v4.0.2114, ggsignif v0.6.0116, and ggpubr v0.4.0117. Alpha and beta diversity were calculated using the vegan package v2.6.0118. Two-sided Wilcoxon rank-sum tests were used to compare alpha and beta diversity between cohorts. Count data were rarefied and normalized via cumulative sum scaling and log2 transformation119 prior to MDS. Data separation in MDS was assessed via PERMANOVA (permutation test with pseudo F ratios) using the adonis function from the vegan package. Differential microbial features between individuals living in Soweto and Bushbuckridge were identified from unnormalized count data output from Kraken 2 classification and Bracken abundance re-estimation (filtered for 20% prevalence and at least 500 sequencing reads per sample) using DESeq2 with the formula “~site”120. Plots were generated in R using the following packages: cowplot v1.0.0121, DESeq2 v1.28.0120, genefilter v1.70.0122, ggplot2 v3.3.2123, ggpubr v0.4.0, ggrepel v0.8.2124, ggsignif v0.6.0, gtools v3.8.2125, harrietr v0.2.3126, MASS v7.3-53, reshape2 v1.4.4127, tidyverse v1.3.0128, and vegan v2.6.0. Further information on research design is available in the Nature Research Reporting Summary linked to this article.