Abstract
The reconstruction of microbial genomes from large metagenomic datasets is a critical procedure for finding uncultivated microbial populations and defining their functional roles. Achieving this requires metagenomic binning, that is, clustering the assembled contigs into draft genomes. Most existing computational tools, however, neglect one important property of metagenomic data: noise. To further improve the metagenomic binning step and reconstruct better metagenomes, we propose a deep Contrastive Learning framework for Metagenome Binning (CLMB), which can efficiently eliminate the disturbance of noise and produce more stable and robust results. Essentially, instead of denoising the data explicitly, we add simulated noise to the training data and force the deep learning model to produce similar and stable representations for both the noise-free data and the distorted data. Consequently, the trained model is robust to noise and handles it implicitly during usage. CLMB significantly outperforms the previous state-of-the-art binning methods, recovering the most near-complete genomes on almost all the benchmarking datasets (up to 17% more reconstructed genomes than the second-best method). It also improves the performance of bin refinement, reconstructing 8–22 more high-quality (HQ) genomes and 15–32 more middle-quality genomes than the second-best result. Moreover, in addition to being compatible with the binning refiner, CLMB alone recovers on average 15 more HQ genomes than the refiner of VAMB and MaxBin on the benchmarking datasets. On a real mother-infant microbiome dataset with 110 samples, CLMB is scalable and practical, recovering 365 high-quality and middle-quality genomes (including 21 new ones) and providing insights into microbiome transmission. CLMB is open-source and available at https://github.com/zpf0117b/CLMB/.
Notes
- 1. Minibatch k-means and DBSCAN are implemented by scikit-learn: https://scikit-learn.org. The iterative medoid clustering algorithm is implemented by [10]: https://github.com/RasmussenLab/vamb/blob/master/doc/tutorial.ipynb.
- 2. You can get the whole package data from https://data.cami-challenge.org/participate, or get the contigs and calculated abundance from https://codeocean.com/capsule/1017583/tree/v1.
- 3.
- 4.
- 5. The Shannon entropies of the five datasets were calculated by [10] and are reported in their Supplementary Table 1.
- 6. The variable recprecof in class Binning.
References
Van Dijk, E.L., Auger, H., Jaszczyszyn, Y., Thermes, C.: Ten years of next-generation sequencing technology. Trends Genet. 30(9), 418–426 (2014)
Tringe, S., Rubin, E.: Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005)
Quince, C., Walker, A., Simpson, J., et al.: Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 35, 833–844 (2017)
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010)
Alneberg, J., Bjarnason, B., de Bruijn, I., et al.: Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014)
Kislyuk, A., Bhatnagar, S., Dushoff, J., et al.: Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinform. 10, 1–16 (2009)
Kang, D.D., Froula, J., Egan, R., Wang, Z.: Metabat: an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015)
Kang, D.D., et al.: Metabat2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019)
Wu, Y.-W., Simmons, B.A., Singer, S.W.: Maxbin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 15 (2016)
Nissen, J.N., Johansen, J., Allesøe, R.L., et al.: Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021)
Zorrilla, F., Buric, F., Patil, K.R., Zelezniak, A.: metaGEM: reconstruction of genome scale metabolic models directly from metagenomes. Nucleic Acids Res. 49(21), e126–e126 (2021)
van Belkum, A., Burnham, C.D., Rossen, J.W.A., et al.: Innovative and rapid antimicrobial susceptibility testing systems. Nat. Rev. Microbiol. 18, 299–311 (2020)
Fischer-Hwang, I., Ochoa, I., Weissman, T., et al.: Denoising of aligned genomic data. Sci. Rep. 15067 (2019)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
Han, W., et al.: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. bioRxiv (2021)
Sczyrba, A., Hofmann, P., Belmann, P., et al.: Critical assessment of metagenome interpretation-a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. Arxiv (2014). https://arxiv.org/abs/1312.6114
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. Proc. Mach. Learn. Res. 1278–1286 (2014)
Sculley, D.: Web-scale k-means clustering. In: Proceedings of 19th International Conference on World Wide Web, pp. 1177–1178 (2010)
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD 1996 Proceedings (1996)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. Arxiv (2015). https://arxiv.org/abs/1502.03167
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. Arxiv (2012). https://arxiv.org/pdf/1207.0580.pdf
Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. Arxiv (2013). https://arxiv.org/pdf/1207.0580.pdf
Doersch, C.: Tutorial on variational autoencoders (2021). https://arxiv.org/abs/1606.05908
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. Arxiv (2017). https://arxiv.org/abs/1412.6980
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Li, H., et al.: The sequence alignment/map format and samtools. Bioinformatics 25, 2078–2079 (2009)
Bowers, R.M., et al.: Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017)
Haghighat, M., Abdel-Mottaleb, M., Alhalabi, W.: Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans. Inf. Forensics Secur. 11, 1984–1996 (2016)
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. Ser. A Math. Phys. Eng. Sci. 374, 20150202 (2016)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Uritskiy, G.V., DiRuggiero, J., Taylor, J.: Metawrap-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 158 (2018)
Song, W.Z., Thomas, T.: Binning_refiner: improving genome bins through the combination of different binning programs. Bioinformatics 33, 1873–1875 (2017)
Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W.: CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015)
Ferretti, P., et al.: Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe 24, 133–145.e5 (2018)
Pasolli, E., et al.: Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019)
Leinonen, R., et al.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011)
Chaumeil, P.-A., Mussig, A.J., Hugenholtz, P., Parks, D.H.: GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics 36, 1925–1927 (2020)
Li, Y., et al.: DLBI: deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy. Bioinformatics ISMB 34(13), i284–i294 (2018)
Li, Y., et al.: HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes. Microbiome 9, 1–12 (2021)
Li, Y., et al.: Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 166, 4–21 (2019)
Chen, X., Li, Y., Umarov, R., Gao, X., Song, L.: RNA secondary structure prediction by learning unrolled algorithms. In: International Conference on Learning Representations 2020 (2020)
Li, H., et al.: Modern deep learning in bioinformatics. J. Mol. Cell Biol. 12, 823–827 (2020)
Wei, J., Chen, S., Zong, L., Gao, X., Li, Y.: Protein-RNA interaction prediction with deep learning: structure matters. arXiv preprint arXiv:2107.12243 (2021)
Jain, C., Rodriguez-R, L.M., Phillippy, A.M., et al.: High throughput ANI analysis of 90k prokaryotic genomes reveals clear species boundaries. Nat. Commun. 5114 (2018)
Chen, S., Zhou, Y., Chen, Y., Gu, J.: fastp: an ultra-fast all-in-one fastq preprocessor. Bioinformatics 34, i884–i890 (2018)
Li, D., Liu, C.-M., Luo, R., Sadakane, K., Lam, T.-W.M.: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31(10), 1674–1676 (2015)
Li, D., et al.: Megahit v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods (2016)
Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016)
Letunic, I., Bork, P.: Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021)
Appendices
5 Appendix
A Figures
B Tables
C Methods
In this section, we show the methods and experiments in our research.
C.1 Feature Calculation of TNFs and Abundance
We use the same approach to calculate TNFs and abundance as the previous work [10]. For each contig, we count the frequencies of each tetramer with definite bases and, to satisfy statistical constraints, project them into a 103-dimensional independent orthonormal space to obtain TNFs [6]. As a result, the TNFs for each contig form a 103-dimensional numerical vector. We also count the number of individual reads mapped to each contig. More specifically, a read mapped to n contigs counts 1/n towards each. The read counts are normalized by sequence length and total number of mapped reads, which generates the abundance value in reads per kilobase sequence per million mapped reads (RPKM). The resulting abundance for each contig is an s-dimensional numerical vector, where s is the number of samples. TNFs are normalized by z-scaling each tetranucleotide across the sequences, and abundances are normalized across samples.
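The feature calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used by CLMB or VAMB: the function names are hypothetical, the projection of the 256 tetramer frequencies into the 103-dimensional orthonormal space is omitted because the projection matrix is not given here, and each read is assumed to be supplied as a list of the distinct contig indices it maps to.

```python
from itertools import product
from statistics import mean, pstdev

# The 256 tetramers over definite bases, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetramer_frequencies(seq):
    """Frequencies of tetramers made only of A/C/G/T (other windows are skipped)."""
    counts = [0] * len(TETRAMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # None when the window holds e.g. an N
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(TETRAMERS)


def rpkm(read_hits, contig_lengths):
    """RPKM per contig for one sample.

    read_hits: one list of distinct contig indices per mapped read (assumed input).
    contig_lengths: contig lengths in bases.
    """
    if not read_hits:
        return [0.0] * len(contig_lengths)
    counts = [0.0] * len(contig_lengths)
    for hits in read_hits:
        for c in hits:
            counts[c] += 1.0 / len(hits)  # a read mapped to n contigs adds 1/n to each
    millions = len(read_hits) / 1e6  # total mapped reads, in millions
    return [cnt / (length / 1000) / millions
            for cnt, length in zip(counts, contig_lengths)]


def zscale_columns(rows):
    """Z-scale each column (one tetranucleotide) across rows (sequences)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns: avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]
```

Calling rpkm once per sample and stacking the results gives the s-dimensional abundance vector for each contig.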
C.2 Benchmarking
CLMB and VAMB [10] were run with default parameters and multi-split enabled. MetaBAT2 [8] was run with minClsSize = 1 and all other parameters at their defaults. MaxBin2 [9] was run with default parameters. The benchmarking results were calculated using the benchmark.py script implemented by [10]. The mapping of the recovered genomes to the reference genomes was taken from an intermediate result of the benchmark.py script (see Note 6). FastANI [45] with default parameters was used to calculate the ANI between the reference genomes. For the binning refinement experiment, we used the metaWRAP bin_refinement API [32, 33] with parameters -c 50 and -x 10, meaning that we kept the genomes with \(completeness>50\%\) and \(contamination<10\%\). The completeness and contamination of the recovered genomes were calculated using CheckM [34] with default parameters. We used the pipeline integrated in metaGEM [11] for the binning refinement experiment.
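To make the idea of counting recovered reference genomes concrete, the following sketch scores each bin against each reference genome by recall and precision and counts the genomes that pass both thresholds. It is an illustration only and does not reproduce benchmark.py: the function name, the contig-length-based definitions of recall and precision, and the default thresholds (0.9 and 0.95) are all assumptions made for this example.

```python
from collections import defaultdict


def count_recovered_genomes(bins, contig_genome, contig_length,
                            min_recall=0.9, min_precision=0.95):
    """Count reference genomes matched by at least one bin.

    bins: dict bin name -> list of contig names.
    contig_genome: dict contig name -> reference genome it derives from.
    contig_length: dict contig name -> length in bases.
    Thresholds are illustrative defaults, not values taken from the paper.
    """
    # Size of each reference genome, approximated by the sum of its contig lengths.
    genome_size = defaultdict(int)
    for contig, genome in contig_genome.items():
        genome_size[genome] += contig_length[contig]

    recovered = set()
    for contigs in bins.values():
        bin_size = sum(contig_length[c] for c in contigs)
        shared = defaultdict(int)  # bases the bin shares with each genome
        for c in contigs:
            shared[contig_genome[c]] += contig_length[c]
        for genome, bases in shared.items():
            recall = bases / genome_size[genome]  # fraction of the genome in the bin
            precision = bases / bin_size          # fraction of the bin from the genome
            if recall >= min_recall and precision >= min_precision:
                recovered.add(genome)
    return len(recovered)
```

Sweeping min_recall and min_precision over a grid gives a table of recovered-genome counts at increasingly strict criteria.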
C.3 Data Fusion Experiment
We define the feature data as the raw data, and obtain the projected data by projecting the feature data to a 32-dimensional space using PCA. We obtain the CLMB-encoded data by encoding the feature data to a 32-dimensional space with the deep contrastive learning framework. We assess the performance of these data by clustering them with the iterative medoid clustering algorithm and computing the benchmarking results. All the experiments on the CAMI2 datasets were run with default parameters and multi-split enabled, and the experiments on the MetaHIT datasets were run with default parameters and multi-split disabled. For comparison with other clustering methods, we use MiniBatchKMeans (n_clusters = 750, batch_size = 4096, max_iter = 25, init_size = 20000, reassignment_ratio = 0.02) and DBSCAN (eps = 0.35, min_samples = 2) as implemented by scikit-learn.
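The PCA projection and the two scikit-learn clusterers can be set up roughly as follows. This is a sketch under stated assumptions rather than the experiment script: the function names are hypothetical, random_state is added only to make the example reproducible, and n_clusters is exposed as an argument because MiniBatchKMeans needs at least as many samples as clusters. The remaining parameter values are the ones listed above.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MiniBatchKMeans
from sklearn.decomposition import PCA


def project_pca(features, n_components=32):
    """Project the feature matrix (contigs x features) to n_components dimensions."""
    return PCA(n_components=n_components).fit_transform(features)


def cluster_minibatch_kmeans(latent, n_clusters=750):
    """Cluster with MiniBatchKMeans using the parameters quoted in the text."""
    model = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=4096,
        max_iter=25,
        init_size=20000,
        reassignment_ratio=0.02,
        random_state=0,  # assumption: fixed seed for a reproducible example
    )
    return model.fit_predict(latent)


def cluster_dbscan(latent):
    """Cluster with DBSCAN; the label -1 marks points treated as noise."""
    return DBSCAN(eps=0.35, min_samples=2).fit_predict(latent)
```

On a small synthetic matrix, for example np.random.default_rng(0).normal(size=(200, 50)), project_pca returns an array of shape (200, 32), which can then be passed to either clustering function (with a smaller n_clusters for MiniBatchKMeans).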
C.4 Binning of the Mother-Infant Transmission Dataset
We downloaded the sequencing datasets of selected mother-infant pairs (marked as 10001, 10002, 10003, 10005, 10006, 10007, 10008, 10009, 10015, 10019) using SRA Toolkit and filtered them based on quality using fastp [46]. Then, we assembled the short sequence reads into contigs using MEGAHIT [47, 48] and mapped the reads to the contigs using kallisto [49] in order to speed up this process for large datasets. The coabundance across samples was subsequently calculated using the kallisto quantification algorithm. With the assemblies and coabundances, we ran CLMB with default parameters and multi-split enabled. Then, we split the fasta file into bins based on the clustering result using the create_fasta.py script. CheckM [34], run with the lineage-specific workflow and default parameters, was applied to the resulting bins to calculate the completeness and contamination, and only those with sufficient quality (\(completeness\ge 50\%\), \(contamination\le 5\%\)) were considered for further analysis. Then, we used GTDB-Tk [38] for taxonomic assignment of each bin and for phylogeny inference. We visualized the tree with iTOL [50].
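The quality filter applied to the CheckM results amounts to a simple threshold test. The sketch below assumes the completeness and contamination values have already been read into a dictionary; the function name and the input layout are hypothetical, and no CheckM output parsing is shown.

```python
def filter_bins(quality, min_completeness=50.0, max_contamination=5.0):
    """Return the names of bins passing the quality thresholds, sorted by name.

    quality: dict bin name -> (completeness %, contamination %).
    Both thresholds are inclusive, matching completeness >= 50% and
    contamination <= 5% as stated in the text.
    """
    return sorted(
        name
        for name, (completeness, contamination) in quality.items()
        if completeness >= min_completeness and contamination <= max_contamination
    )
```

For example, a bin with 50.0% completeness and 5.0% contamination is kept, while a bin with 80.0% completeness and 5.1% contamination is dropped.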
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, P., Jiang, Z., Wang, Y., Li, Y. (2022). CLMB: Deep Contrastive Learning for Robust Metagenomic Binning. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_23
DOI: https://doi.org/10.1007/978-3-031-04749-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science, Computer Science (R0)