CLMB: Deep Contrastive Learning for Robust Metagenomic Binning

Zhang, Pengfei; Jiang, Zhengyuan; Wang, Yixuan; Li, Yu

doi:10.1007/978-3-031-04749-7_23

Pengfei Zhang^9,10,
Zhengyuan Jiang^9,10,
Yixuan Wang^9,12 &
…
Yu Li^9,11

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 13278))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

2315 Accesses
2 Citations

Abstract

The reconstruction of microbial genomes from large metagenomic datasets is a critical procedure for finding uncultivated microbial populations and defining their microbial functional roles. To achieve that, we need to perform metagenomic binning, clustering the assembled contigs into draft genomes. Despite the existing computational tools, most of them neglect one important property of the metagenomic data, that is, the noise. To further improve the metagenomic binning step and reconstruct better metagenomes, we propose a deep Contrastive Learning framework for Metagenome Binning (CLMB), which can efficiently eliminate the disturbance of noise and produce more stable and robust results. Essentially, instead of denoising the data explicitly, we add simulated noise to the training data and force the deep learning model to produce similar and stable representations for both the noise-free data and the distorted data. Consequently, the trained model will be robust to noise and handle it implicitly during usage. CLMB outperforms the previous state-of-the-art binning methods significantly, recovering the most near-complete genomes on almost all the benchmarking datasets (up to 17% more reconstructed genomes compared to the second-best method). It also improves the performance of bin refinement, reconstructing 8–22 more high-quality genomes and 15–32 more middle-quality genomes more than the second-best result. Impressively, in addition to being compatible with the binning refiner, single CLMB even recovers on average 15 more HQ genomes than the refiner of VAMB and Maxbin on the benchmarking datasets. On a real mother-infant microbiome dataset with 110 samples, CLMB is scalable and practical to recover 365 high-quality and middle-quality genomes (including 21 new ones), providing insights into the microbiome transmission. CLMB is open-source and available at https://github.com/zpf0117b/CLMB/.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 64.99; Price excludes VAT (USA)

Softcover Book: USD 84.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Minibatch k-means and DBSCAN are implemented by scikit-learn: https://scikit-learn.org. Iterative medoid clustering algorithm are implemented by [10]: https://github.com/RasmussenLab/vamb/blob/master/doc/tutorial.ipynb.
2.
You can get the whole package data from https://data.cami-challenge.org/participate, or get the contigs and calculated abundance from https://codeocean.com/capsule/1017583/tree/v1.
3.
https://numpy.org.
4.
https://codeocean.com/capsule/1017583/tree/v1.
5.
The Shannon entropy of the five datasets are calculated by [10] on their Supplementary Table 1.
6.
The variable recprecof in class Binning.

References

Van Dijk, E.L., Auger, H., Jaszczyszyn, Y., Thermes, C.T.: years of next-generation sequencing technology. Trends Genet. 6, 9 (2014)
Google Scholar
Tringe, S., Rubin, E.: Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005)
Article Google Scholar
Quince, C., Walker, A., Simpson, J., et al.: Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 35, 833–844 (2017)
Article Google Scholar
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010)
Article Google Scholar
Alneberg, J., Bjarnason, B., de Bruijn, I., et al.: Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014)
Article Google Scholar
Kislyuk, A., Bhatnagar, S., Dushoff, J., et al.: Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinform. 10, 1–16 (2009)
Article Google Scholar
Kang, D.D., Froula, J., Egan, R., Wang, Z.: Metabat: an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015)
Google Scholar
Kang, D.D., et al.: Metabat2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019)
Article Google Scholar
Wu, Y.-W., Simmons, B.A., Singer, S.W.: Maxbin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 15 (2016)
Google Scholar
Nissen, J.N., Johansen, J., Allese, R.L., et al.: Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021)
Article Google Scholar
Zorrilla, F., Buric, F., Patil, K.R., Zelezniak, A.: metaGEM: reconstruction of genome scale metabolic models directly from metagenomes. Nucleic Acids Res. 49(21), e126–e126 (2021)
Article Google Scholar
van Belkum, A., Burnham, C.D., Rossen, J.W.A., et al.: Innovative and rapid antimicrobial susceptibility testing systems. Nat. Rev. Microbiol. 18, 299–311 (2020)
Article Google Scholar
Fischer-Hwang, I., Ochoa, I., Weissman, T., et al.: Denoising of aligned genomic data. Sci. Rep. 15067 (2019)
Google Scholar
Hinton, T.C., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
Google Scholar
Han, W., et al.: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. bioRxiv (2021)
Google Scholar
Sczyrba, A., Hofmann, P., Belmann, P., et al.: Critical assessment of metagenome interpretation-a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017)
Article Google Scholar
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. Arxiv (2014). https://arxiv.org/abs/1312.6114
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. Proc. Mach. Learn. Res. 1278–1286 (2014)
Google Scholar
Sculley, D.: Web-scale k-means clustering. In: Proceedings of 19th International Conference on World Wide Web, pp. 1177–1178 (2010)
Google Scholar
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD 1996 Proceedings (1996)
Google Scholar
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. Arxiv (2015). https://arxiv.org/abs/1502.03167
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. Arxiv (2012). https://arxiv.org/pdf/1207.0580.pdf
Maas, A.L., Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. Arxiv (2013). https://arxiv.org/pdf/1207.0580.pdf
Doersch, C.: Tutorial on variational autoencoders (2021). https://arxiv.org/abs/1606.05908
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. Arxiv (2017). https://arxiv.org/abs/1412.6980
Li, H., Durbin, R.: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Article Google Scholar
Li, H., et al.: The sequence alignment/map format and samtools. Bioinformatics 25, 2078–2079 (2009)
Article Google Scholar
Bowers, R.M., et al.: Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017)
Article Google Scholar
Haghighat, M., Abdel-Mottaleb, M., Alhalabi, W.: Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans. Inf. Forensics Secur. 11, 1984–1996 (2016)
Article Google Scholar
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. Ser. A Math. Phys. Eng. Sci. 374, 20150202 (2016)
Google Scholar
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
MATH Google Scholar
Uritskiy, G.V., DiRuggiero, J., Taylor, J.: Metawrap-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 158 (2018)
Google Scholar
Song, W.Z., Thomas, T.: Binning_refiner: improving genome bins through the combination of different binning programs. Bioinformatics 33, 1873–1875 (2017)
Article Google Scholar
Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W.: CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015)
Article Google Scholar
Ferretti, P., et al.: Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe 24, 133–145.e5 (2018)
Google Scholar
Pasolli, E., et al.: Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019)
Google Scholar
Leinonen, R., et al.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011)
Article Google Scholar
Chaumeil, P.-A., Mussig, A.J., Hugenholtz, P., Parks, D.H.: GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics 36, 1925–1927 (2020)
Google Scholar
Li, Y., et al.: DLBI: deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy. Bioinformatics ISMB 34(13), i284–i294 (2018)
Article Google Scholar
Li, Y., et al.: HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes. Microbiome 9, 1–12 (2021)
Article Google Scholar
Li, Y., et al.: Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 166, 4–21 (2019)
Article Google Scholar
Chen, X., Li, Y., Umarov, R., Gao, X., Song, L.: RNA secondary structure prediction by learning unrolled algorithms. In: International Conference on Learning Representations 2020 (2020)
Google Scholar
Li, H., et al.: Modern deep learning in bioinformatics. J. Mol. Cell Biol. 12, 823–827 (2020)
Article Google Scholar
Wei, J., Chen, S., Zong, L., Gao, X., Li, Y.: Protein-RNA interaction prediction with deep learning: structure matters. arXiv preprint arXiv:2107.12243 (2021)
Jain, C., Rodriguez-R, L.M., Phillippy, A.M., et al.: High throughput ANI analysis of 90k prokaryotic genomes reveals clear species boundaries. Nat. Commun. 5114 (2018)
Google Scholar
Chen, S., Zhou, Y., Chen, Y., Gu. J.: fastp: an ultra-fast all-in-one fastq preprocessor. Bioinformatics 34, i884–i890 (2018)
Google Scholar
Li, D., Liu, C.-M., Luo, R., Sadakane, K., Lam, T.-W.M.: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics 31(10), 1674–1676 (2015)
Article Google Scholar
Li, D., et al.: Megahit v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods (2016)
Google Scholar
Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016)
Article Google Scholar
Letunic, I., Bork, P.: Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, CUHK, Hong Kong SAR, China
Pengfei Zhang, Zhengyuan Jiang, Yixuan Wang & Yu Li
University of Science and Technology of China, Hefei, Anhui, China
Pengfei Zhang & Zhengyuan Jiang
The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, Shenzhen, 518057, China
Yu Li
Department of Mathematics, HIT, Weihai, 264209, China
Yixuan Wang

Authors

Pengfei Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Zhengyuan Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Yixuan Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yu Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yu Li .

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Itsik Pe'er

Appendices

5 Appendix

A Figures

Table 1. Number of genomes at the strain level reconstructed with a precision of at least 95%

Full size table

Table 2. Number of genomes at the strain level reconstructed with a precision of at least 95%

Full size table

Table 3. Number of genomes at the species level reconstructed with a precision of at least 95%

Full size table

B Tables

Table 4. Number of genomes at the genus level reconstructed with a precision of at least 95%

Full size table

C Methods

In this section, we show the methods and experiments in our research.

1.1 C.1 Feature Calculation of TNFs and Abundance

We use the same approach to calculate TNFs and abundance as the previous work [10]. For each contig, we count the frequencies of each tetramer with definite bases, and, to satisfy statistical constraints, project them into a 103-dimensional independent orthonormal space to obtain TNFs [6]. As a result, the TNFs for each contig are a 103-dimensional numerical vector. We also count the number of individual reads mapped to each contig. More specifically, a read mapped to n contigs counts 1/n towards each. The read counts are normalized by sequence length and total number of mapped reads, which generates the abundance value in reads per kilobase sequence per million mapped reads (RPKM). The resulted abundance for each contig is a s-dimensional numerical vector, where s is the number of samples. TNFs are normalized by z-scaling each tetranucleotide across the sequences, and abundance are normalized across samples.

1.2 C.2 Benchmarking

CLMB and VAMB [10] were run with default parameters with multi-split enabled. MetaBAT2 [8] was run with setting minClsSize = 1 and other parameters as default. MaxBin2 [9] was run with default parameters. The benchmarking results were calculated using benchmark.py script implemented by [10]. The mapping of the recovered genomes to the reference genomes was the intermediate result^{Footnote 6} of benchmark.py script. FastANI [45] with default parameters was used to calculate ANI between the reference genomes. For the binning refinement experiment, we use metaWRAP bin_refinement API [32, 33] with parameters –c 50 and –x 10, indicating we keep the genomes qualifying \(completeness>50\%\) and \(contamination<10\%\). The completeness and contamination of the genomes recovered by the bins are calculated using CheckM [34] with default parameters. We use the pipeline integrated in MetaGEM [11] for binning refinement experiment.

1.3 C.3 Data Fusion Experiment

We define the feature data as the raw data, and obtained the projected data by projecting the feature data to 32-dimension space using PCA. For the CLMB-encoded data, we obtained them by encoding the feature data to 32-dimension space with the deep contrastive learning framework. We assess the performance of these data by clustering them with the iterative medoid clustering and obtained the benchmarking results. All the experiments on CAMI2 datasets were run with default parameters with multi-split enabled, and the experiments on MetaHIT datasets was run with default parameters with multi-split disabled. For comparison to other clustering methods, we use MiniBatchKMeans (n_clusters = 750, batch_size = 4096, max_iter = 25, init_size = 20000, reassignment_ratio = 0.02) and DBSCAN (eps = 0.35, min_samples = 2) implemented by scikit-learn.

1.4 C.4 Binning of the Mother-Infant Transmission Dataset

We downloaded the sequencing datasets of selected mother-infant pairs (marked as 10001, 10002, 10003, 10005, 10006, 10007, 10008, 10009, 10015, 10019) using SRA Toolkit and filtered them based on quality using fastp [46]. Then, we assembled the short sequence reads into contigs using MEGAHIT [47, 48] and mapped the reads to the contigs using kallisto [49] in order to speed up this process for large datasets. The coabundance across samples can be subsequently calculated using kallisto quantification algorithm. With the assemblies and coabundances, we ran CLMB with default parameters and multi-split enabled. Then, we splited the fasta file into bins based on the result of clustering using create_fasta.py script. CheckM [34] on lineage specific workflow with default parameters was applied to the resulting bins to calculate the completeness and contamination, and only those with sufficient quality (\(completeness\ge 50\%\), \(contamination\le 5\%\)) were considered for further analysis. Then, we use GTDB-tk [38] on for taxonomic assignment of each bins and phylogeny inference. We visualized the tree with iTOL [50].

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhang, P., Jiang, Z., Wang, Y., Li, Y. (2022). CLMB: Deep Contrastive Learning for Robust Metagenomic Binning. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_23

Download citation

DOI: https://doi.org/10.1007/978-3-031-04749-7_23
Published: 29 April 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

CLMB: Deep Contrastive Learning for Robust Metagenomic Binning

Abstract

Access this chapter

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendices

5 Appendix

A Figures

B Tables

C Methods

1.1 C.1 Feature Calculation of TNFs and Abundance

1.2 C.2 Benchmarking

1.3 C.3 Data Fusion Experiment

1.4 C.4 Binning of the Mother-Infant Transmission Dataset

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation