CLMB: Deep Contrastive Learning for Robust Metagenomic Binning

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2022)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13278)

Abstract

The reconstruction of microbial genomes from large metagenomic datasets is a critical procedure for finding uncultivated microbial populations and defining their functional roles. To achieve that, we need to perform metagenomic binning, that is, clustering the assembled contigs into draft genomes. Although many computational tools exist, most of them neglect one important property of metagenomic data: noise. To further improve the metagenomic binning step and reconstruct better metagenomes, we propose a deep Contrastive Learning framework for Metagenome Binning (CLMB), which can efficiently eliminate the disturbance of noise and produce more stable and robust results. Essentially, instead of denoising the data explicitly, we add simulated noise to the training data and force the deep learning model to produce similar and stable representations for both the noise-free data and the distorted data. Consequently, the trained model is robust to noise and handles it implicitly during usage. CLMB significantly outperforms the previous state-of-the-art binning methods, recovering the most near-complete genomes on almost all the benchmarking datasets (up to 17% more reconstructed genomes than the second-best method). It also improves the performance of bin refinement, reconstructing 8–22 more high-quality (HQ) genomes and 15–32 more middle-quality genomes than the second-best result. Impressively, in addition to being compatible with the binning refiner, CLMB alone recovers on average 15 more HQ genomes than the refiner of VAMB and MaxBin on the benchmarking datasets. On a real mother-infant microbiome dataset with 110 samples, CLMB is scalable and practical, recovering 365 high-quality and middle-quality genomes (including 21 new ones) and providing insights into microbiome transmission. CLMB is open-source and available at https://github.com/zpf0117b/CLMB/.

Notes

  1. Minibatch k-means and DBSCAN are implemented by scikit-learn: https://scikit-learn.org. The iterative medoid clustering algorithm is implemented by [10]: https://github.com/RasmussenLab/vamb/blob/master/doc/tutorial.ipynb.

  2. You can get the whole package data from https://data.cami-challenge.org/participate, or get the contigs and calculated abundance from https://codeocean.com/capsule/1017583/tree/v1.

  3. https://numpy.org.

  4. https://codeocean.com/capsule/1017583/tree/v1.

  5. The Shannon entropies of the five datasets were calculated by [10] in their Supplementary Table 1.

  6. The variable recprecof in class Binning.

References

  1. Van Dijk, E.L., Auger, H., Jaszczyszyn, Y., Thermes, C.: Ten years of next-generation sequencing technology. Trends Genet. 30(9), 418–426 (2014)

  2. Tringe, S., Rubin, E.: Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005)

  3. Quince, C., Walker, A., Simpson, J., et al.: Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 35, 833–844 (2017)

  4. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010)

  5. Alneberg, J., Bjarnason, B., de Bruijn, I., et al.: Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014)

  6. Kislyuk, A., Bhatnagar, S., Dushoff, J., et al.: Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinform. 10, 1–16 (2009)

  7. Kang, D.D., Froula, J., Egan, R., Wang, Z.: Metabat: an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015)

  8. Kang, D.D., et al.: Metabat2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019)

  9. Wu, Y.-W., Simmons, B.A., Singer, S.W.: Maxbin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 15 (2016)

  10. Nissen, J.N., Johansen, J., Allesøe, R.L., et al.: Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021)

  11. Zorrilla, F., Buric, F., Patil, K.R., Zelezniak, A.: metaGEM: reconstruction of genome scale metabolic models directly from metagenomes. Nucleic Acids Res. 49(21), e126–e126 (2021)

  12. van Belkum, A., Burnham, C.D., Rossen, J.W.A., et al.: Innovative and rapid antimicrobial susceptibility testing systems. Nat. Rev. Microbiol. 18, 299–311 (2020)

  13. Fischer-Hwang, I., Ochoa, I., Weissman, T., et al.: Denoising of aligned genomic data. Sci. Rep. 15067 (2019)

  14. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML (2020)

  15. Han, W., et al.: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. bioRxiv (2021)

  16. Sczyrba, A., Hofmann, P., Belmann, P., et al.: Critical assessment of metagenome interpretation-a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017)

  17. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. Arxiv (2014). https://arxiv.org/abs/1312.6114

  18. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. Proc. Mach. Learn. Res. 1278–1286 (2014)

  19. Sculley, D.: Web-scale k-means clustering. In: Proceedings of 19th International Conference on World Wide Web, pp. 1177–1178 (2010)

  20. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD 1996 Proceedings (1996)

  21. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. Arxiv (2015). https://arxiv.org/abs/1502.03167

  22. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. Arxiv (2012). https://arxiv.org/pdf/1207.0580.pdf

  23. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. Arxiv (2013). https://arxiv.org/pdf/1207.0580.pdf

  24. Doersch, C.: Tutorial on variational autoencoders (2021). https://arxiv.org/abs/1606.05908

  25. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. Arxiv (2017). https://arxiv.org/abs/1412.6980

  26. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)

  27. Li, H., et al.: The sequence alignment/map format and samtools. Bioinformatics 25, 2078–2079 (2009)

  28. Bowers, R.M., et al.: Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017)

  29. Haghighat, M., Abdel-Mottaleb, M., Alhalabi, W.: Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans. Inf. Forensics Secur. 11, 1984–1996 (2016)

  30. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. Ser. A Math. Phys. Eng. Sci. 374, 20150202 (2016)

  31. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  32. Uritskiy, G.V., DiRuggiero, J., Taylor, J.: Metawrap-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 158 (2018)

  33. Song, W.Z., Thomas, T.: Binning_refiner: improving genome bins through the combination of different binning programs. Bioinformatics 33, 1873–1875 (2017)

  34. Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W.: CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015)

  35. Ferretti, P., et al.: Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe 24, 133–145.e5 (2018)

  36. Pasolli, E., et al.: Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019)

  37. Leinonen, R., et al.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011)

  38. Chaumeil, P.-A., Mussig, A.J., Hugenholtz, P., Parks, D.H.: GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics 36, 1925–1927 (2020)

  39. Li, Y., et al.: DLBI: deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy. Bioinformatics ISMB 34(13), i284–i294 (2018)

  40. Li, Y., et al.: HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes. Microbiome 9, 1–12 (2021)

  41. Li, Y., et al.: Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 166, 4–21 (2019)

  42. Chen, X., Li, Y., Umarov, R., Gao, X., Song, L.: RNA secondary structure prediction by learning unrolled algorithms. In: International Conference on Learning Representations 2020 (2020)

  43. Li, H., et al.: Modern deep learning in bioinformatics. J. Mol. Cell Biol. 12, 823–827 (2020)

  44. Wei, J., Chen, S., Zong, L., Gao, X., Li, Y.: Protein-RNA interaction prediction with deep learning: structure matters. arXiv preprint arXiv:2107.12243 (2021)

  45. Jain, C., Rodriguez-R, L.M., Phillippy, A.M., et al.: High throughput ANI analysis of 90k prokaryotic genomes reveals clear species boundaries. Nat. Commun. 5114 (2018)

  46. Chen, S., Zhou, Y., Chen, Y., Gu, J.: fastp: an ultra-fast all-in-one fastq preprocessor. Bioinformatics 34, i884–i890 (2018)

  47. Li, D., Liu, C.-M., Luo, R., Sadakane, K., Lam, T.-W.M.: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31(10), 1674–1676 (2015)

  48. Li, D., et al.: Megahit v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods (2016)

  49. Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016)

  50. Letunic, I., Bork, P.: Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021)

Author information

Corresponding author

Correspondence to Yu Li.

Appendices

5 Appendix

A Figures

Fig. 7. Performance of different clustering algorithms on five datasets. Orange: DBSCAN. Green: outliers are excluded with DBSCAN first, and the remaining points are clustered with the minibatch k-means algorithm. Red: iterative medoid algorithm, which was developed by [10] and is used by CLMB. (Color figure online)

Fig. 8. Performance of CLMB with different numbers of samples. For any given number of samples, samples were randomly drawn 3 times and executed independently. For “single-sample”, all the samples were run independently. We note that for increasing numbers of samples, the random subsets chosen are not independent, because there are only 9 (Urog) or 10 (Airways, GI, Skin, Oral) samples in total. Orange: multi-split workflow of CLMB. Green: single-sample workflow of CLMB. (Color figure online)

Fig. 9. Performance of CLMB with different k-mer lengths on different datasets, assessed by the number of reconstructed NC strains. The performance varies among the datasets.

Table 1. Number of genomes at the strain level reconstructed with a precision of at least 95%
Table 2. Number of genomes at the strain level reconstructed with a precision of at least 95%
Table 3. Number of genomes at the species level reconstructed with a precision of at least 95%

B Tables

Table 4. Number of genomes at the genus level reconstructed with a precision of at least 95%

C Methods

In this section, we show the methods and experiments in our research.

C.1 Feature Calculation of TNFs and Abundance

We use the same approach to calculate TNFs and abundance as the previous work [10]. For each contig, we count the frequencies of each tetramer with definite bases, and, to satisfy statistical constraints, project them into a 103-dimensional independent orthonormal space to obtain TNFs [6]. As a result, the TNFs for each contig form a 103-dimensional numerical vector. We also count the number of individual reads mapped to each contig. More specifically, a read mapped to n contigs counts 1/n towards each. The read counts are normalized by sequence length and total number of mapped reads, which generates the abundance value in reads per kilobase sequence per million mapped reads (RPKM). The resulting abundance for each contig is an s-dimensional numerical vector, where s is the number of samples. TNFs are normalized by z-scaling each tetranucleotide across the sequences, and abundances are normalized across samples.
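The counting steps above can be illustrated with a minimal standard-library sketch. It is not the code used by CLMB or by [10]: the function names are hypothetical, the projection into the 103-dimensional space is omitted (only the raw 256 tetramer frequencies are computed), and the abundance normalization across samples is not shown.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

BASES = "ACGT"
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]  # all 256 tetramers


def tetramer_frequencies(seq):
    """Return 256 tetramer frequencies, counting only tetramers with definite bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if all(base in BASES for base in kmer):  # skip tetramers containing N etc.
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


def fractional_read_counts(read_hits, contigs):
    """A read mapped to n contigs contributes 1/n to each of them."""
    counts = {c: 0.0 for c in contigs}
    for hits in read_hits:  # hits: distinct contig names one read maps to
        for contig in hits:
            counts[contig] += 1.0 / len(hits)
    return counts


def rpkm(counts, lengths):
    """Reads per kilobase of sequence per million mapped reads."""
    total_mapped = sum(counts.values())
    if total_mapped == 0:
        return {c: 0.0 for c in counts}
    return {c: counts[c] / (lengths[c] / 1e3) / (total_mapped / 1e6) for c in counts}


def zscale_columns(rows):
    """z-scale each column (one tetramer) across all rows (one row per contig)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns are left at zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]
```

For one sample, `rpkm(fractional_read_counts(read_hits, lengths), lengths)` gives one abundance value per contig; repeating this for each of the s samples yields the s-dimensional abundance vector.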

C.2 Benchmarking

CLMB and VAMB [10] were run with default parameters with multi-split enabled. MetaBAT2 [8] was run with minClsSize = 1 and all other parameters at their defaults. MaxBin2 [9] was run with default parameters. The benchmarking results were calculated using the benchmark.py script implemented by [10]. The mapping of the recovered genomes to the reference genomes was taken from an intermediate result (see Note 6) of the benchmark.py script. FastANI [45] with default parameters was used to calculate the ANI between the reference genomes. For the binning refinement experiment, we used the metaWRAP bin_refinement API [32, 33] with parameters -c 50 and -x 10, indicating that we keep the genomes satisfying \(completeness>50\%\) and \(contamination<10\%\). The completeness and contamination of the recovered genomes were calculated using CheckM [34] with default parameters. We used the pipeline integrated in metaGEM [11] for the binning refinement experiment.
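The quality cutoff used in the refinement experiment amounts to a simple threshold test. The sketch below is only an illustration: it assumes completeness and contamination have already been read from a CheckM report into a list of dictionaries (a hypothetical input layout), and it does not call metaWRAP or CheckM.

```python
def passes_cutoff(completeness, contamination,
                  min_completeness=50.0, max_contamination=10.0):
    """True if a bin satisfies completeness > 50% and contamination < 10%."""
    return completeness > min_completeness and contamination < max_contamination


def filter_bins(bins, min_completeness=50.0, max_contamination=10.0):
    """Keep the names of bins that pass the cutoff.

    `bins` is a hypothetical list of dicts with the keys
    'name', 'completeness', and 'contamination' (both in percent).
    """
    return [
        b["name"]
        for b in bins
        if passes_cutoff(b["completeness"], b["contamination"],
                         min_completeness, max_contamination)
    ]


example = [
    {"name": "bin_1", "completeness": 97.2, "contamination": 1.1},
    {"name": "bin_2", "completeness": 48.0, "contamination": 0.5},
    {"name": "bin_3", "completeness": 75.0, "contamination": 12.3},
]
print(filter_bins(example))
```

Only `bin_1` passes here: `bin_2` fails the completeness bound and `bin_3` fails the contamination bound. The thresholds are keyword arguments so that a different cutoff can be passed in without changing the function.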

C.3 Data Fusion Experiment

We define the feature data as the raw data, and obtain the projected data by projecting the feature data to a 32-dimensional space using PCA. We obtain the CLMB-encoded data by encoding the feature data to a 32-dimensional space with the deep contrastive learning framework. We assess the performance of these data by clustering them with the iterative medoid clustering and computing the benchmarking results. All the experiments on the CAMI2 datasets were run with default parameters with multi-split enabled, and the experiments on the MetaHIT datasets were run with default parameters with multi-split disabled. For comparison to other clustering methods, we use MiniBatchKMeans (n_clusters = 750, batch_size = 4096, max_iter = 25, init_size = 20000, reassignment_ratio = 0.02) and DBSCAN (eps = 0.35, min_samples = 2) implemented by scikit-learn.
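As a rough sketch of how the PCA projection and the two scikit-learn baselines could be set up with the parameters listed above: the wrapper function names, the `seed` argument, and the explicit `n_init=3` are assumptions added for reproducibility across scikit-learn versions, not settings reported in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MiniBatchKMeans
from sklearn.decomposition import PCA


def project_with_pca(features, n_components=32, seed=0):
    """Project the feature matrix (contigs x features) to 32 dimensions."""
    return PCA(n_components=n_components, random_state=seed).fit_transform(features)


def cluster_minibatch_kmeans(latent, n_clusters=750, seed=0):
    """Minibatch k-means with the parameters listed in the text."""
    model = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=4096,
        max_iter=25,
        init_size=20000,
        reassignment_ratio=0.02,
        n_init=3,           # assumption: fixed explicitly, not from the paper
        random_state=seed,  # assumption: fixed for reproducibility
    )
    return model.fit_predict(latent)


def cluster_dbscan(latent):
    """DBSCAN with the parameters listed in the text; label -1 marks outliers."""
    return DBSCAN(eps=0.35, min_samples=2).fit_predict(latent)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = rng.normal(size=(300, 40))  # synthetic stand-in for the feature data
    latent = project_with_pca(toy)
    # A toy matrix has far fewer than 750 points per cluster, so use a small k.
    print(np.bincount(cluster_minibatch_kmeans(latent, n_clusters=5)))
    print(set(cluster_dbscan(latent)))
```

The default `n_clusters=750` only makes sense for datasets with many more contigs than clusters, which is why the toy run overrides it.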

C.4 Binning of the Mother-Infant Transmission Dataset

We downloaded the sequencing datasets of selected mother-infant pairs (marked as 10001, 10002, 10003, 10005, 10006, 10007, 10008, 10009, 10015, 10019) using SRA Toolkit and filtered them based on quality using fastp [46]. Then, we assembled the short sequence reads into contigs using MEGAHIT [47, 48] and mapped the reads to the contigs using kallisto [49] in order to speed up this process for large datasets. The coabundance across samples can subsequently be calculated using the kallisto quantification algorithm. With the assemblies and coabundances, we ran CLMB with default parameters and multi-split enabled. Then, we split the fasta file into bins based on the clustering result using the create_fasta.py script. The lineage-specific workflow of CheckM [34] with default parameters was applied to the resulting bins to calculate the completeness and contamination, and only those with sufficient quality (\(completeness\ge 50\%\), \(contamination\le 5\%\)) were considered for further analysis. Then, we used GTDB-Tk [38] for taxonomic assignment of each bin and phylogeny inference. We visualized the tree with iTOL [50].
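The step of splitting the assembly into per-bin FASTA records can be sketched as follows. This is a hypothetical stand-in, not the create_fasta.py script itself: it assumes the clustering result is a two-column, tab-separated table with the cluster name first and the contig name second, and it works on in-memory strings rather than files.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier before the first whitespace.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_into_bins(fasta_text, cluster_tsv):
    """Return {cluster name: FASTA text} for all clustered contigs."""
    contig_to_bin = {}
    for row in cluster_tsv.splitlines():
        if row.strip():
            cluster, contig = row.split("\t")  # assumed column order
            contig_to_bin[contig.strip()] = cluster.strip()
    bins = defaultdict(list)
    for name, seq in parse_fasta(fasta_text):
        if name in contig_to_bin:  # contigs without a cluster are skipped
            bins[contig_to_bin[name]].append((name, seq))
    return {b: "".join(f">{n}\n{s}\n" for n, s in recs) for b, recs in bins.items()}
```

Each value of the returned dictionary can be written to its own file, and the resulting files are the bins passed on to quality assessment.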

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Zhang, P., Jiang, Z., Wang, Y., Li, Y. (2022). CLMB: Deep Contrastive Learning for Robust Metagenomic Binning. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_23

  • DOI: https://doi.org/10.1007/978-3-031-04749-7_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04748-0

  • Online ISBN: 978-3-031-04749-7

  • eBook Packages: Computer Science, Computer Science (R0)
