Abstract
Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities1,2. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database3. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
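The filtering and grouping steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract states only that the clustering is "massively parallel" and "graph-based", so this sketch swaps in plain connected components over a similarity graph (via union-find) as a stand-in. The function name `cluster_novel_proteins`, its inputs (`lengths`, `reference_hits`, `similar_pairs`) and the toy identifiers are hypothetical; only the two thresholds come from the abstract.

```python
from collections import defaultdict

MIN_LENGTH = 35         # abstract: sequences "longer than 35 amino acids"
MIN_CLUSTER_SIZE = 100  # abstract: clusters "with more than 100 members"


def cluster_novel_proteins(lengths, reference_hits, similar_pairs,
                           min_length=MIN_LENGTH, min_size=MIN_CLUSTER_SIZE):
    """Group reference-free proteins into clusters (illustrative only).

    lengths        -- dict: protein id -> length in amino acids (hypothetical input)
    reference_hits -- set of ids with similarity to a reference sequence
    similar_pairs  -- iterable of (id_a, id_b) pairs judged similar to each other
    """
    # Step 1: keep sequences that are long enough and have no reference hit.
    kept = {pid for pid, n in lengths.items()
            if n > min_length and pid not in reference_hits}

    # Step 2: connected components via union-find, an assumed stand-in
    # for the graph-based clustering mentioned in the abstract.
    parent = {pid: pid for pid in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        if a in kept and b in kept:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = defaultdict(set)
    for pid in kept:
        clusters[find(pid)].add(pid)

    # Step 3: report only clusters with more than min_size members.
    return [members for members in clusters.values() if len(members) > min_size]


if __name__ == "__main__":
    toy_lengths = {"a": 50, "b": 60, "c": 40, "d": 30, "e": 80, "f": 90}
    toy_pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
    # A small min_size is used here so the toy data yields a cluster.
    print(cluster_novel_proteins(toy_lengths, {"e"}, toy_pairs, min_size=2))
```

In the toy run, `d` is dropped for length and `e` for its reference hit, so the only cluster large enough to report contains `a`, `b` and `c`. Connected components merge anything linked by a chain of similarities, which is why real protein-family pipelines typically use a more selective graph clustering method.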
Original language | English (US) |
---|---|
Pages (from-to) | 594-602 |
Number of pages | 9 |
Journal | Nature |
Volume | 622 |
Issue number | 7983 |
DOIs | 10.1038/s41586-023-06583-7 |
State | Published - Oct 19 2023 |
All Science Journal Classification (ASJC) codes
- General
In: Nature, Vol. 622, No. 7983, 19.10.2023, p. 594-602.
Research output: Contribution to journal › Article › peer-review
TY - JOUR
T1 - Unraveling the functional dark matter through global metagenomics
AU - Pavlopoulos, Georgios A.
AU - Baltoumas, Fotis A.
AU - Liu, Sirui
AU - Selvitopi, Oguz
AU - Camargo, Antonio Pedro
AU - Nayfach, Stephen
AU - Azad, Ariful
AU - Roux, Simon
AU - Call, Lee
AU - Ivanova, Natalia N.
AU - Chen, I. Min
AU - Paez-Espino, David
AU - Karatzas, Evangelos
AU - Acinas, Silvia G.
AU - Ahlgren, Nathan
AU - Attwood, Graeme
AU - Baldrian, Petr
AU - Berry, Timothy
AU - Bhatnagar, Jennifer M.
AU - Bhaya, Devaki
AU - Bidle, Kay D.
AU - Blanchard, Jeffrey L.
AU - Boyd, Eric S.
AU - Bowen, Jennifer L.
AU - Bowman, Jeff
AU - Brawley, Susan H.
AU - Brodie, Eoin L.
AU - Brune, Andreas
AU - Bryant, Donald A.
AU - Buchan, Alison
AU - Cadillo-Quiroz, Hinsby
AU - Campbell, Barbara J.
AU - Cavicchioli, Ricardo
AU - Chuckran, Peter F.
AU - Coleman, Maureen
AU - Crowe, Sean
AU - Colman, Daniel R.
AU - Currie, Cameron R.
AU - Dangl, Jeff
AU - Delherbe, Nathalie
AU - Denef, Vincent J.
AU - Dijkstra, Paul
AU - Distel, Daniel D.
AU - Eloe-Fadrosh, Emiley
AU - Fisher, Kirsten
AU - Francis, Christopher
AU - Garoutte, Aaron
AU - Gaudin, Amelie
AU - Gerwick, Lena
AU - Godoy-Vitorino, Filipa
AU - Guerra, Peter
AU - Guo, Jiarong
AU - Habteselassie, Mussie Y.
AU - Hallam, Steven J.
AU - Hatzenpichler, Roland
AU - Hentschel, Ute
AU - Hess, Matthias
AU - Hirsch, Ann M.
AU - Hug, Laura A.
AU - Hultman, Jenni
AU - Hunt, Dana E.
AU - Huntemann, Marcel
AU - Inskeep, William P.
AU - James, Timothy Y.
AU - Jansson, Janet
AU - Johnston, Eric R.
AU - Kalyuzhnaya, Marina
AU - Kelly, Charlene N.
AU - Kelly, Robert M.
AU - Klassen, Jonathan L.
AU - Nüsslein, Klaus
AU - Kostka, Joel E.
AU - Lindow, Steven
AU - Lilleskov, Erik
AU - Lynes, Mackenzie
AU - Mackelprang, Rachel
AU - Martin, Francis M.
AU - Mason, Olivia U.
AU - McKay, R. Michael
AU - McMahon, Katherine
AU - Mead, David A.
AU - Medina, Monica
AU - Meredith, Laura K.
AU - Mock, Thomas
AU - Mohn, William W.
AU - Moran, Mary Ann
AU - Murray, Alison
AU - Neufeld, Josh D.
AU - Neumann, Rebecca
AU - Norton, Jeanette M.
AU - Partida-Martinez, Laila P.
AU - Pietrasiak, Nicole
AU - Pelletier, Dale
AU - Reddy, T. B.K.
AU - Reese, Brandi Kiel
AU - Reichart, Nicholas J.
AU - Reiss, Rebecca
AU - Saito, Mak A.
AU - Schachtman, Daniel P.
AU - Seshadri, Rekha
AU - Shade, Ashley
AU - Sherman, David
AU - Simister, Rachel
AU - Simon, Holly
AU - Stegen, James
AU - Stepanauskas, Ramunas
AU - Sullivan, Matthew
AU - Sumner, Dawn Y.
AU - Teeling, Hanno
AU - Thamatrakoln, Kimberlee
AU - Treseder, Kathleen
AU - Tringe, Susannah
AU - Vaishampayan, Parag
AU - Valentine, David L.
AU - Waldo, Nicholas B.
AU - Waldrop, Mark P.
AU - Walsh, David A.
AU - Ward, David M.
AU - Wilkins, Michael
AU - Whitman, Thea
AU - Woolet, Jamie
AU - Woyke, Tanja
AU - Iliopoulos, Ioannis
AU - Konstantinidis, Konstantinos
AU - Tiedje, James M.
AU - Pett-Ridge, Jennifer
AU - Baker, David
AU - Visel, Axel
AU - Ouzounis, Christos A.
AU - Ovchinnikov, Sergey
AU - Buluç, Aydin
AU - Kyrpides, Nikos C.
N1 - Publisher Copyright: © 2023, The Author(s).
PY - 2023/10/19
Y1 - 2023/10/19
AB - Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities1,2. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database3. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
UR - http://www.scopus.com/inward/record.url?scp=85173864812&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85173864812&partnerID=8YFLogxK
U2 - 10.1038/s41586-023-06583-7
DO - 10.1038/s41586-023-06583-7
M3 - Article
C2 - 37821698
AN - SCOPUS:85173864812
SN - 0028-0836
VL - 622
SP - 594
EP - 602
JO - Nature
JF - Nature
IS - 7983
ER -