TY - JOUR
T1 - The determinants of the rarity of nucleic and peptide short sequences in nature
AU - Chantzi, Nikol
AU - Mareboina, Manvita
AU - Konnaris, Maxwell A.
AU - Montgomery, Austin
AU - Patsakis, Michail
AU - Mouratidis, Ioannis
AU - Georgakopoulos-Soares, Ilias
N1 - Publisher Copyright: © 2024 The Author(s). Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
PY - 2024/6/1
Y1 - 2024/6/1
N2 - The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
UR - http://www.scopus.com/inward/record.url?scp=85189695174&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85189695174&partnerID=8YFLogxK
U2 - 10.1093/nargab/lqae029
DO - 10.1093/nargab/lqae029
M3 - Article
C2 - 38584871
AN - SCOPUS:85189695174
SN - 2631-9268
VL - 6
JO - NAR Genomics and Bioinformatics
JF - NAR Genomics and Bioinformatics
IS - 2
M1 - lqae029
ER -