The determinants of the rarity of nucleic and peptide short sequences in nature

Nikol Chantzi, Manvita Mareboina, Maxwell A. Konnaris, Austin Montgomery, Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares

Research output: Contribution to journalArticlepeer-review


The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R2 performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R2 performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.

Original languageEnglish (US)
Article numberlqae029
JournalNAR Genomics and Bioinformatics
Issue number2
StatePublished - Jun 1 2024

All Science Journal Classification (ASJC) codes

  • Structural Biology
  • Molecular Biology
  • Genetics
  • Computer Science Applications
  • Applied Mathematics

Cite this