Harnessing the Power of Kmers: Concepts and Methods for Genomic and Proteomic Research

    Project: Research project

    Project Details

    Description

    PROJECT SUMMARY / ABSTRACT

    In the modern era of genomics and proteomics, the vast amounts of biological data being generated present both a challenge and an opportunity. Central to this proposal is the innovative use of kmers, short nucleic acid or peptide sequences, as a tool to navigate and interpret this data. Kmers are used in a variety of genomics and proteomics applications, including genome assembly and alignment, genomic variant detection, and metagenomics. With the continued advancement of sequencing technology, kmers are poised to play an increasingly important role in research. Quasi-primes are kmers found in only a single species. We have recently developed algorithms to efficiently identify quasi-primes across every available genome and proteome. In humans, quasi-prime loci are primarily found in brain-expressed genes associated with cognition and are enriched for quantitative trait loci, indicating their significance in the development of species-specific traits. Over the next five years, we will examine quasi-primes in populations of diverse ancestries, in archaic hominins, and in primate and mammalian evolution to improve our understanding of their functional and evolutionary significance. Additionally, we will leverage our expertise in performing large-scale analyses to expand upon our findings and characterize the functions of these kmers across every sequenced organism and taxonomic group. This will allow us to investigate the underlying mechanisms that enable species to develop new traits and adapt to their environment. The composition of organismal genomes depends on a variety of factors, including genome size, genomic instability, and biological processes such as transcription and translation. We aim to investigate how these factors shape the composition of genomes in every species and across all taxonomic groups.

    We will integrate different types of genomic and proteomic data, including kmer frequency profiles, codon usage tables, and transcription and translation annotations. Our goal is to deconvolute the relative contributions of the different factors shaping the composition and evolution of organismal genomes. Building on this, we plan to incorporate these findings into generative artificial intelligence models to create improved simulated genomes, which will have significant applications as synthetic controls for bioinformatics analyses. Finally, we will provide well-documented, open-source software tools and integrate the data from our projects into accessible databases aligned with the FAIR principles. In doing so, we aim not only to advance research in our specific areas of focus but also to equip other researchers with tools and datasets they can use in their own domains of expertise. In summary, our multifaceted approach seeks to harness the power of kmers in genomics and proteomics, delve into the intricacies of evolutionary processes, and provide the scientific community with computational resources, fostering collaboration and innovation in basic and biomedical research.
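
    To make the quasi-prime definition above concrete (a kmer found in only a single species), the following is a minimal Python sketch. It is not the project's algorithm, which the abstract does not describe; it is a naive in-memory illustration. The function names, the toy sequences, and the choice of k = 3 are all hypothetical, and for DNA a real analysis would likely also need to account for reverse complements, which this sketch ignores.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def quasi_primes(sequences_by_species, k):
    """Map each species to the kmers seen in that species and in no other.

    sequences_by_species: dict of species name -> list of sequences
    (hypothetical input layout, chosen for illustration only).
    """
    # Record, for every kmer, the set of species in which it occurs.
    owners = defaultdict(set)
    for species, seqs in sequences_by_species.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                owners[kmer].add(species)

    # Keep only kmers whose owner set contains exactly one species.
    unique = defaultdict(set)
    for kmer, species_set in owners.items():
        if len(species_set) == 1:
            (species,) = species_set
            unique[species].add(kmer)
    return dict(unique)


# Toy input: two made-up "species" with one short sequence each.
toy = {
    "species_a": ["ACGTAC"],
    "species_b": ["CGTTAC"],
}
result = quasi_primes(toy, 3)
```

    In this toy input, the 3-mers CGT and TAC occur in both sequences and are therefore excluded, while the remaining 3-mers are each assigned to the one species that contains them. Holding every kmer in memory this way would not scale to all available genomes and proteomes; it only illustrates the definition.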
    Status: Active
    Effective start/end date: 9/1/24 to 6/30/25

    Funding

    • National Institute of General Medical Sciences: $415,142.00
