Hermes: Algorithm-System Co-design for Efficient Retrieval-Augmented Generation At-Scale

  • Michael Shen
  • Muhammad Umar
  • Kiwan Maeng
  • G. Edward Suh
  • Udit Gupta

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The rapid advancement of Large Language Models (LLMs) as well as the constantly expanding amount of data make keeping the latest models constantly up-to-date a challenge. The high computational cost required to constantly retrain models to handle evolving data has led to the development of Retrieval-Augmented Generation (RAG). RAG presents a promising solution that enables LLMs to access and incorporate real-time information from external datastores, thus minimizing the need for retraining to update the information available to an LLM. However, as the RAG datastores used to augment information expand into the range of trillions of tokens, retrieval overheads become significant, impacting latency, throughput, and energy efficiency. To address this, we propose Hermes, an algorithm-systems co-design framework that addresses the unique bottlenecks of large-scale RAG systems. Hermes mitigates retrieval latency by partitioning and distributing datastores across multiple nodes, while also enhancing throughput and energy efficiency through an intelligent hierarchical search that dynamically directs queries to optimized subsets of the datastore. On open-source RAG datastores and models, we demonstrate Hermes optimizes end-to-end latency and energy by up to 9.33× and 2.10×, without sacrificing retrieval quality for at-scale trillion token retrieval datastores.
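
As a rough illustration of the general idea of searching a partitioned datastore hierarchically, the sketch below routes a query to the most similar partitions first and then searches only inside them. It is not the Hermes implementation: the centroid-based routing, the function names (`build_partitions`, `hierarchical_search`), the `n_probe` parameter, and the toy data are all assumptions made for this example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def build_partitions(docs, centroids):
    """Assign each document to its most similar centroid (one partition per centroid)."""
    parts = [[] for _ in centroids]
    for doc_id, vec in docs.items():
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        parts[best].append(doc_id)
    return parts

def hierarchical_search(query, docs, centroids, parts, n_probe=1, k=3):
    """Two-stage search: pick n_probe partitions, then rank only their documents."""
    # Stage 1 (assumed routing rule): rank partitions by centroid similarity.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    probed = ranked[:n_probe]
    # Stage 2: exhaustive similarity search restricted to the probed partitions.
    candidates = [d for c in probed for d in parts[c]]
    top = sorted(candidates, key=lambda d: dot(query, docs[d]), reverse=True)[:k]
    return probed, top

# Toy datastore: 30 unit vectors scattered around three hypothetical centroids.
random.seed(0)
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
docs = {}
for i in range(30):
    base = centroids[i % 3]
    docs[i] = normalize([x + random.uniform(-0.2, 0.2) for x in base])
parts = build_partitions(docs, centroids)

probed, top = hierarchical_search(centroids[0], docs, centroids, parts, n_probe=1, k=3)
print("probed partitions:", probed)
print("top documents:", top)
```

In a distributed setting each partition would plausibly live on a different node, so probing fewer partitions means fewer nodes do work per query; raising `n_probe` trades that saving for recall.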

Original language: English (US)
Title of host publication: ISCA 2025 - Proceedings of the 52nd Annual International Symposium on Computer Architecture
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 958-973
Number of pages: 16
ISBN (Electronic): 9798400712616
DOIs
State: Published - Jun 21 2025
Event: 52nd Annual International Symposium on Computer Architecture, ISCA 2025 - Tokyo, Japan
Duration: Jun 21 2025 to Jun 25 2025

Publication series

Name: Proceedings - International Symposium on Computer Architecture
ISSN (Print): 1063-6897
ISSN (Electronic): 2575-713X

Conference

Conference: 52nd Annual International Symposium on Computer Architecture, ISCA 2025
Country/Territory: Japan
City: Tokyo
Period: 6/21/25 to 6/25/25

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
