TY - GEN
T1 - Hermes
T2 - 52nd Annual International Symposium on Computer Architecture, ISCA 2025
AU - Shen, Michael
AU - Umar, Muhammad
AU - Maeng, Kiwan
AU - Suh, G. Edward
AU - Gupta, Udit
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/6/21
Y1 - 2025/6/21
N2 - The rapid advancement of Large Language Models (LLMs), together with the constantly expanding amount of data, makes keeping models up-to-date a challenge. The high computational cost of constantly retraining models to handle evolving data has led to the development of Retrieval-Augmented Generation (RAG). RAG presents a promising solution that enables LLMs to access and incorporate real-time information from external datastores, thus minimizing the need for retraining to update the information available to an LLM. However, as the RAG datastores used to augment information expand into the range of trillions of tokens, retrieval overheads become significant, impacting latency, throughput, and energy efficiency. To address this, we propose Hermes, an algorithm-systems co-design framework that addresses the unique bottlenecks of large-scale RAG systems. Hermes mitigates retrieval latency by partitioning and distributing datastores across multiple nodes, while also enhancing throughput and energy efficiency through an intelligent hierarchical search that dynamically directs queries to optimized subsets of the datastore. On open-source RAG datastores and models, we demonstrate that Hermes improves end-to-end latency and energy by up to 9.33× and 2.10×, respectively, without sacrificing retrieval quality for at-scale trillion-token retrieval datastores.
AB - The rapid advancement of Large Language Models (LLMs), together with the constantly expanding amount of data, makes keeping models up-to-date a challenge. The high computational cost of constantly retraining models to handle evolving data has led to the development of Retrieval-Augmented Generation (RAG). RAG presents a promising solution that enables LLMs to access and incorporate real-time information from external datastores, thus minimizing the need for retraining to update the information available to an LLM. However, as the RAG datastores used to augment information expand into the range of trillions of tokens, retrieval overheads become significant, impacting latency, throughput, and energy efficiency. To address this, we propose Hermes, an algorithm-systems co-design framework that addresses the unique bottlenecks of large-scale RAG systems. Hermes mitigates retrieval latency by partitioning and distributing datastores across multiple nodes, while also enhancing throughput and energy efficiency through an intelligent hierarchical search that dynamically directs queries to optimized subsets of the datastore. On open-source RAG datastores and models, we demonstrate that Hermes improves end-to-end latency and energy by up to 9.33× and 2.10×, respectively, without sacrificing retrieval quality for at-scale trillion-token retrieval datastores.
UR - https://www.scopus.com/pages/publications/105009603358
UR - https://www.scopus.com/pages/publications/105009603358#tab=citedBy
U2 - 10.1145/3695053.3731076
DO - 10.1145/3695053.3731076
M3 - Conference contribution
AN - SCOPUS:105009603358
T3 - Proceedings - International Symposium on Computer Architecture
SP - 958
EP - 973
BT - ISCA 2025 - Proceedings of the 52nd Annual International Symposium on Computer Architecture
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 June 2025 through 25 June 2025
ER -