TY - GEN
T1 - PRIMATE
T2 - 29th Asia and South Pacific Design Automation Conference, ASP-DAC 2024
AU - Pan, Yue
AU - Zhou, Minxuan
AU - Lee, Chonghan
AU - Li, Zheyu
AU - Kushwah, Rishika
AU - Narayanan, Vijaykrishnan
AU - Rosing, Tajana
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Attention-based models such as Transformers represent the state of the art for various machine learning (ML) tasks. Their superior performance, however, comes at the cost of substantial memory requirements and limited data-reuse opportunities. Processing in Memory (PIM) is a promising approach to accelerating Transformer models owing to its massive parallelism, low data-movement costs, and high memory-bandwidth utilization. Yet existing PIM accelerators lack support for algorithmic optimizations such as dynamic token pruning, which can significantly improve the efficiency of Transformers. We identify two challenges to enabling dynamic token pruning on PIM-based architectures: the lack of an in-memory top-k token-selection mechanism and the memory underutilization that pruning introduces. To address these challenges, we propose PRIMATE, a software-hardware co-designed PIM framework based on High Bandwidth Memory (HBM). We make minor hardware modifications to conventional HBM to enable Transformer model computation and top-k selection. On the software side, we introduce a pipelined mapping scheme and an optimization framework that maximize throughput and efficiency. PRIMATE achieves a 30.6× improvement in throughput, a 29.5× improvement in space efficiency, and 4.3× better energy efficiency compared with the current state-of-the-art PIM accelerator for Transformers.
AB - Attention-based models such as Transformers represent the state of the art for various machine learning (ML) tasks. Their superior performance, however, comes at the cost of substantial memory requirements and limited data-reuse opportunities. Processing in Memory (PIM) is a promising approach to accelerating Transformer models owing to its massive parallelism, low data-movement costs, and high memory-bandwidth utilization. Yet existing PIM accelerators lack support for algorithmic optimizations such as dynamic token pruning, which can significantly improve the efficiency of Transformers. We identify two challenges to enabling dynamic token pruning on PIM-based architectures: the lack of an in-memory top-k token-selection mechanism and the memory underutilization that pruning introduces. To address these challenges, we propose PRIMATE, a software-hardware co-designed PIM framework based on High Bandwidth Memory (HBM). We make minor hardware modifications to conventional HBM to enable Transformer model computation and top-k selection. On the software side, we introduce a pipelined mapping scheme and an optimization framework that maximize throughput and efficiency. PRIMATE achieves a 30.6× improvement in throughput, a 29.5× improvement in space efficiency, and 4.3× better energy efficiency compared with the current state-of-the-art PIM accelerator for Transformers.
UR - http://www.scopus.com/inward/record.url?scp=85189361079&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85189361079&partnerID=8YFLogxK
U2 - 10.1109/ASP-DAC58780.2024.10473968
DO - 10.1109/ASP-DAC58780.2024.10473968
M3 - Conference contribution
AN - SCOPUS:85189361079
T3 - Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC
SP - 557
EP - 563
BT - ASP-DAC 2024 - 29th Asia and South Pacific Design Automation Conference, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 January 2024 through 25 January 2024
ER -