TY - GEN
T1 - An architecture interface and offload model for low-overhead, near-data, distributed accelerators
AU - Baskaran, Saambhavi
AU - Kandemir, Mahmut Taylan
AU - Sampson, John Morgan
N1 - Funding Information:
The authors would like to thank the anonymous reviewers and shepherd for their insightful comments and suggestions. We also thank Adithya Kumar and Dr. Kanchana Bhaaskaran for providing useful feedback on early drafts. This work was funded in part by NSF awards #1822923, #1763681, #2211018, #1931531, #2028929 and #2008398.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - The performance and energy costs of coordinating and performing data movement have led to proposals adding compute units and/or specialized access units to the memory hierarchy. However, current on-chip offload models are restricted to fixed compute and access pattern types, which limits software-driven optimizations and the applicability of such an offload interface to heterogeneous accelerator resources. This paper presents a computation offload interface for multi-core systems augmented with distributed on-chip accelerators. With energy-efficiency as the primary goal, we define mechanisms to identify offload partitioning, create a low-overhead execution model to sequence these fine-grained operations, and evaluate a set of workloads to identify the complexity needed to achieve distributed near-data execution. We demonstrate that our model and interface, combining features of dataflow in parallel with near-data processing engines, can be profitably applied to memory hierarchies augmented with either specialized compute substrates or lightweight near-memory cores. We differentiate the benefits stemming from each of elevating data access semantics, near-data computation, inter-accelerator coordination, and compute/access logic specialization. Experimental results indicate a geometric mean (energy efficiency improvement; speedup; data movement reduction) of (3.3; 1.59; 2.4)×, (2.46; 1.43; 3.5)×, and (1.46; 1.65; 1.48)× compared to an out-of-order processor, a monolithic accelerator with centralized accesses, and a monolithic accelerator with decentralized accesses, respectively. Evaluating both lightweight core and CGRA fabric implementations highlights model flexibility and quantifies the benefits of compute specialization for energy efficiency and speedup at 1.23× and 1.43×, respectively.
AB - The performance and energy costs of coordinating and performing data movement have led to proposals adding compute units and/or specialized access units to the memory hierarchy. However, current on-chip offload models are restricted to fixed compute and access pattern types, which limits software-driven optimizations and the applicability of such an offload interface to heterogeneous accelerator resources. This paper presents a computation offload interface for multi-core systems augmented with distributed on-chip accelerators. With energy-efficiency as the primary goal, we define mechanisms to identify offload partitioning, create a low-overhead execution model to sequence these fine-grained operations, and evaluate a set of workloads to identify the complexity needed to achieve distributed near-data execution. We demonstrate that our model and interface, combining features of dataflow in parallel with near-data processing engines, can be profitably applied to memory hierarchies augmented with either specialized compute substrates or lightweight near-memory cores. We differentiate the benefits stemming from each of elevating data access semantics, near-data computation, inter-accelerator coordination, and compute/access logic specialization. Experimental results indicate a geometric mean (energy efficiency improvement; speedup; data movement reduction) of (3.3; 1.59; 2.4)×, (2.46; 1.43; 3.5)×, and (1.46; 1.65; 1.48)× compared to an out-of-order processor, a monolithic accelerator with centralized accesses, and a monolithic accelerator with decentralized accesses, respectively. Evaluating both lightweight core and CGRA fabric implementations highlights model flexibility and quantifies the benefits of compute specialization for energy efficiency and speedup at 1.23× and 1.43×, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85141672926&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141672926&partnerID=8YFLogxK
U2 - 10.1109/MICRO56248.2022.00083
DO - 10.1109/MICRO56248.2022.00083
M3 - Conference contribution
AN - SCOPUS:85141672926
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 1160
EP - 1177
BT - Proceedings - 2022 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022
PB - IEEE Computer Society
T2 - 55th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2022
Y2 - 1 October 2022 through 5 October 2022
ER -