TY - GEN
T1 - Profiling Hyperscale Big Data Processing
AU - Gonzalez, Abraham
AU - Liu, Sihang
AU - Chang, Jichuan
AU - Kolli, Aasheesh
AU - Dadu, Vidushi
AU - Asanović, Krste
AU - Khan, Samira
AU - Karandikar, Sagar
AU - Ranganathan, Parthasarathy
N1 - Funding Information:
This research was supported by the SLICE Lab industrial sponsors and affiliates and by the NSF CCRI ENS Chipyard Award #2016662. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
Publisher Copyright:
© 2023 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2023/6/17
Y1 - 2023/6/17
N2 - Computing demand continues to grow exponentially, largely driven by “big data” processing on hyperscale data stores. At the same time, the slowdown in Moore’s law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and to understand the opportunity for acceleration at scale. This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities, and we perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies. We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers. Therefore, optimizing storage and remote work, in addition to compute acceleration, is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify the dominant data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit, but collectively, a sea of accelerators can accelerate many of these smaller platform-specific functions. We demonstrate the potential gains of this sea-of-accelerators proposal in a limits study and an analytical model. We perform a comprehensive study of the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). We propose and evaluate a chained accelerator execution model in which identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching the performance of an ideal fully asynchronous execution model.
AB - Computing demand continues to grow exponentially, largely driven by “big data” processing on hyperscale data stores. At the same time, the slowdown in Moore’s law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and to understand the opportunity for acceleration at scale. This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities, and we perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies. We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers. Therefore, optimizing storage and remote work, in addition to compute acceleration, is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify the dominant data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit, but collectively, a sea of accelerators can accelerate many of these smaller platform-specific functions. We demonstrate the potential gains of this sea-of-accelerators proposal in a limits study and an analytical model. We perform a comprehensive study of the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). We propose and evaluate a chained accelerator execution model in which identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching the performance of an ideal fully asynchronous execution model.
UR - http://www.scopus.com/inward/record.url?scp=85168857830&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168857830&partnerID=8YFLogxK
U2 - 10.1145/3579371.3589082
DO - 10.1145/3579371.3589082
M3 - Conference contribution
AN - SCOPUS:85168857830
T3 - Proceedings - International Symposium on Computer Architecture
SP - 660
EP - 675
BT - ISCA 2023 - Proceedings of the 50th Annual International Symposium on Computer Architecture
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 50th Annual International Symposium on Computer Architecture, ISCA 2023
Y2 - 17 June 2023 through 21 June 2023
ER -