TY - JOUR
T1 - Mix and Match
T2 - Reorganizing Tasks for Enhancing Data Locality
AU - Tang, Xulong
AU - Kandemir, Mahmut Taylan
AU - Karakoy, Mustafa
N1 - Funding Information:
This work is supported in part by NSF grants #1908793, #1629915, #1629129, #1763681, #2028929, #2008398, #2011146, and #1931531, as well as a startup funding from the University of Pittsburgh.
Publisher Copyright:
© 2021 ACM.
PY - 2021/6
Y1 - 2021/6
N2 - Application programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper, we target task-based programs (which cover a large set of available multithreaded programs), and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The detailed experiments show that, when targeting a state of the art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations for all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to maximum savings that could be achieved with perfect profiling information. Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
AB - Application programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper, we target task-based programs (which cover a large set of available multithreaded programs), and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The detailed experiments show that, when targeting a state of the art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations for all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to maximum savings that could be achieved with perfect profiling information. Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
UR - http://www.scopus.com/inward/record.url?scp=85108005838&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108005838&partnerID=8YFLogxK
U2 - 10.1145/3460087
DO - 10.1145/3460087
M3 - Article
AN - SCOPUS:85108005838
SN - 2476-1249
VL - 5
JO - Proceedings of the ACM on Measurement and Analysis of Computing Systems
JF - Proceedings of the ACM on Measurement and Analysis of Computing Systems
IS - 2
M1 - 20
ER -