TY - GEN
T1 - Fine-granular computation and data layout reorganization for improving locality
AU - Kandemir, Mahmut
AU - Tang, Xulong
AU - Kotra, Jagadish
AU - Karakoy, Mustafa
N1 - Publisher Copyright:
© 2022 Association for Computing Machinery.
PY - 2022/10/30
Y1 - 2022/10/30
N2 - While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarsegrain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local by their nature and difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves the data locality and outperforms existing schemes - 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
AB - While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarsegrain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local by their nature and difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves the data locality and outperforms existing schemes - 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
UR - https://www.scopus.com/pages/publications/85145665934
UR - https://www.scopus.com/pages/publications/85145665934#tab=citedBy
U2 - 10.1145/3508352.3549386
DO - 10.1145/3508352.3549386
M3 - Conference contribution
AN - SCOPUS:85145665934
T3 - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
BT - Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 41st IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2022
Y2 - 30 October 2022 through 4 November 2022
ER -