TY - JOUR
T1 - Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines
AU - Kandemir, M.
AU - Ramanujam, J.
AU - Choudhary, A.
N1 - Funding Information:
This work is supported in part by NSF Young Investigator Award CCR-9357840 and NSF Grant CCR-9509143. The work of J. Ramanujam is supported in part by NSF Young Investigator Award CCR-9457768 and by NSF Grant CCR-9210422.
Copyright:
Copyright 2004 Elsevier Science B.V., Amsterdam. All rights reserved.
PY - 2000/8
Y1 - 2000/8
N2 - Distributed-memory message-passing machines deliver scalable performance but are difficult to program. Shared-memory machines, on the other hand, are easier to program, but obtaining scalable performance with a large number of processors is difficult. Recently, scalable machines based on logically shared, physically distributed memory have been designed and implemented. While some performance issues, such as parallelism and locality, are common to different parallel architectures, issues such as data distribution are unique to specific architectures. One of the most important challenges compiler writers face is the design of compilation techniques that work well on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the iterations of the outermost loops can be run in parallel across processors; (2) optimizes memory locality by carefully distributing each array across processors; (3) optimizes interprocessor communication using message vectorization whenever possible; and (4) optimizes cache locality by assigning an appropriate storage layout to each array and by transforming the iteration space. Depending on the machine architecture, some or all of these steps can be applied in a unified framework. We report empirical results on an SGI Origin 2000 distributed-shared-memory multiprocessor and an IBM SP-2 distributed-memory message-passing machine to validate the effectiveness of our approach.
AB - Distributed-memory message-passing machines deliver scalable performance but are difficult to program. Shared-memory machines, on the other hand, are easier to program, but obtaining scalable performance with a large number of processors is difficult. Recently, scalable machines based on logically shared, physically distributed memory have been designed and implemented. While some performance issues, such as parallelism and locality, are common to different parallel architectures, issues such as data distribution are unique to specific architectures. One of the most important challenges compiler writers face is the design of compilation techniques that work well on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the iterations of the outermost loops can be run in parallel across processors; (2) optimizes memory locality by carefully distributing each array across processors; (3) optimizes interprocessor communication using message vectorization whenever possible; and (4) optimizes cache locality by assigning an appropriate storage layout to each array and by transforming the iteration space. Depending on the machine architecture, some or all of these steps can be applied in a unified framework. We report empirical results on an SGI Origin 2000 distributed-shared-memory multiprocessor and an IBM SP-2 distributed-memory message-passing machine to validate the effectiveness of our approach.
UR - https://www.scopus.com/pages/publications/0000563616
UR - https://www.scopus.com/inward/citedby.url?scp=0000563616&partnerID=8YFLogxK
U2 - 10.1006/jpdc.2000.1639
DO - 10.1006/jpdc.2000.1639
M3 - Article
AN - SCOPUS:0000563616
SN - 0743-7315
VL - 60
SP - 924
EP - 965
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
IS - 8
ER -