TY - JOUR
T1 - Reducing false sharing and improving spatial locality in a unified compilation framework
AU - Kandemir, Mahmut
AU - Choudhary, Alok
AU - Ramanujam, J.
AU - Banerjee, Prith
N1 - Funding Information:
The authors would like to thank the anonymous referees for providing helpful comments. The material presented in this paper is based on research supported in part by the US National Science Foundation grants CCR-9357840 and CCR-9509143, and the Air Force Materials Command under contract F30602-97-C-0026. P. Banerjee is supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract F30602-98-2-0144. J. Ramanujam is supported in part by a US National Science Foundation Young Investigator Award CCR-9457768, a US National Science Foundation grant CCR-0073800, and a US National Science Foundation Information Technology Research grant CHE-0121706.
PY - 2003/4
Y1 - 2003/4
N2 - The performance of applications on large shared-memory multiprocessors with coherent caches depends on the interaction between the granularity of data sharing, the size of the coherence unit, and the spatial locality exhibited by the applications, in addition to the amount of parallelism in the applications. Large coherence units are helpful in exploiting spatial locality, but worsen the effects of false sharing. A mathematical framework that allows a clean description of the relationship between spatial locality and false sharing is derived in this paper. First, a technique to identify a severe form of multiple-writer false sharing is presented. The importance of the interaction between optimization techniques aimed at enhancing locality and the techniques oriented toward reducing false sharing is then demonstrated. Given the conflicting requirements, a compiler-based approach to this problem holds promise. This paper investigates the use of data transformations in addressing spatial locality and false sharing, and derives an approach that balances the impact of the two. Experimental results demonstrate that such a balanced approach outperforms those approaches that consider only one of these two issues. On an eight-processor SGI/Cray Origin 2000 multiprocessor, our approach brings an additional 9 percent improvement over a powerful locality optimization technique that uses both loop and data transformations. Also, the presented approach obtains an additional 19 percent improvement over an optimization technique that is oriented specifically toward reducing false sharing. This study also reveals that, in addition to reducing synchronization costs and improving the memory subsystem performance, obtaining large granularity parallelism is helpful in balancing the effects of enhancing locality and reducing false sharing, rendering them compatible.
AB - The performance of applications on large shared-memory multiprocessors with coherent caches depends on the interaction between the granularity of data sharing, the size of the coherence unit, and the spatial locality exhibited by the applications, in addition to the amount of parallelism in the applications. Large coherence units are helpful in exploiting spatial locality, but worsen the effects of false sharing. A mathematical framework that allows a clean description of the relationship between spatial locality and false sharing is derived in this paper. First, a technique to identify a severe form of multiple-writer false sharing is presented. The importance of the interaction between optimization techniques aimed at enhancing locality and the techniques oriented toward reducing false sharing is then demonstrated. Given the conflicting requirements, a compiler-based approach to this problem holds promise. This paper investigates the use of data transformations in addressing spatial locality and false sharing, and derives an approach that balances the impact of the two. Experimental results demonstrate that such a balanced approach outperforms those approaches that consider only one of these two issues. On an eight-processor SGI/Cray Origin 2000 multiprocessor, our approach brings an additional 9 percent improvement over a powerful locality optimization technique that uses both loop and data transformations. Also, the presented approach obtains an additional 19 percent improvement over an optimization technique that is oriented specifically toward reducing false sharing. This study also reveals that, in addition to reducing synchronization costs and improving the memory subsystem performance, obtaining large granularity parallelism is helpful in balancing the effects of enhancing locality and reducing false sharing, rendering them compatible.
UR - http://www.scopus.com/inward/record.url?scp=0038633597&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0038633597&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2003.1195407
DO - 10.1109/TPDS.2003.1195407
M3 - Article
AN - SCOPUS:0038633597
SN - 1045-9219
VL - 14
SP - 337
EP - 354
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 4
ER -