TY - GEN
T1 - Locality-aware dynamic mapping for multithreaded applications
AU - Demiroz, Betul
AU - Topcuoglu, Haluk Rahmi
AU - Kandemir, Mahmut
AU - Tosun, Oguz
PY - 2012
Y1 - 2012
N2 - Locality analysis of an application helps us extract data access patterns and predict runtime cache behavior. In this paper, we propose a locality-aware dynamic mapping algorithm for multithreaded applications, which assigns computations with similar data access patterns to the same cores. We collect the amounts of shared and distinct data used by all computations, called chunks, and calculate the sharing among those chunks. Then, chunks with similar data access patterns are grouped into bins, which are subsequently assigned to threads to improve cache reuse and program performance. Our algorithm is illustrated with sparse matrix-vector multiply (SpMV), one of the most widely used kernels in engineering and scientific computing, which suffers from irregular and indirect memory access patterns. Five inputs with different shapes and characteristics are considered for testing the performance of our algorithm. Based on the results of our experimental study, our algorithm outperforms the Linux scheduler with an average performance improvement of 12.5% across the scenarios considered.
AB - Locality analysis of an application helps us extract data access patterns and predict runtime cache behavior. In this paper, we propose a locality-aware dynamic mapping algorithm for multithreaded applications, which assigns computations with similar data access patterns to the same cores. We collect the amounts of shared and distinct data used by all computations, called chunks, and calculate the sharing among those chunks. Then, chunks with similar data access patterns are grouped into bins, which are subsequently assigned to threads to improve cache reuse and program performance. Our algorithm is illustrated with sparse matrix-vector multiply (SpMV), one of the most widely used kernels in engineering and scientific computing, which suffers from irregular and indirect memory access patterns. Five inputs with different shapes and characteristics are considered for testing the performance of our algorithm. Based on the results of our experimental study, our algorithm outperforms the Linux scheduler with an average performance improvement of 12.5% across the scenarios considered.
UR - http://www.scopus.com/inward/record.url?scp=84862139294&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84862139294&partnerID=8YFLogxK
U2 - 10.1109/PDP.2012.84
DO - 10.1109/PDP.2012.84
M3 - Conference contribution
AN - SCOPUS:84862139294
SN - 9780769546339
T3 - Proceedings - 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2012
SP - 185
EP - 189
BT - Proceedings - 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2012
T2 - 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2012
Y2 - 15 February 2012 through 17 February 2012
ER -