TY - GEN
T1 - Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers
AU - Sampson, Jack
AU - González, Rubén
AU - Collard, Jean-Francois
AU - Jouppi, Norman P.
AU - Schlansker, Mike
AU - Calder, Brad
PY - 2006
Y1 - 2006
N2 - We examine the ability of CMPs, due to their lower on-chip communication latencies, to exploit data parallelism at inner-loop granularities similar to those commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of different barrier mechanisms upon the efficiency of this approach. To further exploit the potential of CMPs for fine-grained data parallel tasks, we present barrier filters, a mechanism for fast barrier synchronization on chip multiprocessors to enable vector computations to be efficiently distributed across the cores of a CMP. We ensure that all threads arriving at a barrier require an unavailable cache line to proceed, and, by placing additional hardware in the shared portions of the memory subsystem, we starve their requests until all threads have arrived. Specifically, our approach uses invalidation requests to both make cache lines unavailable and identify when a thread has reached the barrier. We examine two types of barrier filters, one synchronizing through instruction cache lines, and the other through data cache lines.
UR - http://www.scopus.com/inward/record.url?scp=40349086066&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=40349086066&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2006.23
DO - 10.1109/MICRO.2006.23
M3 - Conference contribution
AN - SCOPUS:40349086066
SN - 0769527329
SN - 9780769527321
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 235
EP - 246
BT - Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-39
T2 - 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-39
Y2 - 9 December 2006 through 13 December 2006
ER -