TY - CPAPER
T1 - Data layout optimization for GPGPU architectures
AU - Liu, Jun
AU - Ding, Wei
AU - Jang, Ohyoung
AU - Kandemir, Mahmut
PY - 2013
AB - GPUs are widely used to accelerate general-purpose applications, leading to the emergence of GPGPU architectures. New programming models, e.g., the Compute Unified Device Architecture (CUDA), have been proposed to facilitate programming general-purpose computations on GPGPUs. However, manually writing high-performance CUDA code is still tedious and difficult. In particular, the organization of data in the memory space can greatly affect performance due to the unique features of the GPGPU memory hierarchy. In this work, we propose an automatic data layout transformation framework that addresses the key issues associated with the GPGPU memory hierarchy (i.e., channel skewing, data coalescing, and bank conflicts). Our approach employs a widely applicable strategy based on a novel concept called data localization. Specifically, we optimize the layout of the arrays accessed in affine loop nests, for both device memory and shared memory, at both coarse-grain and fine-grain parallelization levels. We evaluated our data layout optimization strategy using 15 benchmarks on an NVIDIA CUDA GPU device. The results show that the proposed data transformation approach yields a speedup of around 4.3X on average.
UR - http://www.scopus.com/inward/record.url?scp=84875156075&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84875156075&partnerID=8YFLogxK
DO - 10.1145/2442516.2442546
M3 - Conference contribution
AN - SCOPUS:84875156075
SN - 9781450319225
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
SP - 283
EP - 284
BT - PPoPP 2013 - Proceedings of the 2013 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
T2 - 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013
Y2 - 23 February 2013 through 27 February 2013
ER -