TY - GEN
T1 - Mixed-precision block Gram Schmidt orthogonalization
AU - Yamazaki, Ichitaro
AU - Tomov, Stanimire
AU - Kurzak, Jakub
AU - Dongarra, Jack
AU - Barlow, Jesse
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/11/15
Y1 - 2015/11/15
N2 - The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When there is a large number of columns to be orthogonalized, this computational overhead can have a dramatic impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order.
AB - The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When there is a large number of columns to be orthogonalized, this computational overhead can have a dramatic impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order.
UR - http://www.scopus.com/inward/record.url?scp=84968627043&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84968627043&partnerID=8YFLogxK
U2 - 10.1145/2832080.2832082
DO - 10.1145/2832080.2832082
M3 - Conference contribution
AN - SCOPUS:84968627043
T3 - Proceedings of ScalA 2015: 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of ScalA 2015
PB - Association for Computing Machinery, Inc
T2 - 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2015
Y2 - 15 November 2015 through 20 November 2015
ER -