TY - GEN
T1 - Quantization for Bayesian Deep Learning
T2 - 26th IEEE International Symposium on Workload Characterization, IISWC 2023
AU - Lin, Jun Liang
AU - Krishnan, Ranganath
AU - Ranipa, Keyur Ruganathbhai
AU - Subedar, Mahesh
AU - Sanghavi, Vrushabh
AU - Arunachalam, Meena
AU - Tickoo, Omesh
AU - Iyer, Ravishankar
AU - Kandemir, Mahmut Taylan
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Bayesian Deep Learning is an emerging field for building robust and trustworthy AI systems due to its ability to estimate reliable uncertainty in neural networks. The need for modeling distribution over parameters and multiple Monte Carlo forward runs in Bayesian neural networks leads to larger model size and significant increase in inference latency compared to deterministic models, which poses challenges for practical deployment. Quantization is a technique that can reduce the model size and also speed up the inference through low-precision computation. In this work, we propose and evaluate a quantization framework and workflow for Bayesian deep learning workloads, which leverages 8-bit integer (INT8) operations to accelerate inference on the 4th Gen Intel Xeon scalable processor (formerly codenamed Sapphire Rapids). We demonstrate that our quantization workflow achieves 6.9x inference throughput speedup on the ImageNet benchmark without sacrificing the model accuracy and quality of uncertainty. Furthermore, we evaluate the effects of quantization on Bayesian neural networks w.r.t. generalizability, robustness against data drift, and its capability in uncertainty estimation on large-scale datasets including a real-world safety-critical application. Our code has been integrated into an open-source project and made available on GitHub at the following URL: https://github.com/IntelLabs/bayesian-torch.
UR - https://www.scopus.com/pages/publications/85177566189
U2 - 10.1109/IISWC59245.2023.00020
DO - 10.1109/IISWC59245.2023.00020
M3 - Conference contribution
AN - SCOPUS:85177566189
T3 - Proceedings - 2023 IEEE International Symposium on Workload Characterization, IISWC 2023
SP - 180
EP - 192
BT - Proceedings - 2023 IEEE International Symposium on Workload Characterization, IISWC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 October 2023 through 3 October 2023
ER -