TY - GEN
T1 - Exploring Extreme Quantization in Spiking Language Models
AU - Bal, Malyaban
AU - Jiang, Yi
AU - Sengupta, Abhronil
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Despite the growing prevalence of large language model (LLM) architectures, a crucial concern persists regarding their energy and power consumption, with their energy efficiency still lagging far behind that of the human brain. Recent strides in spiking language models (LMs) and transformer architectures aim to address this concern by harnessing the spiking activity of biological neurons to enhance energy/power efficiency. Doubling down on the principles of model quantization and energy efficiency, this paper proposes a novel binary/ternary (1/1.58-bit) spiking LM architecture. Scalability comparable to a deep spiking LM architecture is achieved through an efficient knowledge distillation technique, wherein knowledge from a non-spiking, full-precision 'teacher' model is transferred to an extremely weight-quantized spiking 'student' LM. Our proposed model represents a significant advancement as the first-of-its-kind 1/1.58-bit spiking LM, and its performance is rigorously evaluated on multiple text classification tasks of the GLUE benchmark.
AB - Despite the growing prevalence of large language model (LLM) architectures, a crucial concern persists regarding their energy and power consumption, with their energy efficiency still lagging far behind that of the human brain. Recent strides in spiking language models (LMs) and transformer architectures aim to address this concern by harnessing the spiking activity of biological neurons to enhance energy/power efficiency. Doubling down on the principles of model quantization and energy efficiency, this paper proposes a novel binary/ternary (1/1.58-bit) spiking LM architecture. Scalability comparable to a deep spiking LM architecture is achieved through an efficient knowledge distillation technique, wherein knowledge from a non-spiking, full-precision 'teacher' model is transferred to an extremely weight-quantized spiking 'student' LM. Our proposed model represents a significant advancement as the first-of-its-kind 1/1.58-bit spiking LM, and its performance is rigorously evaluated on multiple text classification tasks of the GLUE benchmark.
UR - https://www.scopus.com/pages/publications/85214668284
UR - https://www.scopus.com/inward/citedby.url?scp=85214668284&partnerID=8YFLogxK
U2 - 10.1109/ICONS62911.2024.00047
DO - 10.1109/ICONS62911.2024.00047
M3 - Conference contribution
AN - SCOPUS:85214668284
T3 - Proceedings - 2024 International Conference on Neuromorphic Systems, ICONS 2024
SP - 272
EP - 276
BT - Proceedings - 2024 International Conference on Neuromorphic Systems, ICONS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Conference on Neuromorphic Systems, ICONS 2024
Y2 - 30 July 2024 through 2 August 2024
ER -