Token and Head Adaptive Transformers for Efficient Natural Language Processing

Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation


While pre-trained language models like BERT (Devlin et al., 2019) have achieved impressive results on various natural language processing tasks, deploying them on resource-restricted devices is challenging due to their intensive computational cost and memory footprint. Previous approaches mainly focused on training smaller versions of a BERT model with competitive accuracy under limited computational resources. In this paper, we extend the Length Adaptive Transformer (Kim and Cho, 2021) and propose the Token and Head Adaptive Transformer, which can compress and accelerate various BERT-based models via simple fine-tuning. We train a transformer with a progressive token and head pruning scheme, eliminating a large number of redundant tokens and attention heads in the later layers. Then, we conduct a multi-objective evolutionary search, with the overall number of floating point operations (FLOPs) as its efficiency constraint, to find joint token and head pruning strategies that maximize accuracy and efficiency under various computational budgets. Empirical studies show that a large portion of tokens and attention heads can be pruned while achieving superior performance compared to the baseline BERT-based models and Length Adaptive Transformers on various downstream NLP tasks. MobileBERT (Sun et al., 2020) trained with our joint token and head pruning scheme achieves a GLUE score of 83.0, which is 1.4 higher than the Length Adaptive Transformer and 2.9 higher than the original model.
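To make the two pruning axes described in the abstract concrete, the sketch below illustrates (in PyTorch) how a head mask can zero out pruned attention heads and how a token-level saliency heuristic can shorten the sequence passed to later layers. This is a hypothetical illustration, not the authors' code: the function names, the attention-mass saliency score, and the example sizes and budgets are assumptions made for clarity.

```python
# Hypothetical sketch (not the paper's implementation): illustrates head pruning via a
# binary head mask and token pruning via a simple attention-mass saliency heuristic.
import torch
import torch.nn.functional as F

def prune_heads(attn_heads: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """Zero out pruned heads.

    attn_heads: (batch, num_heads, seq_len, head_dim) per-head outputs.
    head_mask:  (num_heads,) with 1 = keep, 0 = prune.
    """
    return attn_heads * head_mask.view(1, -1, 1, 1)

def prune_tokens(hidden: torch.Tensor, attn_probs: torch.Tensor, keep_len: int):
    """Keep the `keep_len` tokens that receive the most attention mass.

    hidden:     (batch, seq_len, dim) hidden states entering the next layer.
    attn_probs: (batch, num_heads, seq_len, seq_len) attention probabilities.
    Returns the shortened hidden states and the indices of the kept tokens.
    """
    # Saliency heuristic (a stand-in, not the paper's exact criterion): total
    # attention each token receives, summed over heads and query positions.
    scores = attn_probs.sum(dim=(1, 2))                  # (batch, seq_len)
    keep_idx = scores.topk(keep_len, dim=-1).indices     # (batch, keep_len)
    keep_idx, _ = keep_idx.sort(dim=-1)                  # preserve original token order
    gathered = hidden.gather(
        1, keep_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    )
    return gathered, keep_idx

if __name__ == "__main__":
    batch, heads, seq, head_dim = 2, 12, 128, 64
    hidden = torch.randn(batch, seq, heads * head_dim)
    attn_probs = F.softmax(torch.randn(batch, heads, seq, seq), dim=-1)
    per_head = torch.randn(batch, heads, seq, head_dim)

    # Example per-layer configuration an evolutionary search might propose:
    # keep 8 of 12 heads and 96 of 128 tokens (illustrative numbers only).
    head_mask = torch.tensor([1.0] * 8 + [0.0] * 4)
    masked_heads = prune_heads(per_head, head_mask)
    shorter_hidden, kept = prune_tokens(hidden, attn_probs, keep_len=96)
    print(masked_heads.shape, shorter_hidden.shape)
```

In this framing, the evolutionary search would explore per-layer choices of `head_mask` and `keep_len`, scoring each candidate by accuracy and total FLOPs under a given budget.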

Original language: English (US)
Pages (from-to): 4575-4584
Number of pages: 10
Journal: Proceedings - International Conference on Computational Linguistics, COLING
Issue number: 1
State: Published - 2022
Event: 29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Korea, Republic of
Duration: Oct 12, 2022 to Oct 17, 2022

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Theoretical Computer Science
