TY - GEN
T1 - How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective
AU - Xiao, Teng
AU - Li, Mingxiao
AU - Yuan, Yige
AU - Zhu, Huaisheng
AU - Cui, Chao
AU - Honavar, Vasant G.
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - This paper introduces a novel generalized self-imitation learning (GSIL) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop GSIL by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. GSIL eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, GSIL encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view of alignment with demonstration data. Extensive experiments show that GSIL consistently and significantly outperforms baselines on many challenging benchmarks, such as coding (HumanEval), mathematical reasoning (GSM8K), and instruction following (MT-Bench). Code is publicly available at https://github.com/tengxiao1/GSIL.
AB - This paper introduces a novel generalized self-imitation learning (GSIL) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop GSIL by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. GSIL eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, GSIL encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view of alignment with demonstration data. Extensive experiments show that GSIL consistently and significantly outperforms baselines on many challenging benchmarks, such as coding (HumanEval), mathematical reasoning (GSM8K), and instruction following (MT-Bench). Code is publicly available at https://github.com/tengxiao1/GSIL.
UR - http://www.scopus.com/inward/record.url?scp=85208037270&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85208037270&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.emnlp-main.744
DO - 10.18653/v1/2024.emnlp-main.744
M3 - Conference contribution
AN - SCOPUS:85208037270
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 13413
EP - 13426
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Y2 - 12 November 2024 through 16 November 2024
ER -