TY - GEN
T1 - Prompting in the Dark
T2 - 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025
AU - He, Zeyu
AU - Naphade, Saniya
AU - Huang, Ting Hao Kenneth
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/4/26
Y1 - 2025/4/26
N2 - Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable - only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the need for, as well as the risks of, automated support in human prompt engineering, providing insights for future tool design.
AB - Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable - only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the need for, as well as the risks of, automated support in human prompt engineering, providing insights for future tool design.
UR - https://www.scopus.com/pages/publications/105005741605
UR - https://www.scopus.com/pages/publications/105005741605#tab=citedBy
U2 - 10.1145/3706598.3714319
DO - 10.1145/3706598.3714319
M3 - Conference contribution
AN - SCOPUS:105005741605
T3 - Conference on Human Factors in Computing Systems - Proceedings
BT - CHI 2025 - Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems
PB - Association for Computing Machinery
Y2 - 26 April 2025 through 1 May 2025
ER -