TY - GEN
T1 - Inter-Rater Agreement for Social Computing Studies
AU - Salminen, Joni O.
AU - Al-Merekhi, Hind A.
AU - Dey, Partha
AU - Jansen, Bernard J.
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/11/30
Y1 - 2018/11/30
N2 - Different agreement scores are widely used in social computing studies to evaluate the reliability of crowdsourced ratings. In this research, we argue that the concept of agreement is problematic for many rating tasks in computational social science because they are characterized by subjectivity. We demonstrate this claim by analyzing four social computing datasets that are rated by crowd workers, showing that the agreement ratings are low despite deploying proper instructions and platform settings. Findings indicate that the more subjective the rating task, the lower the agreement, suggesting that tasks differ by their inherent subjectivity and that measuring the agreement of social computing tasks might not be the optimal way to ensure data quality. When creating subjective tasks, the use of agreement metrics potentially gives a false picture of the consistency of crowd workers, as they over-simplify the reality of obtaining quality labels. We also provide empirical evidence on the stability of crowd ratings with different numbers of raters, items, and categories, finding that the reliability scores are most sensitive to the number of categories, somewhat less sensitive to the number of raters, and the least sensitive to the number of items. Our findings have implications for computational social scientists using crowdsourcing for data collection.
AB - Different agreement scores are widely used in social computing studies to evaluate the reliability of crowdsourced ratings. In this research, we argue that the concept of agreement is problematic for many rating tasks in computational social science because they are characterized by subjectivity. We demonstrate this claim by analyzing four social computing datasets that are rated by crowd workers, showing that the agreement ratings are low despite deploying proper instructions and platform settings. Findings indicate that the more subjective the rating task, the lower the agreement, suggesting that tasks differ by their inherent subjectivity and that measuring the agreement of social computing tasks might not be the optimal way to ensure data quality. When creating subjective tasks, the use of agreement metrics potentially gives a false picture of the consistency of crowd workers, as they over-simplify the reality of obtaining quality labels. We also provide empirical evidence on the stability of crowd ratings with different numbers of raters, items, and categories, finding that the reliability scores are most sensitive to the number of categories, somewhat less sensitive to the number of raters, and the least sensitive to the number of items. Our findings have implications for computational social scientists using crowdsourcing for data collection.
UR - http://www.scopus.com/inward/record.url?scp=85060064485&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060064485&partnerID=8YFLogxK
U2 - 10.1109/SNAMS.2018.8554744
DO - 10.1109/SNAMS.2018.8554744
M3 - Conference contribution
AN - SCOPUS:85060064485
T3 - 2018 5th International Conference on Social Networks Analysis, Management and Security, SNAMS 2018
SP - 80
EP - 87
BT - 2018 5th International Conference on Social Networks Analysis, Management and Security, SNAMS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Social Networks Analysis, Management and Security, SNAMS 2018
Y2 - 15 October 2018 through 18 October 2018
ER -