A variety of agreement scores are widely used in social computing studies to evaluate the reliability of crowdsourced ratings. In this research, we argue that the concept of agreement is problematic for many rating tasks in computational social science because these tasks are characterized by subjectivity. We demonstrate this claim by analyzing four social computing datasets rated by crowd workers, showing that agreement scores are low despite careful task instructions and platform settings. Our findings indicate that the more subjective the rating task, the lower the agreement, suggesting that tasks differ in their inherent subjectivity and that measuring agreement in social computing tasks might not be the optimal way to ensure data quality. For subjective tasks, agreement metrics can give a false picture of the consistency of crowd workers, as they oversimplify the reality of obtaining quality labels. We also provide empirical evidence on the stability of crowd ratings under varying numbers of raters, items, and categories, finding that reliability scores are most sensitive to the number of categories, somewhat less sensitive to the number of raters, and least sensitive to the number of items. Our findings have implications for computational social scientists using crowdsourcing for data collection.
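For readers unfamiliar with agreement metrics, the sketch below shows how one common pairwise score, Cohen's kappa, can be computed with scikit-learn. The ratings are made up for illustration and do not come from the datasets analyzed in this paper, nor is Cohen's kappa necessarily the specific metric the study reports.

```python
# Minimal sketch: pairwise inter-rater agreement with Cohen's kappa.
# Assumes scikit-learn is installed; the ratings below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two crowd workers rating the same ten items
# on a three-point scale (0 = negative, 1 = neutral, 2 = positive).
rater_a = [0, 1, 2, 2, 0, 1, 1, 2, 0, 1]
rater_b = [0, 1, 2, 1, 0, 2, 1, 2, 0, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw percent agreement for the agreement expected by chance; values near 1 indicate strong consistency, while values near 0 indicate agreement no better than chance, which is the pattern this paper associates with highly subjective tasks.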