TY - JOUR
T1 - Why is Data Misclassified? Quantifying Data Difficulty for Machine Learning Models with Dual Probability Difficulty Measure
AU - Afolabi, Ayomide
AU - Aygun, Ramazan
AU - Tran, Truong X.
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Despite the high accuracy of machine learning models, misclassification of data may raise issues regarding the trustworthiness of these models. There are a number of reasons for misclassification, such as model complexity and the difference between the distributions of training data and evaluation data. In this paper, we investigate whether specific data could be hard to classify correctly based on the training dataset or not. The development of measures to quantify the difficulty of data provides insight into assessing the potential performance of such models. However, most of the existing data difficulty measures, which are typically model-based, suffer from various model biases, while the few existing model-agnostic measures do not yield consistent results across datasets and machine learning models. In this paper, we propose a novel model-agnostic measure named Dual Probability Difficulty Measure. This measure considers the probability of the data instance label and the probability of the most frequent label that does not share the same label as the data instance in the K-nearest neighbors to assess the data difficulty. We conducted experiments to evaluate the accuracy of machine learning models with respect to different data difficulty levels and utilized the area under the curve to assess the effectiveness of our proposed method compared to other methods. Our experimental results on five diverse datasets across various machine learning models show that our Dual Probability Difficulty Measure has a correlation with model performance and identifies data likely to be misclassified.
AB - Despite the high accuracy of machine learning models, misclassification of data may raise issues regarding the trustworthiness of these models. There are a number of reasons for misclassification, such as model complexity and the difference between the distributions of training data and evaluation data. In this paper, we investigate whether specific data could be hard to classify correctly based on the training dataset or not. The development of measures to quantify the difficulty of data provides insight into assessing the potential performance of such models. However, most of the existing data difficulty measures, which are typically model-based, suffer from various model biases, while the few existing model-agnostic measures do not yield consistent results across datasets and machine learning models. In this paper, we propose a novel model-agnostic measure named Dual Probability Difficulty Measure. This measure considers the probability of the data instance label and the probability of the most frequent label that does not share the same label as the data instance in the K-nearest neighbors to assess the data difficulty. We conducted experiments to evaluate the accuracy of machine learning models with respect to different data difficulty levels and utilized the area under the curve to assess the effectiveness of our proposed method compared to other methods. Our experimental results on five diverse datasets across various machine learning models show that our Dual Probability Difficulty Measure has a correlation with model performance and identifies data likely to be misclassified.
UR - https://www.scopus.com/pages/publications/105019088845
U2 - 10.1109/TAI.2025.3612934
DO - 10.1109/TAI.2025.3612934
M3 - Article
AN - SCOPUS:105019088845
SN - 2691-4581
JO - IEEE Transactions on Artificial Intelligence
JF - IEEE Transactions on Artificial Intelligence
ER -