TY - GEN
T1 - Quantifying Data Difficulty with Polarized K-Entropy for Assessing Machine Learning Models
AU - Afolabi, Ayomide
AU - Aygun, Ramazan
AU - Tran, Truong X.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Data difficulty level measurement is a critical aspect of machine learning performance evaluation. Several measures have been used to assess the difficulty level of classifying data points in binary classification. However, these measures typically involve building a machine learning model first, which is then used to assess the data difficulty level. In this paper, we propose a novel model-agnostic measure, named polarized K-entropy, to evaluate the difficulty of classifying a data instance. Our measure computes entropy from the nearest neighbors of a data point. We conducted experiments to evaluate the effectiveness of our proposed method by analyzing how the accuracy of machine learning models changes with respect to data difficulty. We used Spearman's rank correlation coefficient to analyze this relationship for neural network, support vector machine, and random forest models. Our results show that our measure outperformed the non-conformity measure in all experiments conducted on six datasets using the selected machine learning models.
UR - https://www.scopus.com/pages/publications/85207840728
U2 - 10.1109/IRI62200.2024.00015
DO - 10.1109/IRI62200.2024.00015
M3 - Conference contribution
AN - SCOPUS:85207840728
T3 - Proceedings - 2024 IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2024
SP - 7
EP - 12
BT - Proceedings - 2024 IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2024
Y2 - 7 August 2024 through 9 August 2024
ER -