Data samples collected for training machine learning models are typically assumed to be independent and identically distributed (i.i.d.). Recent research has demonstrated that this assumption can be problematic as it simplifies the manifold of structured data. This has motivated different research areas such as data poisoning, model improvement, and explanation of machine learning models. In this work, we study the influence of a sample on determining the intrinsic topological features of its underlying manifold. We propose the Shapley homology framework, which provides a quantitative metric for the influence of a sample of the homology of a simplicial complex. Our proposed framework consists of two main parts: homology analysis, where we compute the Betti number of the target topological space, and Shapley value calculation, where we decompose the topological features of a complex built from data points to individual points. By interpreting the influence as a probability measure, we further define an entropy that reflects the complexity of the data manifold. Furthermore, we provide a preliminary discussion of the connection of the Shapley homology to the Vapnik-Chervonenkis dimension. Empirical studies show that when the zero-dimensional Shapley homology is used on neighboring graphs, samples with higher influence scores have a greater impact on the accuracy of neural networks that determine graph connectivity and on several regular grammars whose higher entropy values imply greater difficulty in being learned.
All Science Journal Classification (ASJC) codes
- Arts and Humanities (miscellaneous)
- Cognitive Neuroscience