Abstract
Several authors have addressed learning a classifier given a mixed labeled/unlabeled training set. These works assume each unlabeled sample originates from one of the (known) classes. Here, we consider the scenario in which unlabeled points may belong either to known/predefined or to heretofore undiscovered classes. There are several practical situations where such data may arise. We propose a novel statistical mixture model which views as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each point. Two types of mixture components are posited to explain label presence/absence. "Predefined" components generate both labeled and unlabeled points and assume labels are missing at random. "Nonpredefined" components only generate unlabeled points; thus, in localized regions, they capture data subsets that are exclusively unlabeled. Such subsets may represent an outlier distribution, or new classes. The components' predefined/nonpredefined natures are data-driven, learned along with the other parameters via an algorithm based on expectation-maximization (EM). There are three natural applications: 1) robust classifier design, given a mixed training set with outliers; 2) classification with rejections; 3) identification of the unlabeled points (and their representative components) that originate from unknown classes, i.e., new class discovery. We evaluate our method and alternative approaches on both synthetic and real-world data sets.
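The idea of treating label presence/absence as observed data can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' algorithm: it uses one-dimensional Gaussian components, and instead of a hard predefined/nonpredefined switch it gives each component a learned probability `rho` that its points carry a label. A component whose `rho` falls near zero then plays the role of a "nonpredefined" component. The function name `fit_mixture`, the parameter names (`alpha`, `rho`, `beta`), the 0.05 threshold, and the synthetic data are all hypothetical choices made for this example.

```python
import math
import random


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(xs, labels, n_classes, init_means, n_iter=50):
    """EM for a 1-D Gaussian mixture over (feature, label, label presence).

    labels[i] is a class index, or None when point i is unlabeled.
    Assumed simplification: each component k has its own probability
    rho[k] of emitting a label; rho[k] near 0 means "unlabeled only".
    """
    K = len(init_means)
    means = list(init_means)
    variances = [1.0] * K
    alpha = [1.0 / K] * K                      # mixing weights
    rho = [0.5] * K                            # P(label present | component)
    beta = [[1.0 / n_classes] * n_classes for _ in range(K)]  # P(class | component)

    for _ in range(n_iter):
        # E-step: component posteriors given x, the label, and label presence.
        resp = []
        for x, c in zip(xs, labels):
            joint = []
            for k in range(K):
                p = alpha[k] * gauss_pdf(x, means[k], variances[k])
                p *= (1.0 - rho[k]) if c is None else rho[k] * beta[k][c]
                joint.append(p)
            total = sum(joint) or 1e-300       # guard against underflow
            resp.append([p / total for p in joint])

        # M-step: re-estimate every parameter from the soft assignments.
        for k in range(K):
            nk = sum(r[k] for r in resp) + 1e-12
            nk_lab = sum(r[k] for r, c in zip(resp, labels) if c is not None)
            alpha[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-3)      # variance floor
            rho[k] = nk_lab / nk
            if nk_lab > 1e-9:                  # keep old beta if no labeled mass
                beta[k] = [
                    sum(r[k] for r, c in zip(resp, labels) if c == j) / nk_lab
                    for j in range(n_classes)
                ]

    return {
        "alpha": alpha, "means": means, "variances": variances,
        "rho": rho, "beta": beta,
        # Hypothetical threshold for calling a component "nonpredefined".
        "nonpredefined": [r < 0.05 for r in rho],
    }


# Synthetic data: two known classes (half of each labeled) plus one
# unknown class that contributes only unlabeled points.
rng = random.Random(0)
xs, labels = [], []
for cls, center in enumerate([-5.0, 0.0]):
    for i in range(100):
        xs.append(rng.gauss(center, 1.0))
        labels.append(cls if i % 2 == 0 else None)
for _ in range(100):
    xs.append(rng.gauss(5.0, 1.0))
    labels.append(None)

model = fit_mixture(xs, labels, n_classes=2, init_means=[-4.0, 1.0, 4.0])
print(model["rho"], model["nonpredefined"])
```

In this toy setting, points assigned mostly to a component flagged in `model["nonpredefined"]` would be the candidates for rejection or new class discovery, while the remaining components supply class posteriors through `beta`.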
Original language | English (US) |
---|---|
Pages (from-to) | 809-812 |
Number of pages | 4 |
Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
Volume | 2 |
State | Published - 2003 |
Event | 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing - Hong Kong, Hong Kong; Duration: Apr 6 2003 → Apr 10 2003 |
All Science Journal Classification (ASJC) codes
- Software
- Signal Processing
- Electrical and Electronic Engineering