TY - GEN
T1 - Learning parameters of the K-means algorithm from subjective human annotation
AU - Dutta, Haimonti
AU - Passonneau, Rebecca J.
AU - Lee, Austin
AU - Radeva, Axinia
AU - Xie, Boyi
AU - Waltz, David
AU - Taranto, Barbara
PY - 2011/9/9
Y1 - 2011/9/9
N2 - The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high-resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary, and a large number of the articles are labeled "editorial" without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters such as the number of clusters and the mechanism of seeding to ensure that the search is not prone to being caught in a local minimum. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare the performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled "editorial" by the OCR engine.
AB - The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high-resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary, and a large number of the articles are labeled "editorial" without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters such as the number of clusters and the mechanism of seeding to ensure that the search is not prone to being caught in a local minimum. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare the performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled "editorial" by the OCR engine.
UR - http://www.scopus.com/inward/record.url?scp=80052406415&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052406415&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80052406415
SN - 9781577355014
T3 - Proceedings of the 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24
SP - 465
EP - 470
BT - Proceedings of the 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24
T2 - 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24
Y2 - 18 May 2011 through 20 May 2011
ER -