TY - JOUR
T1 - Using Lexical Chains to Identify Text Difficulty
T2 - A Corpus Statistics and Classification Study
AU - Mukherjee, Partha
AU - Leroy, Gondy
AU - Kauchak, David
N1 - Funding Information:
Manuscript received January 17, 2018; revised March 30, 2018, June 27, 2018, August 23, 2018, and October 17, 2018; accepted December 3, 2018. Date of publication December 6, 2018; date of current version September 4, 2019. This work was supported by the National Library of Medicine of the National Institutes of Health under Award R01LM011975. (Corresponding author: Partha Mukherjee.) P. Mukherjee is with the Engineering Division, The Pennsylvania State University, Great Valley Campus, Malvern, PA 19355 USA (e-mail: [email protected]).
Publisher Copyright:
© 2013 IEEE.
PY - 2019/9
Y1 - 2019/9
AB - Our goal is data-driven discovery of features for text simplification. In this paper, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential with a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text, and with a classification task (11 000 sentences) to determine the usefulness of the features at the sentence level for simplification. For the corpus statistics study, we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of crossing chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naïve Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers, with the best classifier achieving an accuracy of ∼90% (compared to 78% for bag-of-words). Overall, we find that several lexical chain features provide specific information useful for identifying difficult sentences of text, beyond what is available from standard lexical features.
UR - http://www.scopus.com/inward/record.url?scp=85058131390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058131390&partnerID=8YFLogxK
U2 - 10.1109/JBHI.2018.2885465
DO - 10.1109/JBHI.2018.2885465
M3 - Article
C2 - 30530380
AN - SCOPUS:85058131390
SN - 2168-2194
VL - 23
SP - 2164
EP - 2173
JO - IEEE Journal of Biomedical and Health Informatics
JF - IEEE Journal of Biomedical and Health Informatics
IS - 5
M1 - 8565884
ER -