TY - JOUR
T1 - Block-suffix shifting
T2 - fast, simultaneous medical concept set identification in large medical record corpora.
AU - Liu, Ying
AU - Lita, Lucian Vlad
AU - Niculescu, Radu Stefan
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2008
Y1 - 2008
N2 - Owing to new advances in computer hardware, large text databases have become more prevalent than ever.Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well known Aho-Chorasick algorithm. Advantages of our algorithm include:the ability to exploit the natural structure of text, the ability to perform significant character shifting, avoiding backtracking jumps that are not useful, efficiency in terms of matching time and avoiding the typical "sub-string" false positive errors.Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpora simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms.
AB - Owing to new advances in computer hardware, large text databases have become more prevalent than ever.Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well known Aho-Chorasick algorithm. Advantages of our algorithm include:the ability to exploit the natural structure of text, the ability to perform significant character shifting, avoiding backtracking jumps that are not useful, efficiency in terms of matching time and avoiding the typical "sub-string" false positive errors.Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpora simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms.
UR - http://www.scopus.com/inward/record.url?scp=73949129806&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=73949129806&partnerID=8YFLogxK
M3 - Article
C2 - 18999282
AN - SCOPUS:73949129806
SN - 1559-4076
SP - 424
EP - 428
JO - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
JF - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
ER -