TY - GEN
T1 - Finding a haystack in haystacks - Simultaneous identification of concepts in large bio-medical corpora
AU - Liu, Ying
AU - Lita, Lucian V.
AU - Niculescu, R. Stefan
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2008
Y1 - 2008
N2 - Since nearly all information is now created digitally, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we introduce a new, fast multi-string pattern matching method called the Block Suffix Shifting (BSS) algorithm, which is based on the well known Aho-Chorasick algorithm. The advantages of our algorithm include: the ability to exploit the natural structure of text, perform significant character shifting, avoid useless backtracking jumps, efficient matching time and avoid the typical "sub-string" false positive errors. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpora simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms (the Aho-Corasick and the Wu-Manber algorithm).
AB - Since nearly all information is now created digitally, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we introduce a new, fast multi-string pattern matching method called the Block Suffix Shifting (BSS) algorithm, which is based on the well known Aho-Chorasick algorithm. The advantages of our algorithm include: the ability to exploit the natural structure of text, perform significant character shifting, avoid useless backtracking jumps, efficient matching time and avoid the typical "sub-string" false positive errors. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpora simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms (the Aho-Corasick and the Wu-Manber algorithm).
UR - http://www.scopus.com/inward/record.url?scp=52649134400&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=52649134400&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972788.61
DO - 10.1137/1.9781611972788.61
M3 - Conference contribution
AN - SCOPUS:52649134400
SN - 9781605603179
T3 - Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130
SP - 668
EP - 679
BT - Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130
PB - Society for Industrial and Applied Mathematics Publications
T2 - 8th SIAM International Conference on Data Mining 2008, Applied Mathematics 130
Y2 - 24 April 2008 through 26 April 2008
ER -