TY - GEN
T1 - Real-time data pre-processing technique for efficient feature extraction in large scale datasets
AU - Liu, Ying
AU - Lita, Lucian V.
AU - Niculescu, R. Stefan
AU - Bai, Kun
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2008
Y1 - 2008
N2 - Due to the continuous and rampant increase in the size of domain-specific data sources, there is a real and sustained need for fast processing in time-sensitive applications, such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery, such as creating evidence-based medicine. Since parallel multi-string matching is at the core of most data mining tasks in these applications, faster on-line matching in static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this data mining need, which is not handled efficiently by traditional information extraction and retrieval techniques, we propose a Block Suffix Shifting-based approach, which improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the different block structures of domain-specific data for off-line and on-line parallel matching. Experiments on several real-world datasets show how our approach translates into significant performance improvements.
UR - http://www.scopus.com/inward/record.url?scp=70349248441&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70349248441&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458211
DO - 10.1145/1458082.1458211
M3 - Conference contribution
AN - SCOPUS:70349248441
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 981
EP - 990
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -