TY - GEN
T1 - HARRA
T2 - 13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010
AU - Kim, Hung Sik
AU - Lee, Dongwon
PY - 2010
Y1 - 2010
N2 - We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose the Iterative Locality-Sensitive Hashing (ILSH) that dynamically merges LSH-based has tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named as HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).
AB - We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose the Iterative Locality-Sensitive Hashing (ILSH) that dynamically merges LSH-based has tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named as HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).
UR - http://www.scopus.com/inward/record.url?scp=77952280581&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952280581&partnerID=8YFLogxK
U2 - 10.1145/1739041.1739104
DO - 10.1145/1739041.1739104
M3 - Conference contribution
AN - SCOPUS:77952280581
SN - 9781605589459
T3 - Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
SP - 525
EP - 536
BT - Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
Y2 - 22 March 2010 through 26 March 2010
ER -