TY - GEN
T1 - Inventor name disambiguation for a patent database using a random forest and DBSCAN
AU - Kim, Kunho
AU - Khabsa, Madian
AU - Giles, C. Lee
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/9/1
Y1 - 2016/9/1
N2 - Inventor name disambiguation is the task of distinguishing each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries in order to get information related to a specific inventor, e.g. a list of all that inventor's patents. We adapt earlier work on author name disambiguation to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records refers to the same person. The DBSCAN algorithm is used for inventor record clustering, and its distance function is derived from the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and to enable parallelization, since each block can be processed simultaneously. In a test on the USPTO patent database, 12 million inventor records were disambiguated in 6.5 hours. Evaluation on the labeled datasets from the USPTO PatentsView competition shows that our algorithm outperforms all algorithms submitted to the competition.
AB - Inventor name disambiguation is the task of distinguishing each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries in order to get information related to a specific inventor, e.g. a list of all that inventor's patents. We adapt earlier work on author name disambiguation to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records refers to the same person. The DBSCAN algorithm is used for inventor record clustering, and its distance function is derived from the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and to enable parallelization, since each block can be processed simultaneously. In a test on the USPTO patent database, 12 million inventor records were disambiguated in 6.5 hours. Evaluation on the labeled datasets from the USPTO PatentsView competition shows that our algorithm outperforms all algorithms submitted to the competition.
UR - http://www.scopus.com/inward/record.url?scp=84989831750&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84989831750&partnerID=8YFLogxK
U2 - 10.1145/2910896.2925465
DO - 10.1145/2910896.2925465
M3 - Conference contribution
AN - SCOPUS:84989831750
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 269
EP - 270
BT - JCDL 2016 - Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016
Y2 - 19 June 2016 through 23 June 2016
ER -