TY - GEN
T1 - Classifying and ranking search engine results as potential sources of plagiarism
AU - Williams, Kyle
AU - Chen, Hung Hsuan
AU - Giles, C. Lee
N1 - Publisher Copyright: © 2014 ACM.
PY - 2014
Y1 - 2014
AB - Source retrieval for plagiarism detection involves using a search engine to retrieve candidate sources of plagiarism for a given suspicious document so that more accurate comparisons can be made. An important consideration is that only documents that are likely to be sources of plagiarism should be retrieved, so as to minimize the number of unnecessary comparisons. A supervised strategy for source retrieval is described whereby search results are classified and ranked as potential sources of plagiarism using only the information available at search time, without retrieving the search result documents. The performance of the supervised method is compared to a baseline method and shown to improve precision by up to 3.28%, recall by up to 2.6%, and the F1 score by up to 3.37%. Furthermore, features are analyzed to determine which are most important for search result classification; features based on document and search result similarity appear to be the most important.
UR - http://www.scopus.com/inward/record.url?scp=84908614059&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84908614059&partnerID=8YFLogxK
U2 - 10.1145/2644866.2644879
DO - 10.1145/2644866.2644879
M3 - Conference contribution
AN - SCOPUS:84908614059
T3 - DocEng 2014 - Proceedings of the 2014 ACM Symposium on Document Engineering
SP - 97
EP - 106
BT - DocEng 2014 - Proceedings of the 2014 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
T2 - 2014 ACM Symposium on Document Engineering, DocEng 2014
Y2 - 16 September 2014 through 19 September 2014
ER -