TY - GEN
T1 - Learning classifiers from chains of multiple interlinked RDF data stores
AU - Lin, Harris T.
AU - Honavar, Vasant
PY - 2013
Y1 - 2013
N2 - The emergence of many interlinked, physically distributed, and autonomously maintained RDF stores offers unprecedented opportunities for predictive modeling and knowledge discovery from such data. However, existing machine learning approaches are limited in their applicability because it is neither desirable nor feasible to gather all of the data in a centralized location for analysis, owing to access, memory, bandwidth, and computational restrictions, and sometimes privacy and confidentiality constraints. Against this background, we consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, we: (i) introduce statistical-query-based formulations of several representative algorithms for learning classifiers from RDF data, (ii) introduce a distributed learning framework to learn classifiers from multiple interlinked RDF stores that form a chain, (iii) identify three special cases of RDF data fragmentation and describe effective strategies for learning predictive models in each case, (iv) consider a novel application of a matrix reconstruction technique from the field of Computerized Tomography [1] to approximate the statistics needed by the learning algorithm from projections using count queries, thus dramatically reducing the amount of information transmitted from the remote data sources to the learner, and (v) report results of experiments with a real-world social network data set (Last.fm), which demonstrate the feasibility of the proposed approach.
AB - The emergence of many interlinked, physically distributed, and autonomously maintained RDF stores offers unprecedented opportunities for predictive modeling and knowledge discovery from such data. However, existing machine learning approaches are limited in their applicability because it is neither desirable nor feasible to gather all of the data in a centralized location for analysis, owing to access, memory, bandwidth, and computational restrictions, and sometimes privacy and confidentiality constraints. Against this background, we consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, we: (i) introduce statistical-query-based formulations of several representative algorithms for learning classifiers from RDF data, (ii) introduce a distributed learning framework to learn classifiers from multiple interlinked RDF stores that form a chain, (iii) identify three special cases of RDF data fragmentation and describe effective strategies for learning predictive models in each case, (iv) consider a novel application of a matrix reconstruction technique from the field of Computerized Tomography [1] to approximate the statistics needed by the learning algorithm from projections using count queries, thus dramatically reducing the amount of information transmitted from the remote data sources to the learner, and (v) report results of experiments with a real-world social network data set (Last.fm), which demonstrate the feasibility of the proposed approach.
UR - http://www.scopus.com/inward/record.url?scp=84885967020&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84885967020&partnerID=8YFLogxK
U2 - 10.1109/BigData.Congress.2013.22
DO - 10.1109/BigData.Congress.2013.22
M3 - Conference contribution
AN - SCOPUS:84885967020
SN - 9780768550060
T3 - Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013
SP - 94
EP - 101
BT - Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013
T2 - 2013 IEEE International Congress on Big Data, BigData 2013
Y2 - 27 June 2013 through 2 July 2013
ER -