TY - GEN
T1 - Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources
AU - Caragea, Doina
AU - Zhang, Jun
AU - Bao, Jie
AU - Pathak, Jyotishman
AU - Honavar, Vasant
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2005
Y1 - 2005
N2 - The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. This has created unprecedented opportunities for data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains such as bioinformatics, environmental informatics, enterprise informatics, and social informatics (among others). However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles in acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees (relative to their centralized or batch counterparts). It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user's point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open-source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources.
AB - The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. This has created unprecedented opportunities for data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains such as bioinformatics, environmental informatics, enterprise informatics, and social informatics (among others). However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles in acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees (relative to their centralized or batch counterparts). It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user's point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open-source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources.
UR - http://www.scopus.com/inward/record.url?scp=33646515517&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33646515517&partnerID=8YFLogxK
U2 - 10.1007/11564089_5
DO - 10.1007/11564089_5
M3 - Conference contribution
AN - SCOPUS:33646515517
SN - 354029242X
SN - 9783540292425
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 13
EP - 44
BT - Algorithmic Learning Theory - 16th International Conference, ALT 2005, Proceedings
T2 - 16th International Conference on Algorithmic Learning Theory, ALT 2005
Y2 - 8 October 2005 through 11 October 2005
ER -