RSenter: Tool for topics and terms extraction from unstructured data debris

Richard K. Lomotey, Ralph Deters

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Scopus citations

Abstract

There is an enormous volume of user-generated content (data) today in open source repositories, online social networks, and so on that enterprises can feed on to enhance product and service delivery. Apart from the open source data, enterprises are also generating a lot of data in-house, since modern business requirements are shifting from paper-based to digital records. The major setback, however, is that the data is unstructured: it is in heterogeneous formats (different file types, including multimedia files), it is schema-less, and it is scattered across multiple sources. This condition makes knowledge discovery (a.k.a. data mining) very challenging. Previous studies have proposed the hierarchical clustering methodology since it enhances human readability and provides a clear dependency structure through topic, term, and document organization. However, the methodology can be resource intensive and time consuming. Our work investigates the methodology and proposes a tool called RSenter that searches based on parallelization, random walk (or linear search), pessimistic search, and optimistic search in order to generate the hierarchical structure in real time within a search space. Currently, RSenter can search through NoSQL databases and HTML documents and traverse all the links connected to an HTML document to the nth depth, extracting all user-specified elements (topics and terms). Further, the tool can search through an entire repository and organize the files in a hierarchical structure regardless of file format.
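The depth-limited link traversal mentioned in the abstract can be illustrated with a minimal sketch. This is not RSenter's implementation, which the abstract does not detail. It assumes a plain breadth-first walk, and it reads pages from a hypothetical in-memory mapping (`SAMPLE_PAGES`) instead of fetching them over the network. The names `ElementCollector` and `crawl` are illustrative only.

```python
from collections import deque
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects the text of user-specified tags and the href of every <a>."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.links = []
        self.found = {tag: [] for tag in self.wanted}
        self._open = []  # stack of wanted tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag in self.wanted:
            self._open.append(tag)
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            # Append text to the most recent entry of the innermost open tag.
            self.found[self._open[-1]][-1] += data


def crawl(pages, start, wanted_tags, max_depth):
    """Breadth-first walk over `pages` (url -> HTML), at most max_depth links deep."""
    seen = {start}
    queue = deque([(start, 0)])
    result = {}
    while queue:
        url, depth = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # unknown page: nothing to extract
        parser = ElementCollector(wanted_tags)
        parser.feed(html)
        parser.close()
        result[url] = {
            "depth": depth,
            "elements": {t: [s.strip() for s in v] for t, v in parser.found.items()},
        }
        if depth < max_depth:
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return result


# Hypothetical sample data, used only for illustration.
SAMPLE_PAGES = {
    "a.html": '<h1>Big Data</h1><p>See <a href="b.html">clusters</a>.</p>',
    "b.html": '<h1>Clustering</h1><a href="c.html">next</a>',
    "c.html": "<h1>Too deep</h1>",
}
```

Calling `crawl(SAMPLE_PAGES, "a.html", ["h1"], 1)` visits `a.html` and `b.html` but stops before `c.html`, which sits two links away from the start page.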

Original language: English (US)
Title of host publication: Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013
Pages: 395-402
Number of pages: 8
State: Published - 2013
Event: 2013 IEEE International Congress on Big Data, BigData 2013 - Santa Clara, CA, United States
Duration: Jun 27 2013 to Jul 2 2013

Publication series

Name: Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013

Other

Other: 2013 IEEE International Congress on Big Data, BigData 2013
Country/Territory: United States
City: Santa Clara, CA
Period: 6/27/13 to 7/2/13

All Science Journal Classification (ASJC) codes

  • Computer Science Applications