Will they blend? Exploring big data computation atop traditional HPC NAS storage

Ellis H. Wilson, Mahmut T. Kandemir, Garth Gibson

Research output: Chapter in Book/Report/Conference proceedingConference contribution

9 Scopus citations


The Apache Hadoop framework has rung in a new era in how data-rich organizations can process, store, and analyze large amounts of data. This has resulted in increased potential for an infrastructure exodus from the traditional solution of commercial database ad-hoc analytics on network-attached storage (NAS). While many data-rich organizations can afford to either move entirely to Hadoop for their Big Data analytics, or to maintain their existing traditional infrastructures and acquire a new set of infrastructure solely for Hadoop jobs, most supercomputing centers do not enjoy either of those possibilities. Too much of the existing scientific code is tailored to work on massively parallel file systems unlike the Hadoop Distributed File System (HDFS), and their datasets are too large to reasonably maintain and/or ferry between two distinct storage systems. Nevertheless, as scientists search for easier-to-program frameworks with a lower time-to-science to post-process their huge datasets after execution, there is increasing pressure to enable use of MapReduce within these traditional High Performance Computing (HPC) architectures. Therefore, in this work we explore potential means to enable use of the easy-to-program Hadoop MapReduce framework without requiring a complete infrastructure overhaul from existing HPC NAS solutions. We demonstrate that retaining function-dedicated resources like NAS is not only possible, but can even be effected efficiently with MapReduce. In our exploration, we unearth subtle pitfalls resultant from this mash-up of new-era Big Data computation on conventional HPC storage and share the clever architectural configurations that allow us to avoid them. Last, we design and present a novel Hadoop File System, the Reliable Array of Independent NAS File System (RainFS), and experimentally demonstrate its improvements in performance and reliability over the previous architectures we have investigated.

Original languageEnglish (US)
Title of host publicationProceedings - International Conference on Distributed Computing Systems
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages11
ISBN (Electronic)9781479951680
StatePublished - Aug 29 2014
Event2014 IEEE 34th International Conference on Distributed Computing Systems, ICDCS 2014 - Madrid, Spain
Duration: Jun 30 2014Jul 3 2014

Publication series

NameProceedings - International Conference on Distributed Computing Systems


Other2014 IEEE 34th International Conference on Distributed Computing Systems, ICDCS 2014

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications


Dive into the research topics of 'Will they blend? Exploring big data computation atop traditional HPC NAS storage'. Together they form a unique fingerprint.

Cite this