Evaluating the usefulness of content addressable storage for high-performance data intensive applications

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

28 Scopus citations

Abstract

Content Addressable Storage (CAS) is a data representation technique that operates by partitioning a given data-set into non-intersecting units called chunks and then employing techniques to efficiently recognize chunks occurring multiple times. This allows CAS to eliminate duplicate instances of such chunks, resulting in reduced storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-commerce or entertainment services, and highly parallel scientific/engineering applications and simulations, to name a few. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth whilst trading off error resilience and incurring 14% CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
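
The chunk-and-deduplicate idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-size chunking, the SHA-256 digest, the in-memory dictionary, and the function names (chunk_fixed, cas_store, restore, space_savings) are all assumptions made for illustration, and the savings figure ignores the metadata overhead of storing digests and recipes.

```python
import hashlib


def chunk_fixed(data: bytes, chunk_size: int) -> list:
    """Split data into non-overlapping fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def cas_store(data: bytes, chunk_size: int = 1024):
    """Store each distinct chunk once, keyed by its content digest.

    Returns (store, recipe): store maps digest -> chunk bytes, and recipe is
    the ordered list of digests needed to rebuild the original data.
    SHA-256 is an assumed choice of hash, not one taken from the paper.
    """
    store = {}
    recipe = []
    for chunk in chunk_fixed(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a duplicate chunk is not stored again
        recipe.append(digest)
    return store, recipe


def restore(store: dict, recipe: list) -> bytes:
    """Rebuild the original data by concatenating chunks in recipe order."""
    return b"".join(store[digest] for digest in recipe)


def space_savings(data: bytes, store: dict) -> float:
    """Fraction of raw bytes saved by deduplication, excluding metadata overhead."""
    if not data:
        return 0.0
    stored_bytes = sum(len(chunk) for chunk in store.values())
    return 1.0 - stored_bytes / len(data)
```

For instance, an input made of two distinct 1 KB blocks repeated four times yields eight recipe entries but only two stored chunks. The same sketch hints at the resilience trade-off the abstract mentions: every recipe entry that points at a shared chunk depends on that single stored copy.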

Original language: English (US)
Title of host publication: Proceedings of the 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08
Pages: 35-44
Number of pages: 10
State: Published - 2008
Event: 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08 - Boston, MA, United States
Duration: Jun 23 2008 to Jun 27 2008

Publication series

Name: Proceedings of the 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08

Other

Other: 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08
Country/Territory: United States
City: Boston, MA
Period: 6/23/08 to 6/27/08

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software
