TY - GEN
T1 - Evaluating the usefulness of content addressable Storage For high-performance data intensive applications
AU - Nath, Partho
AU - Urgaonkar, Bhuvan
AU - Sivasubramaniam, Anand
N1 - Copyright:
Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2008
Y1 - 2008
N2 - Content Addressable Storage (CAS) is a data representation technique that operates by partitioning a given data-set into non-intersecting units called chunks and then employing techniques to efficiently recognize chunks occurring multiple times. This allows CAS to eliminate duplicate instances of such chunks, resulting in reduced storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-eommerce or entertainment services and highly parallel scientific/engineering applications and simulations, to name a few. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth whilst trading off error resilience and incurring 14% CAS related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
AB - Content Addressable Storage (CAS) is a data representation technique that operates by partitioning a given data-set into non-intersecting units called chunks and then employing techniques to efficiently recognize chunks occurring multiple times. This allows CAS to eliminate duplicate instances of such chunks, resulting in reduced storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-eommerce or entertainment services and highly parallel scientific/engineering applications and simulations, to name a few. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth whilst trading off error resilience and incurring 14% CAS related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
UR - http://www.scopus.com/inward/record.url?scp=57349192776&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57349192776&partnerID=8YFLogxK
U2 - 10.1145/1383422.1383428
DO - 10.1145/1383422.1383428
M3 - Conference contribution
AN - SCOPUS:57349192776
SN - 9781595939975
SN - 9781595939975
T3 - Proceedings of the 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08
SP - 35
EP - 44
BT - Proceedings of the 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08
T2 - 17th International Symposium on High Performance Distributed Computing 2008, HPDC'08
Y2 - 23 June 2008 through 27 June 2008
ER -