TY - JOUR
T1 - Just-in-time analytics on large file systems
AU - Huang, H. Howie
AU - Zhang, Nan
AU - Wang, Wei
AU - Das, Gautam
AU - Szalay, Alexander S.
N1 - Funding Information:
The authors thank the anonymous reviewers from IEEE Transactions on Computers, FAST ’11 reviewers, and our FAST shepherd John Bent, for their suggestions that helped significantly improve this paper. They owe them a great deal of gratitude. They also thank Hong Jiang and Yifeng Zhu for their help on replaying the NFS trace, and Ron C. Chiang for his help on the artwork. This work was supported by the US National Science Foundation (NSF) grants OCI-0937875, OCI-0937947, IIS-0845644, CCF-0852674, CNS-0852673, and CNS-0915834. A preliminary version of this paper appeared in the 9th USENIX Conference on File and Storage Technologies (FAST ’11).
PY - 2012
Y1 - 2012
N2 - As file systems reach the petabyte scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries, which, unfortunately, cannot be answered quickly by hierarchical file systems such as ext3 and NTFS. Existing preprocessing-based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes), which in many cases cannot be justified by the infrequent usage of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate (i.e., statistically accurate) answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without requiring any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy, and scalability of Glance.
AB - As file systems reach the petabyte scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries, which, unfortunately, cannot be answered quickly by hierarchical file systems such as ext3 and NTFS. Existing preprocessing-based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes), which in many cases cannot be justified by the infrequent usage of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate (i.e., statistically accurate) answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without requiring any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy, and scalability of Glance.
UR - http://www.scopus.com/inward/record.url?scp=84867298672&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867298672&partnerID=8YFLogxK
U2 - 10.1109/TC.2011.186
DO - 10.1109/TC.2011.186
M3 - Article
AN - SCOPUS:84867298672
SN - 0018-9340
VL - 61
SP - 1651
EP - 1664
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 11
M1 - 6035676
ER -