This paper studies the fundamental limits of storage for structured data, where statistics and structure are both critical to the application. Accordingly, a framework is proposed for optimal lossless and lossy compression of subsets of the possible realizations of a discrete memoryless source (DMS). For the lossless subset-compression problem, it turns out that the optimal source code may not index the conventional source-typical sequences, but rather index certain subset-typical sequences consistent with the subset of interest. Building upon an achievability and a strong converse, an analytic expression is given, based on the Shannon entropy, relative entropy, and subset entropy, which identifies such subset-typical sequences for a broad class of subsets of a DMS. Intuitively, subset-typical sequences belong to those typical sets which highly intersect the subset of interest but are still closest to the source distribution in the sense of relative entropy. For the lossy subset-compression problem, an upper bound is derived on the subset rate-distortion function in terms of the subset mutual information optimized over the set of conditional distributions that satisfy the expected distortion constraint with respect to the subset-typical distribution and over a set of certain auxiliary subsets. By proving a strong converse result, this upper bound is shown to be tight for a class of symmetric subsets. As shown in our numerical examples, more often than not, one achieves a gain in the fundamental limits, in that the optimal compression rate for the subset in both the lossless and lossy settings can be strictly smaller than the source entropy and the source rate-distortion function, respectively, although exceptions are also possible.
All Science Journal Classification (ASJC) codes
- Information Systems
- Computer Science Applications
- Library and Information Sciences