PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking

  • Yan Zhou
  • Chunwei Liu
  • Bhuvan Urgaonkar
  • Zhengle Wang
  • Magnus Mueller
  • Chao Zhang
  • Songyue Zhang
  • Pascal Pfeil
  • Dominik Horn
  • Zhengchun Liu
  • Davide Pagano
  • Tim Kraska
  • Samuel Madden
  • Ju Fan

Research output: Contribution to journal › Conference article › peer-review

Abstract

Cloud service providers commonly use standard benchmarks like TPC-H and TPC-DS to evaluate and optimize cloud data analytics systems. However, these benchmarks rely on fixed query patterns and fail to capture real execution statistics of production cloud workloads. Although some cloud database vendors have recently released real workload traces, these traces alone do not qualify as benchmarks, as they typically lack essential components (i.e., queries and databases). To overcome this limitation, this paper studies a new problem of workload synthesis with real statistics, which generates synthetic workloads that closely approximate real execution statistics, including key performance metrics and operator distributions. To address this problem, we propose PBench, a novel workload synthesizer that constructs synthetic workloads by (1) selecting and combining workload components from existing benchmarks and (2) augmenting new workload components. This paper studies the key challenges in PBench. First, we address the challenge of balancing performance metrics and operator distributions by introducing a multi-objective optimization-based component selection method. Second, to capture the temporal dynamics of real workloads, we design a timestamp assignment method that progressively refines workload timestamps. Third, to handle the disparity between the original workload and the candidate workload, we propose a component augmentation approach that leverages large language models (LLMs) to generate additional workload components while maintaining statistical fidelity. Experimental results show that PBench reduces approximation error by up to 6× compared to state-of-the-art methods.
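
The idea of balancing a performance metric against an operator distribution during component selection can be illustrated with a small sketch. This is not the method from the paper: it assumes a weighted-sum scalarization of the two objectives and a simple greedy search, and every name in it (`Component`, `scalarized_error`, `greedy_select`, the `weight` parameter, the toy candidates) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical workload component: one query with measured statistics."""
    name: str
    cpu_seconds: float
    operators: dict  # operator name -> number of occurrences in the plan


def scalarized_error(selection, target_cpu, target_ops, weight=0.5):
    """Weighted sum of relative CPU-time error and operator-distribution distance."""
    total_cpu = sum(c.cpu_seconds for c in selection)
    cpu_err = abs(total_cpu - target_cpu) / target_cpu

    counts = {}
    for c in selection:
        for op, n in c.operators.items():
            counts[op] = counts.get(op, 0) + n
    total = sum(counts.values())
    if total == 0:
        op_err = 1.0  # an empty selection matches no operator distribution
    else:
        ops = set(counts) | set(target_ops)
        # Total variation distance between the two operator distributions.
        op_err = 0.5 * sum(
            abs(counts.get(op, 0) / total - target_ops.get(op, 0.0)) for op in ops
        )
    return weight * cpu_err + (1 - weight) * op_err


def greedy_select(candidates, target_cpu, target_ops, weight=0.5, max_steps=100):
    """Repeatedly add the candidate that lowers the error most; stop when none does."""
    selection = []
    best = scalarized_error(selection, target_cpu, target_ops, weight)
    for _ in range(max_steps):
        trial = min(
            candidates,
            key=lambda c: scalarized_error(selection + [c], target_cpu, target_ops, weight),
        )
        err = scalarized_error(selection + [trial], target_cpu, target_ops, weight)
        if err >= best:
            break
        selection.append(trial)
        best = err
    return selection, best


# Toy data (made up for illustration, not taken from any real trace).
candidates = [
    Component("scan_heavy", 2.0, {"scan": 3, "join": 1}),
    Component("join_heavy", 4.0, {"scan": 1, "join": 3}),
]
selection, error = greedy_select(
    candidates, target_cpu=6.0, target_ops={"scan": 0.5, "join": 0.5}
)
print([c.name for c in selection], error)
```

A weighted sum is only one way to trade off the two objectives, and greedy search can get stuck in local optima; the sketch is meant to show the shape of the problem, not a recommended solver.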

Original language: English (US)
Pages (from-to): 3883-3895
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 11
State: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: Sep 1 2025 to Sep 5 2025

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • General Computer Science
