TY - GEN
T1 - Tprof
T2 - 12th Annual ACM Symposium on Cloud Computing, SoCC 2021
AU - Huang, Lexiang
AU - Zhu, Timothy
N1 - Publisher Copyright: © 2021 Association for Computing Machinery.
PY - 2021/11/1
Y1 - 2021/11/1
AB - The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.
UR - http://www.scopus.com/inward/record.url?scp=85119253606&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119253606&partnerID=8YFLogxK
U2 - 10.1145/3472883.3486994
DO - 10.1145/3472883.3486994
M3 - Conference contribution
AN - SCOPUS:85119253606
T3 - SoCC 2021 - Proceedings of the 2021 ACM Symposium on Cloud Computing
SP - 76
EP - 91
BT - SoCC 2021 - Proceedings of the 2021 ACM Symposium on Cloud Computing
PB - Association for Computing Machinery, Inc
Y2 - 1 November 2021 through 4 November 2021
ER -