TY - GEN
T1 - Cross-failure bug detection in persistent memory programs
AU - Liu, Sihang
AU - Seemakhupt, Korakit
AU - Wei, Yizhou
AU - Wenisch, Thomas
AU - Kolli, Aasheesh
AU - Khan, Samira
N1 - Funding Information:
We thank our anonymous reviewers, Akhil Indurti, and Suyash Mahar for their valuable feedback. This work is supported by NSF and the SRC/DARPA Center for Research on Intelligent Storage and Processing-in-memory (CRISP).
Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/3/9
Y1 - 2020/3/9
AB - Persistent memory (PM) technologies, such as Intel's Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS intermediaries. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee. However, maintaining crash consistency is not trivial. We identify that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after failure. We refer to these stages as the pre- and post-failure execution stages. In order to holistically detect crash consistency bugs, we categorize the underlying causes behind inconsistent recovery due to incorrect interactions between the pre- and post-failure execution. First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to be persisted in all possible access interleavings during the pre-failure stage - a type of programming error that leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bug as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM programs. In this work, we provide XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failure execution, and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of PMDK's examples, a PM-optimized Redis database, and a PMDK library function.
UR - http://www.scopus.com/inward/record.url?scp=85082383924&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082383924&partnerID=8YFLogxK
U2 - 10.1145/3373376.3378452
DO - 10.1145/3373376.3378452
M3 - Conference contribution
AN - SCOPUS:85082383924
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 1187
EP - 1202
BT - ASPLOS 2020 - 25th International Conference on Architectural Support for Programming Languages and Operating Systems
PB - Association for Computing Machinery
T2 - 25th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2020
Y2 - 16 March 2020 through 20 March 2020
ER -