TY - JOUR
T1 - Thread vulnerability in parallel applications
AU - Oz, Isil
AU - Topcuoglu, Haluk Rahmi
AU - Kandemir, Mahmut
AU - Tosun, Oguz
N1 - Funding Information:
This research was supported by The Scientific and Technological Research Council of Turkey (TUBITAK) with a research grant (Project Number: 108E035).
Author Biography:
Mahmut Kandemir is a professor in the Computer Science and Engineering Department at the Pennsylvania State University. He is a member of the Microsystems Design Lab. Dr. Kandemir’s research interests are in optimizing compilers, runtime systems, embedded systems, I/O and high performance storage, and power-aware computing. He is the author of more than 80 journal publications and over 300 conference/workshop papers in these areas. He has graduated 11 Ph.D. and 8 masters students so far, and is currently supervising 15 Ph.D. students and 1 masters student. He has served on the program committees of 40 conferences and workshops. His research is funded by NSF, DARPA, and SRC. He is a recipient of the NSF CAREER Award and the Penn State Engineering Society Outstanding Research Award. He currently serves as the Graduate Coordinator of the Computer Science and Engineering Department at Penn State.
Copyright:
Copyright 2012 Elsevier B.V., All rights reserved.
PY - 2012/10
Y1 - 2012/10
N2 - Continuously shrinking transistor sizes and the aggressive low-power operating modes employed by modern architectures tend to increase transient error rates. Concurrently, multicore machines dominate the architectural spectrum today in various application domains. These two trends require a fresh look at the resiliency of multithreaded applications against transient errors from a software perspective. In this paper, we propose and evaluate a new metric called the Thread Vulnerability Factor (TVF). A distinguishing characteristic of TVF is that its calculation for a given thread (typically one of the threads of a multithreaded application) depends not only on its own code but also on the code of the threads that share resources and data with it. As a result, we decompose the TVF of a thread into two complementary parts: local and remote. While the former captures the TVF induced by the code of the target thread, the latter represents the vulnerability impact of the threads that interact with the target thread. We quantify the local and remote TVF values for three architectural components (register file, ALUs, and caches) using a set of ten multithreaded applications from the PARSEC and SPLASH-2 benchmark suites. Our experimental evaluation shows that TVF values tend to increase as the number of cores increases, which means the system becomes more vulnerable as the core count rises. We further discuss how the TVF metric can be employed to explore performance-reliability tradeoffs in multicores. Reliability-based analysis of compiler optimizations and redundancy-based fault tolerance are also mentioned as potential uses of our TVF metric.
AB - Continuously shrinking transistor sizes and the aggressive low-power operating modes employed by modern architectures tend to increase transient error rates. Concurrently, multicore machines dominate the architectural spectrum today in various application domains. These two trends require a fresh look at the resiliency of multithreaded applications against transient errors from a software perspective. In this paper, we propose and evaluate a new metric called the Thread Vulnerability Factor (TVF). A distinguishing characteristic of TVF is that its calculation for a given thread (typically one of the threads of a multithreaded application) depends not only on its own code but also on the code of the threads that share resources and data with it. As a result, we decompose the TVF of a thread into two complementary parts: local and remote. While the former captures the TVF induced by the code of the target thread, the latter represents the vulnerability impact of the threads that interact with the target thread. We quantify the local and remote TVF values for three architectural components (register file, ALUs, and caches) using a set of ten multithreaded applications from the PARSEC and SPLASH-2 benchmark suites. Our experimental evaluation shows that TVF values tend to increase as the number of cores increases, which means the system becomes more vulnerable as the core count rises. We further discuss how the TVF metric can be employed to explore performance-reliability tradeoffs in multicores. Reliability-based analysis of compiler optimizations and redundancy-based fault tolerance are also mentioned as potential uses of our TVF metric.
UR - http://www.scopus.com/inward/record.url?scp=84865062359&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865062359&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2012.05.002
DO - 10.1016/j.jpdc.2012.05.002
M3 - Article
AN - SCOPUS:84865062359
SN - 0743-7315
VL - 72
SP - 1171
EP - 1185
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
IS - 10
ER -