Research Project Description
Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing
Oak Ridge National Laboratory
Oak Ridge, TN
The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several high-performance computing (HPC) resilience technologies have been developed. However, there are currently no tools, methods, or metrics to compare them and to identify the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption. This project focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate the interfaces and responsibilities of individual hardware/software components.
The primary goal of this project is to provide the tools and data needed by HPC vendors to decide on future architectures and to enable direct feedback to HPC vendors on emerging resilience threats.
Applicants should have experience with hardware and/or software fault tolerance in computer systems, parallel discrete event simulation of computer systems, and modeling of the performance and power characteristics of computer systems.
How to Apply:
You must apply through the ORNL Talent and Opportunity System. Please note that the deadline to apply for this posting is January 10, 2014.