A Predictive Fault Avoidance Scheme for Coarse-Grained Reconfigurable Architecture
Toshihiro Kameda (1), Hiroaki Konoura (1), Dawood Alnajjar (1), Yukio Mitsuyama (2), Masanori Hashimoto (1), Takao Onoye (1)
hasimoto@ist.osaka-u.ac.jp
(1) Osaka University & JST, CREST
(2) Kochi University of Technology & JST, CREST
Background
- Aging effects are becoming significant
  - Larger delay margin, lower performance
- Coping with aging-induced delay increase
  - Suppressing aging effects
  - Eliminating faulty modules (focus of this work)
- Tests for reconfigurable devices
  - Manufacturer test: individual BEs (basic elements) on a chip satisfy the specification
  - User test: mapped circuits satisfy the specification
    - The speed requirement of BEs on non-critical paths can be relaxed.
Objective and Contribution
- Proposed a scheme for identifying a pair of a faulty BE and a healthy BE to avoid setup delay faults
- Added a small circuit for delay fault prediction
- Experimentally verified how much slack is necessary to ensure fault prediction
Requirements
- User test is good enough.
  - Manufacturing test is over-testing and shortens lifetime.
- Faults must be predicted before they happen.
  - Error recovery is expensive.
- Testing needs to guide faulty BE elimination.
  - Faulty BE identification is not enough.
- Clock manipulation is not allowed.
  - Other circuits should continue to work during faulty BE elimination.
Proposed Fault Avoidance Procedure
[Flowchart, in standby mode: Start -> Path selection -> Slack assessment -> "Slack too small?" -> Identify a pair of BEs w/ slack assessment -> BE replacement -> Tests for paths including replaced BE -> "All tests passed?" -> "Standby time ends?" -> End]
- Select a path and assess its path slack.
- If the slack is smaller than the threshold, do BE replacement:
  - Identify a pair of BEs maximizing the path slack, and replace them.
  - Test all the paths going through the replaced BE.
- Continue while in standby mode.
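The procedure above can be sketched in software as follows. This is a minimal model under stated assumptions, not the authors' implementation: slack is computed directly from known BE delays instead of being bracketed by on-chip tests, a "replacement" is modelled as swapping the physical BEs assigned to two logical nodes, and the standby-time check is omitted. All names (`path_slack`, `avoid_faults`, `mapping`, `delay_ps`) are hypothetical.

```python
def path_slack(path, mapping, delay_ps, cycle_ps):
    """Slack of a path: cycle time minus the summed delay of its BEs."""
    return cycle_ps - sum(delay_ps[mapping[node]] for node in path)


def avoid_faults(paths, mapping, delay_ps, cycle_ps, threshold_ps):
    """One pass over all paths; mutates `mapping` and returns the swaps made.

    mapping:  logical node -> physical BE (assumed model)
    delay_ps: physical BE -> delay in ps (assumed to be known here)
    """
    swaps = []
    for path in paths:
        if path_slack(path, mapping, delay_ps, cycle_ps) >= threshold_ps:
            continue  # slack is large enough, nothing to do
        # Enumerate every (node on the path, node off the path) pair.
        candidates = []
        for a in path:
            for b in mapping:
                if b in path:
                    continue
                trial = dict(mapping)
                trial[a], trial[b] = mapping[b], mapping[a]
                slack = path_slack(path, trial, delay_ps, cycle_ps)
                candidates.append((slack, a, b, trial))
        # Try the pair that maximizes the path slack first.
        candidates.sort(key=lambda c: c[0], reverse=True)
        for _, a, b, trial in candidates:
            # Test all paths going through either replaced node.
            affected = [p for p in paths if a in p or b in p]
            if all(path_slack(p, trial, delay_ps, cycle_ps) >= threshold_ps
                   for p in affected):
                mapping.update(trial)
                swaps.append((a, b))
                break
    return swaps


# Illustrative values only: one tight path and one relaxed path.
delay_ps = {"BE_slow": 4900, "BE_b": 2500, "BE_c": 2500, "BE_fast": 4100}
mapping = {"n2": "BE_slow", "n3": "BE_b", "n4": "BE_c", "other": "BE_fast"}
paths = [["n2", "n3", "n4"], ["other"]]
swaps = avoid_faults(paths, mapping, delay_ps, cycle_ps=10000, threshold_ps=200)
```

In this example the slow BE moves to the relaxed single-BE path, where its extra delay does no harm. Trying the next-best pair when a test fails is an assumption about the "All tests passed? NO" branch of the flowchart.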
Slack Assessment w/ Selectable Delay
[Figure: a normal path (upper) and a path for slack assessment (lower) that goes through a tunable delay]
- Normal operation: upper path is selected.
- Slack assessment: lower path is selected.
- Testing w/ various delays gives the slack range.
- Ex.: test passes w/ 100 ps delay and fails w/ 200 ps delay, so 100 ps < slack < 200 ps.
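Bracketing the slack from a set of pass/fail tests might look like the sketch below. The callback `path_test` stands in for running the on-chip path test with a given inserted delay; it and `bracket_slack` are hypothetical names, and the pass condition used in the example (inserted delay strictly smaller than the true slack) is an assumption.

```python
def bracket_slack(path_test, delay_settings_ps):
    """Return (lower, upper) such that lower < slack < upper.

    path_test(d) is assumed to return True when the path still passes
    with d ps of extra delay inserted. `lower` is None if even the
    smallest setting fails; `upper` is None if every setting passes.
    """
    lower, upper = None, None
    for d in sorted(delay_settings_ps):
        if path_test(d):
            lower = d      # largest delay seen so far that still passes
        else:
            upper = d      # smallest delay that fails
            break
    return lower, upper


# Illustrative path with a true slack of 150 ps (unknown to the tester).
slack_range = bracket_slack(lambda d: d < 150, [200, 100])
```

With only two delay settings the range is coarse; more settings narrow it at the cost of more test runs.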
Slack Assessment in Reconfigurable Device
- Selectable delay is inserted in front of pipeline registers.
- TPG: test pattern generator, RA: response analyzer
- Cycle time: 10 ns
[Figure: path under test with TPG, BE1, Reg, BE2, BE3, BE4, Reg, and RA; delay labels 4.9 ns, 2.5 ns, 2.5 ns before replacement and 4.1 ns, 2.5 ns, 2.5 ns after BE2 is replaced]
- Before: path test w/ 200 ps delay fails, so slack < 200 ps.
- Predictive fault avoidance: BE2 is replaced.
- After: path test w/ 800 ps delay succeeds, so slack > 800 ps.
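The numbers on this slide can be checked with simple arithmetic. The sketch assumes that the three delay labels are series delays on one register-to-register path and that setup time and the delay circuit's own overhead are negligible; both are readings of the figure, not statements from the slide. `path_test_passes` is a hypothetical helper.

```python
CYCLE_PS = 10_000  # cycle time: 10 ns


def path_test_passes(be_delays_ps, inserted_delay_ps, cycle_ps=CYCLE_PS):
    """The test passes if path delay plus inserted delay fits in one cycle."""
    return sum(be_delays_ps) + inserted_delay_ps <= cycle_ps


before = [4900, 2500, 2500]   # slack = 10,000 - 9,900 = 100 ps
after = [4100, 2500, 2500]    # BE2 replaced: slack = 10,000 - 9,100 = 900 ps

fails_with_200 = not path_test_passes(before, 200)   # consistent with slack < 200 ps
passes_with_800 = path_test_passes(after, 800)       # consistent with slack > 800 ps
```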
Assumed Architecture
- BE
  - #bits for config per BE: 101
  - #gates for 4x4 array (65nm): 114,421
[Figure: BE block diagram. Labels: 16bit, ALU, EM, MUX, AREG, BREG, FREG, YREG; ports Din1, Din2, Fin, Dout1, Dout2, Fout connected to the North, East, South, and West neighbors]
Applications for Experiments
- FIR filter
  - Cycle time: 8,209 ps (10% longer than critical path)
  - #paths: 1761
  - Time needed for testing: 28 ms
- FFT
  - Cycle time: 5,940 ps (10% longer than critical path)
  - #paths: 440
  - Time needed for testing: 7 ms
- (Clock for config: 1 bit, 10 MHz)
Timing error happens when the delay increase between two tests of the same path is larger than the threshold.
Larger threshold reduces error occurrence, but many paths need to be replaced wastefully.
Example: selectable delay = 10 ps, #paths for assessment = 150
[Figure: timeline alternating standby and active time. Standby periods test paths #1 to #100, then #101 to #140, then #141 to #150 & #1 to #70.]
- Path #1 is tested with slack 11 ps.
- Its delay increases by >11 ps before the next scheduled test of #1, so an error occurs.
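The example above can be reproduced with a small sketch. It assumes paths are tested in round-robin order and that each standby period fits a given number of path tests; the counts 100, 40, and 80 are read off the example, and the function names are hypothetical.

```python
def round_robin_tests(num_paths, tested_per_standby):
    """List the path numbers tested in each standby period (1-based)."""
    next_path = 1
    schedule = []
    for count in tested_per_standby:
        tested = []
        for _ in range(count):
            tested.append(next_path)
            next_path = next_path % num_paths + 1  # wrap around after the last path
        schedule.append(tested)
    return schedule


def timing_error(slack_at_last_test_ps, delay_increase_ps):
    """An error occurs if the delay grows past the slack seen at the last test."""
    return delay_increase_ps > slack_at_last_test_ps


schedule = round_robin_tests(150, [100, 40, 80])
# Path #1 is tested in the first standby period and again only in the third.
retest_period = next(i for i, tested in enumerate(schedule[1:], start=1)
                     if 1 in tested)
error = timing_error(slack_at_last_test_ps=11, delay_increase_ps=12)
```

The gap between two tests of the same path grows with the number of paths and shrinks with longer standby periods, which is why both appear as parameters in the evaluation.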
Another Timing Error Situation
- #paths for slack assessment is reduced
  - To reduce the memory storing config data
- A path not included in slack assessment becomes critical
  - Due to manufacturing variability
- Memory saving trades off against the success probability of fault prediction
Evaluation Metric & Setup
- Metric: success probability
  - No timing errors happen in 10 years
- Setup
  - Parameters to change
    - Threshold slack
    - Lengths of active time and standby time (temporally fluctuating; std. dev. is 30% of each average)
    - #paths for slack assessment
  - Manufacturing variability
    - Std. dev. of gate delay: 5% of average
    - 1,000 devices are virtually fabricated.
  - Aging
    - 30% delay increase in 10 years
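A setup of this kind could be sampled as in the sketch below. The slide gives only the standard deviations, so the Gaussian distributions, the clipping of durations at zero, the number of gates per device, and all function names are assumptions made for illustration.

```python
import random


def fabricate_devices(num_devices, gates_per_device, rel_sigma=0.05, seed=0):
    """Virtually fabricate devices as lists of per-gate delay factors.

    Each factor is drawn from a normal distribution with mean 1.0 and
    std. dev. 5% of the average (the Gaussian shape is an assumption).
    """
    rng = random.Random(seed)
    return [[rng.gauss(1.0, rel_sigma) for _ in range(gates_per_device)]
            for _ in range(num_devices)]


def sample_duration(rng, average_s, rel_sigma=0.30):
    """Draw one active or standby duration; std. dev. is 30% of the average."""
    return max(0.0, rng.gauss(average_s, rel_sigma * average_s))


devices = fabricate_devices(num_devices=1000, gates_per_device=16)
rng = random.Random(1)
active_s = sample_duration(rng, 3600.0)   # avg. active time: 1 hour
standby_s = sample_duration(rng, 0.1)     # avg. standby time: 0.1 s
```

Success probability would then be the fraction of the 1,000 devices that run for 10 simulated years without a timing error.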
Success Probability vs. Threshold Slack (FIR filter, avg. active time 1hour, full paths)
[Plot: Success Probability [%] (0 to 100) vs. Threshold Slack [%] (0.0001 to 0.1); curves for Avg. Standby Time 0.01s, 0.1s, 1s]
- Delay increase in 1hour: 0.00034%, smaller than 0.001% (1ps)
- As threshold slack increases, success prob. increases to 100%.
- Avg. standby time affects success prob.
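The 0.00034% figure is consistent with spreading the 30% delay increase evenly over 10 years. The linear aging model is an assumption here; the slides do not state one, but it reproduces the quoted number. `hourly_increase_percent` is a hypothetical helper.

```python
def hourly_increase_percent(total_increase_percent=30.0, years=10.0):
    """Delay increase per hour, assuming aging is linear in time."""
    hours = years * 365 * 24          # 87,600 hours in 10 years
    return total_increase_percent / hours


per_hour = hourly_increase_percent()  # roughly 0.00034 % per hour
below_threshold = per_hour < 0.001    # smaller than the 0.001% threshold slack
```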
Success Probability vs. Threshold Slack (FFT, avg. active time 1hour, full paths)
[Plot: Success Probability [%] (0 to 100) vs. Threshold Slack [%] (0.0001 to 0.1); curves for Average Standby Time 0.01s, 0.1s, 1s]
- Time needed for all paths: 7ms << 0.1s
Success Probability vs. Threshold Slack (FIR filter, avg. active time 1hour)
- As #paths for slack assessment decreases,
  - Possibility that other paths cause errors increases
  - Time necessary for testing all paths becomes shorter
[Plot: Success Probability [%] (20 to 100) vs. # of Paths for Slack Assessment (10 to 10000); curves for Average Standby Time 0.01s, 0.1s, 1s]
- Reducing #paths for slack assessment improved success probability
Conclusions
- Proposed a scheme for avoiding delay faults in coarse-grained reconfigurable devices
  - Predicts timing faults before errors happen w/ slack assessment
  - Guides BE replacement without causing new timing faults due to replacement
- Experiments show that a small threshold slack (<1ps) is enough for fault prediction