Robust alternatives to best linear unbiased prediction of complex traits
WHY BEST LINEAR UNBIASED PREDICTION (BLUP)?
-Easy to explain
-Flexible and amenable
-Well understood
-Feasible, unpretentious
-Normality is implicit
DRAWBACK: GAUSSIAN RESIDUALS ARE SENSITIVE TO OUTLYING DATA POINTS
-Hampel et al. (1986)
-Rousseeuw and Leroy (1987)
-Lange et al. (1989)
-Seber and Lee (2003)
ACCOMMODATING OUTLIERS
-Discard data with ad-hoc rules, or
-Fit a robust residual distribution
ANIMAL BREEDERS HAVE DONE IT FOR INFERENCE, NOT PREDICTION!
-Strandén and Gianola (1998, 1999)
-Rosa et al. (2003, 2004)
-Kizilkaya et al. (2003)
-Cardoso et al. (2006)
BAYESIAN MCMC USED (ADVANTAGES, DRAWBACKS, PITFALLS)
-Tricky parameters (e.g., degrees of freedom of the t-distribution)
-Intensive computation
-Involved convergence diagnostics
-Monte Carlo error swamping statistical error
-Not practical for routine industry application (explains why BLUP is used, but consider BOLT)
OBJECTIVES
-Present robust alternatives to BLUP
-Model uses t or Laplace (double exponential) residual distributions
-Bayesian, non-MCMC approach
-Evaluation with:
 -wheat (grain yield)
 -Arabidopsis (plant diameter, gene expression, flowering time)
 -Brown Swiss cows (milk yield)
BLUP (BAYESIAN INTERPRETATION)
Model: y = Xb + Zu + e
-Fixed effects b: flat prior
-u ~ N(0, K σ²_u), with K a pedigree, genomic or similarity matrix
-e ~ N(0, I σ²_e); spread parameters (σ²_u, σ²_e)
Given the spread parameters, the Gaussian sampling model yields
CONDITIONAL POSTERIOR MODE (GIVEN SPREAD) = CONDITIONAL POSTERIOR MEAN = BLUP
λ = σ²_e / σ²_u controls regularization (MODEL COMPLEXITY)
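The conditional posterior mode above is the solution of Henderson's mixed-model equations. A minimal sketch with NumPy, using hypothetical toy data (the similarity matrix, sample size, and λ below are all illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (all values hypothetical): n individuals, intercept only, Z = I
n = 8
K = np.eye(n) + 0.2 * np.ones((n, n))   # stand-in similarity matrix
X = np.ones((n, 1))                     # fixed effect: intercept
y = rng.normal(size=n)

lam = 2.0                               # λ = σ²_e / σ²_u (regularization)
Kinv = np.linalg.inv(K)

# Henderson's mixed-model equations; with Gaussian residuals the solution
# is both the conditional posterior mode and the conditional posterior mean.
LHS = np.block([[X.T @ X, X.T],
                [X,       np.eye(n) + lam * Kinv]])
RHS = np.concatenate([X.T @ y, y])
sol = np.linalg.solve(LHS, RHS)
b_hat, u_hat = sol[0], sol[1:]
print(b_hat, u_hat[:3])
```

Larger λ shrinks the elements of u_hat harder toward zero, which is the "model complexity" control noted above.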
TMAP (MAXIMUM A POSTERIORI WITH t-RESIDUALS)
-Residuals assigned a t distribution with scale parameter τ² and degrees of freedom ν
-Sampling model: e_i ~ t(0, τ², ν)
-A mode of the conditional posterior density (given the spread parameters) is located
LOCATE SOME MODE BY ITERATING WITH WEIGHTED EQUATIONS
-W: diagonal matrix with elements w_i = (ν + 1)/(ν + e_i²/τ²), with n_i = 1 if there is one observation per phenotype
-The scale parameter τ² plays the role of the residual variance here
-Weight changes iteratively:
 -smaller when residuals are larger
 -smaller at smaller ν and smaller scale
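The iteration above can be sketched for the zero-mean model y = g + e. All data and spread parameters below are hypothetical toy values; the point is only how the t-weights down-weight an outlier at each pass:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10
K = np.eye(n) + 0.3 * np.ones((n, n))    # hypothetical similarity matrix
y = rng.normal(size=n)
y[0] += 8.0                              # plant one outlier

nu, tau2, lam = 4.0, 1.0, 1.5            # df, scale, λ (illustrative values)
Kinv = np.linalg.inv(K)

g = np.zeros(n)
for _ in range(50):                      # iterate the weighted equations
    e = y - g
    w = (nu + 1.0) / (nu + e**2 / tau2)  # t-weights: small for large residuals
    g = np.linalg.solve(np.diag(w) + lam * Kinv, w * y)

print(w[0], w[1:].min())                 # outlier gets the smallest weight
```

Because the weight of the outlying record shrinks at every pass, its influence on the fitted genetic values is bounded, unlike under Gaussian residuals.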
LMAP (MAXIMUM A POSTERIORI WITH LAPLACE RESIDUALS)
-Sampling model: e_i ~ Laplace(0, b), with density (1/2b) exp(−|e_i|/b)
-A mode of the log-conditional posterior density (given the spread parameters) is located
LOCATE SOME MODE BY ITERATING WITH WEIGHTED EQUATIONS
-W: diagonal matrix with elements w_i = 1/(2|e_i|), with n_i = 1 if there is one observation per phenotype
-Weight changes iteratively:
 -smaller when residuals are larger
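A minimal sketch of the Laplace iteration for the zero-mean model, again with hypothetical toy data. The 1/(2|e_i|) weight follows the standard majorization |e| ≤ e²/(2|e₀|) + |e₀|/2 used for L1 criteria; the small eps guarding division by a near-zero residual is an implementation assumption, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 10
K = np.eye(n) + 0.3 * np.ones((n, n))     # hypothetical similarity matrix
y = rng.normal(size=n)
y[0] += 8.0                               # plant one outlier

lam, eps = 1.5, 1e-8                      # illustrative λ; eps guards |e| ≈ 0
Kinv = np.linalg.inv(K)

g = np.zeros(n)
for _ in range(100):
    e = y - g
    w = 1.0 / (2.0 * np.abs(e) + eps)     # Laplace weights ∝ 1/|e_i|
    g = np.linalg.solve(np.diag(w) + lam * Kinv, w * y)

print(w[0], w[1:].min())                  # outlier gets the smallest weight
```

The Laplace weight decays faster in |e_i| than the t-weight, so LMAP discounts gross outliers even more aggressively than TMAP.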
ZERO-MEAN MODEL (y = g + e)
-BLUP: ĝ = (I + λK⁻¹)⁻¹ y
-TMAP and LMAP: ĝ = (W + λK⁻¹)⁻¹ W y, with W re-computed at each iteration
PREDICTIVE ALGORITHM (e.g., TMAP)
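The algorithm on this slide is not reproduced here. One plausible scheme, sketched under stated assumptions (toy marker data, illustrative ν, τ², λ; the kernel projection K_test,trn K_trn⁻¹ ĝ is a standard device, not necessarily the authors' exact algorithm), is: fit TMAP in the training set by the weighted iteration, then carry the fitted genetic values to test individuals through the relationship matrix:

```python
import numpy as np

rng = np.random.default_rng(3)

n, n_trn = 12, 9
X = rng.normal(size=(n, 30))              # hypothetical marker matrix
K = X @ X.T / 30 + 1e-6 * np.eye(n)       # genomic relationships + jitter
y = rng.normal(size=n)

trn, tst = np.arange(n_trn), np.arange(n_trn, n)
nu, tau2, lam = 4.0, 1.0, 1.0
Kt_inv = np.linalg.inv(K[np.ix_(trn, trn)])

# 1) Fit TMAP in the training set by iterating the weighted equations
g = np.zeros(n_trn)
for _ in range(50):
    e = y[trn] - g
    w = (nu + 1.0) / (nu + e**2 / tau2)
    g = np.linalg.solve(np.diag(w) + lam * Kt_inv, w * y[trn])

# 2) Project the fitted values onto test individuals through the kernel
g_tst = K[np.ix_(tst, trn)] @ Kt_inv @ g
print(g_tst)
```

The same two-step layout applies to LMAP by swapping in the Laplace weights.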
CASE 1: BROWN SWISS TEST-DAY MILK YIELD
-n = 991 cows, pre-corrected daily milk yield
-p = 37,568 SNPs
-Grid of MINQUE guesses: 0.05-0.95 (0.05 increments), followed by MINQUE (all cows)
-GBLUP, LMAP, TMAP (df = 4, 8, 12, 16)
-LMAP and TMAP iterated 300 times (overkill)
-Gianola and Schön (2016) used to calculate LOO predictions indirectly, assuming constant variances
-Bootstrap (15,000 samples) emulated repeated sampling from the joint distribution [predictands, LOO predictions]
FLAG OUTLIERS IN TMAP
Cows ranked as good or bad by GBLUP are not as good or as bad under TMAP (df = 4 or 8)
Bootstrap distribution (b = 15,000 samples) of predictive mean squared error (PMSE) and predictive correlation (PCOR) for GBLUP, TMAP (df = 4) and LMAP at selected genomic heritability values (guesses of 0.05 and 0.50 produced MINQUE estimates of 0.07 and 0.15, respectively): test-day milk yield in Brown Swiss cows.
LMAP BEST, FOLLOWED BY TMAP4 AND THEN BY GBLUP
CASE 2: WHEAT YIELD
-n = 599 inbred lines
-Analyses for 4 different environments
-p = 1,279 allelic markers (DArT)
-Training (n = 300) - testing (n = 299), 200 random repetitions
-GBLUP and ABLUP [additive models]
-TMAP (df = 4, 6, 8) and LMAP [additive models] (200 iterations: overkill)
Distribution (200 replicates, training-testing layout) of predictive mean squared error for BLUP (B), LMAP (L) and TMAP (4, 6, 8 df) for wheat yield in four environments. Genome-based (red) and pedigree-based (blue) distributions.
Distribution (200 replicates, training-testing layout) of predictive correlation for BLUP (B), LMAP (L) and TMAP (4, 6, 8 df) for wheat yield in four environments. Genome-based (red) and pedigree-based (blue) distributions.
Frequency with which a given method had the largest predictive correlation over 200 replications: pedigree (A) based models, wheat (winner marked with *).

YIELD TRAIT     ABLUP   ALMAP   ATMAP4  ATMAP6  ATMAP8
1               0.265   0.370*  0.245   0.020   0.100
2               0.085   0.145   0.120   0.145   0.505*
3               0.120   0.180   0.230   0.170   0.300*
4               0.200   0.140   0.245   0.115   0.300*
5 (1+2)         0.285   0.030   0.105   0.060   0.520*
6 (1+3)         0.235   0.335*  0.265   0.050   0.115
7 (1+4)         0.185   0.270*  0.210   0.095   0.240
8 (2+3)         0.170   0.210   0.130   0.245*  0.245*
9 (2+4)         0.125   0.210   0.170   0.235   0.260*
10 (3+4)        0.265*  0.200   0.095   0.205   0.235
11 (1+2+3)      0.145   0.160   0.200   0.155   0.340*
12 (1+2+4)      0.125   0.200   0.075   0.110   0.490*
13 (1+3+4)      0.130   0.325*  0.140   0.095   0.310
14 (2+3+4)      0.175   0.215   0.110   0.260*  0.240
15 (1+2+3+4)    0.145   0.200   0.110   0.165   0.380*
Frequency with which a given method had the largest predictive correlation over 200 replications: genome (G) based models, wheat (winner marked with *).

YIELD TRAIT     GBLUP   GLMAP   GTMAP4  GTMAP6  GTMAP8
1               0.495*  0.100   0.235   0.065   0.105
2               0.275   0.305*  0.235   0.075   0.110
3               0.255   0.165   0.180   0.095   0.305*
4               0.465*  0.060   0.230   0.055   0.190
5 (1+2)         0.460*  0.080   0.245   0.080   0.135
6 (1+3)         0.540*  0.100   0.175   0.055   0.130
7 (1+4)         0.455*  0.095   0.190   0.075   0.185
8 (2+3)         0.295   0.310*  0.145   0.105   0.145
9 (2+4)         0.310*  0.270   0.160   0.100   0.160
10 (3+4)        0.500*  0.125   0.085   0.080   0.210
11 (1+2+3)      0.465*  0.170   0.170   0.060   0.135
12 (1+2+4)      0.550*  0.090   0.155   0.070   0.135
13 (1+3+4)      0.725*  0.045   0.090   0.005   0.135
14 (2+3+4)      0.385*  0.260   0.120   0.070   0.070
15 (1+2+3+4)    0.565*  0.125   0.075   0.075   0.160
CASE 3: ARABIDOPSIS
-n = 199 accessions (Atwell et al. 2010)
-Flowering time (n = 194), plant diameter (n = 180), FRIGIDA expression (n = 164)
-p = 215,947
-LOO with variances (MINQUE) re-estimated at each training instance
-GBLUP, TMAP (df = 4, 8, 12, 16, 20), LMAP
-50,000 bootstrap samples from [y, predictions]
-PMSE, PCOR, predictive regression (intercept ALPHA, slope BETA)
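The bootstrap evaluation of the four criteria can be sketched as follows. The predictands and LOO predictions below are simulated stand-ins (the true-slope value 0.6 and the reduced B = 2,000 are illustrative; the slides use B = 50,000 on real data):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in predictands y and LOO predictions yhat (hypothetical data)
n, B = 150, 2000
y = rng.normal(size=n)
yhat = 0.6 * y + rng.normal(scale=0.8, size=n)

pmse, pcor, alpha, beta = (np.empty(B) for _ in range(4))
for b in range(B):
    idx = rng.integers(0, n, size=n)        # resample [y, yhat] pairs jointly
    yb, hb = y[idx], yhat[idx]
    pmse[b] = np.mean((yb - hb) ** 2)       # predictive mean squared error
    pcor[b] = np.corrcoef(yb, hb)[0, 1]     # predictive correlation
    C = np.cov(yb, hb)
    beta[b] = C[0, 1] / C[1, 1]             # slope: predictand on predictor
    alpha[b] = yb.mean() - beta[b] * hb.mean()  # intercept

print(pmse.mean(), pcor.mean(), alpha.mean(), beta.mean())
```

Resampling the (predictand, prediction) pairs jointly, rather than each series separately, preserves their dependence, which is what makes the bootstrap distributions of ALPHA and BETA meaningful.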
Bootstrap distribution (b = 50,000 samples) of intercept (ALPHA) and slope (BETA) of regressions of predictands on predictors: flowering time, FRIGIDA expression and plant diameter in Arabidopsis.
Bootstrap distribution (b = 50,000 samples) of predictive mean squared error (PMSE) and predictive correlation (PCOR): flowering time, FRIGIDA expression and plant diameter in Arabidopsis.
Table 1. Fraction of bootstrap samples (50,000) in which GBLUP attained a smaller PMSE or a larger PCOR than either LMAP or TMAP.

TRAIT        vs LMAP  vs TMAP4  vs TMAP8  vs TMAP12  vs TMAP16  vs TMAP20
FLOW  PMSE   0.18     0.00      0.00      0.00       0.00       0.00
      PCOR   0.00     0.00      0.00      0.00       0.00       0.00
FRIG  PMSE   0.55     0.53      0.33      0.35       0.36       0.37
      PCOR   0.43     0.30      0.27      0.29       0.31       0.33
DIAM  PMSE   0.77     0.65      0.59      0.57       0.58       0.55
      PCOR   0.78     0.81      0.82      0.81       0.80       0.80

FLOW: GBLUP uniformly worse. FRIG: GBLUP most often worse. DIAM: GBLUP uniformly better.
CONCLUDING REMARKS
The Bayesian alphabet goes environmental!
-BLUP widely used: simple, understood, feasible, flexible; extensive software available
-Drawback: not robust to outliers
-Simple (GLIM-type) methods presented for t and Laplace residual distributions
-Extends easily to ssBLUP and RKHS
SKEWED RESIDUAL DISTRIBUTIONS
MULTIVARIATE OUTLIERS: UNCHARTED WATERS
-Multiple-trait t version straightforward (Strandén, 1996)
-Multivariate Laplace: not much theory, but see the power exponential family (Gómez et al., 1998)
CHINESE PHILOSOPHY
One can have an army with millions of soldiers, but if their weapon is just a fork, a smaller and better-equipped rival can be more effective in battle (Sun Tzu and Dan Gian, 6th century BC)