Robust alternatives to best linear unbiased prediction of complex traits

WHY BEST LINEAR UNBIASED PREDICTION?
- EASY TO EXPLAIN
- FLEXIBLE
- AMENDABLE
- WELL UNDERSTOOD
- FEASIBLE
- UNPRETENTIOUS
- NORMALITY IS IMPLICIT

DRAWBACK: GAUSSIAN RESIDUALS
- SENSITIVE TO OUTLYING DATA POINTS
- Hampel et al. (1986); Rousseeuw and Leroy (1986); Lange et al. (1989); Seber and Lee (2003)

ACCOMMODATING OUTLIERS
- DISCARD DATA WITH AD-HOC RULES
- FIT ROBUST RESIDUAL DISTRIBUTION

ANIMAL BREEDERS HAVE DONE IT FOR INFERENCE, NOT PREDICTION!
- STRANDÉN AND GIANOLA (1998, 1999)
- ROSA ET AL. (2003, 2004)
- KIZILKAYA ET AL. (2003)
- CARDOSO ET AL. (2006)

BAYESIAN MCMC USED (ADVANTAGES, DRAWBACKS, PITFALLS)
- Tricky parameters (df in the t-distribution)
- Intensive computation
- Involved convergence diagnostics
- Monte Carlo error swamping statistical error
- Not practical for routine industry application (explains why BLUP is used, but consider BOLT)

OBJECTIVES
- PRESENT ROBUST ALTERNATIVES TO BLUP
- MODEL USES t OR LAPLACE (DOUBLE EXPONENTIAL) RESIDUAL DISTRIBUTIONS
- BAYESIAN NON-MCMC APPROACH
- EVALUATION WITH
  - wheat (grain yield)
  - Arabidopsis (plant diameter, gene expression, flowering time)
  - Brown Swiss cows (milk yield)

BLUP (BAYESIAN INTERPRETATION)
- Fixed or flat prior
- Pedigree, genomic, similarity matrix
- Spread parameters
- Bayesian sampling model
- CONDITIONAL POSTERIOR MODE (GIVEN SPREAD) = CONDITIONAL POSTERIOR MEAN
- λ controls regularization (MODEL COMPLEXITY)
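The role of λ can be illustrated numerically. The following is a minimal sketch, assuming the standard mixed model y = Xβ + g + e with g ~ N(0, G·s2g), e ~ N(0, I·s2e) and λ = s2e/s2g. The function name `blup`, the variable names and the simulated markers are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def blup(y, X, G, lam):
    """BLUP of g and GLS estimate of beta for y = X beta + g + e.

    Assumes g ~ N(0, G * s2g), e ~ N(0, I * s2e) and lam = s2e / s2g.
    """
    n = len(y)
    V = G + lam * np.eye(n)                    # covariance of y, up to the factor s2g
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)   # flat prior on beta
    g = G @ np.linalg.solve(V, y - X @ beta)   # conditional posterior mean (= mode)
    return beta, g

# Hypothetical toy data: 50 individuals, 200 simulated markers.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 200))
G = M @ M.T / M.shape[1]                       # genomic-like similarity matrix
X = np.ones((50, 1))                           # intercept only
y = 2.0 + M @ rng.standard_normal(200) / np.sqrt(200) + rng.standard_normal(50)

for lam in (0.1, 1.0, 1e4):
    beta, g = blup(y, X, G, lam)
    print(lam, float(np.linalg.norm(g)))       # g goes to zero as lam grows
```

Working with V = G + λI avoids inverting G, which matters when the similarity matrix is singular or poorly conditioned.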

TMAP (Maximum a posteriori with t-residuals)
- Scale
- Degrees of freedom
- Sampling model
- CONDITIONAL POSTERIOR DENSITY

LOCATE SOME MODE BY ITERATING WITH
- Diagonal matrix of weights
- Scale parameter instead of residual variance here
- Weight changes iteratively
  - smaller when larger residuals
  - smaller at smaller ν and smaller scale
- n_i = 1 if 1 observation per phenotype
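The iteration can be sketched as a reweighted BLUP. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes one record per phenotype and the usual t-model weight w_i = (ν + 1) / (ν + e_i² / scale2), since the slide's own weight formula did not survive in the transcript. The names `tmap`, `s2g` and `scale2` are hypothetical.

```python
import numpy as np

def tmap(y, X, G, s2g, scale2, nu, n_iter=50):
    """Iteratively reweighted search for a conditional posterior mode (t residuals)."""
    n = len(y)
    lam = scale2 / s2g                   # scale parameter replaces the residual variance
    w = np.ones(n)                       # equal weights: the first pass is ordinary BLUP
    for _ in range(n_iter):
        V = G + lam * np.diag(1.0 / w)   # a small weight inflates that record's variance
        Vinv_X = np.linalg.solve(V, X)
        Vinv_y = np.linalg.solve(V, y)
        beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
        g = G @ np.linalg.solve(V, y - X @ beta)
        e = y - X @ beta - g
        w = (nu + 1.0) / (nu + e**2 / scale2)   # assumed weight form (see lead-in)
    return beta, g, w

# Hypothetical toy data with one planted outlier in record 0.
rng = np.random.default_rng(7)
n, p = 60, 200
M = rng.standard_normal((n, p))
G = M @ M.T / p
X = np.ones((n, 1))
y = 1.0 + M @ rng.standard_normal(p) / np.sqrt(p) + rng.standard_normal(n)
y[0] += 20.0

beta, g, w = tmap(y, X, G, s2g=1.0, scale2=1.0, nu=4.0)
print(int(np.argmin(w)), float(w.min()))   # record with the smallest weight
```

Records with small final weights are natural candidates for flagging as outliers, and the weights never exceed (ν + 1) / ν.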

LMAP (Maximum a posteriori with Laplace residuals)
- Sampling model
- LOG-CONDITIONAL POSTERIOR DENSITY

LOCATE SOME MODE BY ITERATING WITH
- Diagonal matrix of weights
- Weight changes iteratively
  - smaller when larger residuals
- n_i = 1 if 1 observation per phenotype
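A comparable sketch for Laplace residuals follows. It assumes independent residuals with density proportional to exp(-|e_i| / b), one record per phenotype, and the resulting reweighting w_i = 1 / |e_i|; the exact parameterization on the slide is not recoverable from the transcript, so the names `lmap`, `b`, `s2g` and the floor `eps` are hypothetical.

```python
import numpy as np

def lmap(y, X, G, s2g, b, n_iter=100, eps=1e-8):
    """Iteratively reweighted search for a conditional posterior mode (Laplace residuals)."""
    n = len(y)
    lam = b / s2g
    abs_e = np.ones(n)                   # equal weights on the first pass
    for _ in range(n_iter):
        # With weights 1/|e_i|, the inverse-weight matrix is simply diag(|e_i|).
        V = G + lam * np.diag(np.maximum(abs_e, eps))
        Vinv_X = np.linalg.solve(V, X)
        Vinv_y = np.linalg.solve(V, y)
        beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
        g = G @ np.linalg.solve(V, y - X @ beta)
        abs_e = np.abs(y - X @ beta - g)
    w = 1.0 / np.maximum(abs_e, eps)     # eps guards against division by zero
    return beta, g, w

# Hypothetical toy data with one planted outlier in record 0.
rng = np.random.default_rng(5)
n, p = 60, 200
M = rng.standard_normal((n, p))
G = M @ M.T / p
X = np.ones((n, 1))
y = 1.0 + M @ rng.standard_normal(p) / np.sqrt(p) + rng.standard_normal(n)
y[0] += 20.0

beta, g, w = lmap(y, X, G, s2g=1.0, b=1.0)
print(int(np.argmin(w)), float(w.min()))   # record with the smallest weight
```

Unlike the t case there is no degrees-of-freedom parameter to tune, which may explain part of the method's appeal.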

ZERO-MEAN MODEL (y = g + e): BLUP, TMAP, LMAP

PREDICTIVE ALGORITHM (e.g., TMAP)

CASE 1: BROWN SWISS TEST-DAY MILK YIELD
- n = 991 cows, pre-corrected daily milk yield
- p = 37,568 SNP
- Grid of genomic heritability guesses for MINQUE: 0.05-0.95 (0.05 increments), followed by MINQUE (all cows)
- GBLUP, LMAP, TMAP (df = 4, 8, 12, 16)
- LMAP and TMAP iterated 300 times (overkill)
- Gianola and Schoen (2016) used to calculate LOO predictions indirectly, assuming constant variances
- Bootstrap (15,000 samples) emulated repeated sampling from joint distribution [predictands, LOO predictions]
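The two computational steps of this case, indirect LOO prediction and a bootstrap of [predictands, LOO predictions], can be sketched as follows. This is a sketch under assumptions: it uses the zero-mean model y = g + e with a fixed λ and the standard leave-one-out identity for linear smoothers, which may differ in detail from the formulas of Gianola and Schoen (2016). The function names and toy data are hypothetical.

```python
import numpy as np

def loo_predictions(y, G, lam):
    """LOO predictions for y = g + e without refitting, holding lam fixed."""
    n = len(y)
    H = np.linalg.solve(G + lam * np.eye(n), G)   # hat matrix: fitted = H @ y
    fitted = H @ y
    h = np.diag(H)
    # LOO residual = ordinary residual / (1 - h_ii)
    return y - (y - fitted) / (1.0 - h)

def bootstrap_pmse_pcor(y, y_hat, n_boot=2000, seed=0):
    """Resample (predictand, prediction) pairs; return PMSE and PCOR draws."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pmse = np.empty(n_boot)
    pcor = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample pairs with replacement
        yb, pb = y[idx], y_hat[idx]
        pmse[b] = np.mean((yb - pb) ** 2)
        pcor[b] = np.corrcoef(yb, pb)[0, 1]
    return pmse, pcor

# Hypothetical toy data.
rng = np.random.default_rng(3)
n, p = 40, 150
M = rng.standard_normal((n, p))
G = M @ M.T / p
y = M @ rng.standard_normal(p) / np.sqrt(p) + rng.standard_normal(n)

y_loo = loo_predictions(y, G, lam=1.0)
pmse, pcor = bootstrap_pmse_pcor(y, y_loo, n_boot=500)
print(float(pmse.mean()), float(pcor.mean()))
```

The shortcut costs one linear solve instead of n refits, which is what makes a grid over heritability guesses affordable.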

FLAG OUTLIERS IN TMAP

COWS RATED GOOD OR BAD BY GBLUP ARE NOT THAT GOOD OR THAT BAD UNDER TMAP (df = 4 OR 8)

Bootstrap distribution (b=15,000 samples) of predictive mean squared error (PMSE) and predictive correlation (PCOR) for GBLUP, TMAP (df=4) and LMAP at selected genomic heritability values (guesses of 0.05 and 0.50 produced MINQUE estimates of 0.07 and 0.15, respectively): test-day milk yield in Brown Swiss cows.

LMAP BEST, FOLLOWED BY TMAP4 AND THEN BY GBLUP

CASE 2: WHEAT YIELD
- n = 599 inbred lines
- Analyses for 4 different environments
- p = 1279 allelic markers (DArT)
- Training (n = 300)-testing (n = 299), 200 random repetitions
- GBLUP and ABLUP [additive models]
- TMAP (df = 4, 6, 8) and LMAP [additive models] (200 iterations: overkill)

Distribution (200 replicates, training-testing layout) of predictive mean squared error for BLUP (B), LMAP (L) and TMAP (4, 6, 8 df) for wheat yield in four environments. Genome-based (red) and pedigree-based (blue) distributions.

Distribution (200 replicates, training-testing layout) of predictive correlation for BLUP (B), LMAP (L) and TMAP (4, 6, 8 df) for wheat yield in four environments. Genome-based (red) and pedigree-based (blue) distributions.

Frequency with which a given method had the largest predictive correlation over 200 replications: pedigree (A) based models, wheat (largest value in each row marked with *).

YIELD TRAIT     ABLUP    ALMAP    ATMAP4   ATMAP6   ATMAP8
1               0.265    0.370*   0.245    0.020    0.100
2               0.085    0.145    0.120    0.145    0.505*
3               0.120    0.180    0.230    0.170    0.300*
4               0.200    0.140    0.245    0.115    0.300*
5 (1+2)         0.285    0.030    0.105    0.060    0.520*
6 (1+3)         0.235    0.335*   0.265    0.050    0.115
7 (1+4)         0.185    0.270*   0.210    0.095    0.240
8 (2+3)         0.170    0.210    0.130    0.245*   0.245*
9 (2+4)         0.125    0.210    0.170    0.235    0.260*
10 (3+4)        0.265*   0.200    0.095    0.205    0.235
11 (1+2+3)      0.145    0.160    0.200    0.155    0.340*
12 (1+2+4)      0.125    0.200    0.075    0.110    0.490*
13 (1+3+4)      0.130    0.325*   0.140    0.095    0.310
14 (2+3+4)      0.175    0.215    0.110    0.260*   0.240
15 (1+2+3+4)    0.145    0.200    0.110    0.165    0.380*

Frequency with which a given method had the largest predictive correlation over 200 replications: genome (G) based models, wheat (largest value in each row marked with *).

YIELD TRAIT     GBLUP    GLMAP    GTMAP4   GTMAP6   GTMAP8
1               0.495*   0.100    0.235    0.065    0.105
2               0.275    0.305*   0.235    0.075    0.110
3               0.255    0.165    0.180    0.095    0.305*
4               0.465*   0.060    0.230    0.055    0.190
5 (1+2)         0.460*   0.080    0.245    0.080    0.135
6 (1+3)         0.540*   0.100    0.175    0.055    0.130
7 (1+4)         0.455*   0.095    0.190    0.075    0.185
8 (2+3)         0.295    0.310*   0.145    0.105    0.145
9 (2+4)         0.310*   0.270    0.160    0.100    0.160
10 (3+4)        0.500*   0.125    0.085    0.080    0.210
11 (1+2+3)      0.465*   0.170    0.170    0.060    0.135
12 (1+2+4)      0.550*   0.090    0.155    0.070    0.135
13 (1+3+4)      0.725*   0.045    0.090    0.005    0.135
14 (2+3+4)      0.385*   0.260    0.120    0.070    0.070
15 (1+2+3+4)    0.565*   0.125    0.075    0.075    0.160

CASE 3: ARABIDOPSIS
- n = 199 accessions (Atwell et al. 2010)
- Flowering time (n = 194), plant diameter (n = 180), FRIGIDA expression (n = 164)
- p = 215,947
- LOO with variances (MINQUE) re-estimated at each training instance
- GBLUP, TMAP (df = 4, 8, 12, 16, 20), LMAP
- 50,000 bootstrap samples from [y, predictions]
- PMSE, PCOR, PREDICTIVE REGRESSION (ALPHA, BETA)
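The predictive regression can be sketched as an ordinary least-squares fit of predictands on predictions, repeated over bootstrap resamples of the pairs. This is a minimal illustration on made-up data; the function names are hypothetical and the slides do not state how the regressions were computed. Under the usual reading, well-calibrated predictions give ALPHA near 0 and BETA near 1.

```python
import numpy as np

def predictive_regression(y, y_hat):
    """Intercept (alpha) and slope (beta) of the regression of y on y_hat."""
    slope, intercept = np.polyfit(y_hat, y, deg=1)   # highest degree first
    return intercept, slope

def bootstrap_alpha_beta(y, y_hat, n_boot=1000, seed=0):
    """Bootstrap distribution of (alpha, beta) from resampled pairs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample pairs with replacement
        draws[b] = predictive_regression(y[idx], y_hat[idx])
    return draws

# Hypothetical predictions that are too spread out, so the slope falls below 1.
rng = np.random.default_rng(2)
signal = rng.standard_normal(150)
y_hat = 2.0 * signal
y = signal + 0.5 * rng.standard_normal(150)

draws = bootstrap_alpha_beta(y, y_hat, n_boot=500)
print(draws.mean(axis=0))                            # average (alpha, beta)
```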

Bootstrap distribution (b=50,000 samples) of intercept (ALPHA) and slope (BETA) of regressions of predictands on predictors: flowering time, FRIGIDA expression and plant diameter in Arabidopsis.

Bootstrap distribution (b=50,000 samples) of predictive mean squared error (PMSE) and predictive correlation (PCOR): flowering time, FRIGIDA expression and plant diameter in Arabidopsis.

Table 1. Fraction of bootstrap samples (50,000) in which GBLUP attained a smaller PMSE or a larger PCOR than LMAP or TMAP.

                GBLUP vs
                LMAP    TMAP4   TMAP8   TMAP12  TMAP16  TMAP20
FLOW: GBLUP UNIFORMLY WORSE
  PMSE          0.18    0       0       0       0.00    0.00
  PCOR          0       0       0       0       0       0
FRIG: GBLUP MOST OFTEN WORSE
  PMSE          0.55    0.53    0.33    0.35    0.36    0.37
  PCOR          0.43    0.30    0.27    0.29    0.31    0.33
DIAM: GBLUP UNIFORMLY BETTER
  PMSE          0.77    0.65    0.59    0.57    0.58    0.55
  PCOR          0.78    0.81    0.82    0.81    0.80    0.80

CONCLUDING REMARKS
The Bayesian alphabet goes environmental!
- BLUP WIDELY USED: SIMPLE, UNDERSTOOD, FEASIBLE, FLEXIBLE
- EXTENSIVE SOFTWARE AVAILABLE
- DRAWBACK: NOT ROBUST TO OUTLIERS
- SIMPLE (GLIM-TYPE) METHODS PRESENTED FOR t AND LAPLACE RESIDUAL DISTRIBUTIONS
- EXTENDS EASILY TO ssBLUP AND RKHS

SKEWED RESIDUAL DISTRIBUTIONS

MULTIVARIATE OUTLIERS: UNCHARTED WATERS
- MULTIPLE-TRAIT t VERSION STRAIGHTFORWARD (STRANDÉN, 1996)
- MULTIVARIATE LAPLACE: NOT MUCH THEORY, BUT SEE GOMEZ et al. (1998): power exponential family

CHINESE PHILOSOPHY
One can have an army with millions of soldiers, but if their weapon is just a fork, a smaller and better equipped rival can be more effective in battle (Sun Tzu and Dan Gian, 6th century BC)