A Unified Regularized Group PLS Algorithm Scalable to Big Data


A Unified Regularized Group PLS Algorithm Scalable to Big Data
Pierre Lafaye de Micheaux 1, Benoit Liquet 2, Matthew Sutton 3
21 October, 2016
1 CREST, ENSAI. 2 Université de Pau et des Pays de l'Adour, LMAP. 3 Queensland University of Technology, Brisbane, Australia.
Big Data PLS Methods JSTAR 2016, Rennes 1/54

Contents
1. Motivation: Integrative Analysis for group data
2. Application on a HIV vaccine study
3. PLS approaches: SVD, PLS-W2A, canonical, regression
4. Sparse Models: Lasso penalty, Group penalty, Group and Sparse Group PLS
5. R package: sgpls
6. Regularized PLS Scalable to BIG-DATA
7. Concluding remarks
Big Data PLS Methods JSTAR 2016, Rennes 2/54

Integrative Analysis
Wikipedia: data integration involves combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, both commercial and scientific.
Systems biology: integrative analysis is the analysis of heterogeneous types of data from inter-platform technologies.
Goal: combine multiple types of data to contribute to a better understanding of biological mechanisms, and to improve the diagnosis and treatment of complex diseases.
Big Data PLS Methods JSTAR 2016, Rennes 3/54

Example: Data definition
Two blocks of data measured on the same n observations: X (n observations, p variables) and Y (n observations, q variables).
Omics. Y matrix: gene expression; X matrix: SNP (single nucleotide polymorphism). Many others such as proteomic, metabolomic data.
Neuroimaging. Y matrix: behavioural variables; X matrix: brain activity (e.g., EEG, fMRI, NIRS).
Neuroimaging Genetics. Y matrix: DTI (Diffusion Tensor Imaging); X matrix: SNP.
Big Data PLS Methods JSTAR 2016, Rennes 4/54

Data: Constraints and Aims
Main constraint: collinearity among the variables, or situations with p > n or q > n. But p and q are supposed to be not too large.
Two aims:
1. Symmetric situation. Analyse the association between two blocks of information. Analysis focused on shared information.
2. Asymmetric situation. X matrix = predictors and Y matrix = response variables. Analysis focused on prediction.
Partial Least Squares family: dimension reduction approaches.
PLS finds pairs of latent vectors ξ = Xu, ω = Yv with maximal covariance, e.g., ξ = u_1 SNP_1 + u_2 SNP_2 + ... + u_p SNP_p. It handles both the symmetric and the asymmetric situation, through a matrix decomposition of X and Y into successive latent variables.
Latent variables are not directly observed but are inferred (through a mathematical model) from other variables that are observed (directly measured). They capture an underlying phenomenon (e.g., health).
Big Data PLS Methods JSTAR 2016, Rennes 5/54

PLS and sparse PLS
Classical PLS. Output of PLS: H pairs of latent variables (ξ_h, ω_h), h = 1, ..., H. Reduction method (H << min(p, q)). But no variable selection for extracting the most relevant (original) variables from each latent variable.
sparse PLS. Sparse PLS selects the relevant SNPs: some coefficients u_l are equal to 0, e.g.
  ξ_h = u_1 SNP_1 + u_2 SNP_2 + u_3 SNP_3 + ... + u_p SNP_p with u_2 = u_3 = 0.
The spls components are linear combinations of the selected variables.
Big Data PLS Methods JSTAR 2016, Rennes 6/54

Group structures within the data
Natural example: categorical variables form a group of dummy variables in a regression setting.
Genomics: genes within the same pathway have similar functions and act together in regulating a biological system. These genes can add up to have a larger effect, which can be detected as a group (i.e., at a pathway or gene set/module level).
We consider that variables are divided into groups.
Example: p SNPs grouped into K genes (X_j = SNP_j):
  X = [ SNP_1, ..., SNP_k | SNP_{k+1}, SNP_{k+2}, ..., SNP_h | ... | SNP_{l+1}, ..., SNP_p ],
where the first block of columns is gene 1, the second block is gene 2, ..., and the last block is gene K.
Example: p genes grouped into K pathways/modules (X_j = gene_j):
  X = [ X_1, X_2, ..., X_k | X_{k+1}, X_{k+2}, ..., X_h | ... | X_{l+1}, X_{l+2}, ..., X_p ],
with blocks M_1, M_2, ..., M_K.
Big Data PLS Methods JSTAR 2016, Rennes 7/54
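For illustration, such a group structure can be encoded in R as a simple list of column-index vectors (a minimal sketch with made-up group sizes, independent of the ind.block.x argument of the sgPLS package mentioned later in the talk):

    # Hypothetical grouping: 50 SNPs split into 5 genes of 10 SNPs each
    p <- 50
    ind.groups <- split(1:p, rep(1:5, each = 10))   # ind.groups[[k]] = column indices of gene k
    str(ind.groups)

The same list-of-indices representation is reused in the groupwise update sketches further below.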

Group PLS
Aim: select groups of variables taking into account the data structure.
PLS components: ξ_h = u_1 X_1 + u_2 X_2 + u_3 X_3 + ... + u_p X_p.
sparse PLS components (spls): individual weights are set to zero (e.g., u_2 = u_3 = 0), without using the group structure.
group PLS components (gpls): the variables are arranged in modules, e.g. module 1 = {X_1, X_2}, module 2 = {X_3, X_4, X_5}, ..., module K = {..., X_{p-1}, X_p}, and whole modules are switched off: here u_1 = u_2 = 0 (module 1 dropped), u_3, u_4, u_5 ≠ 0 (module 2 kept), ..., u_{p-1} = u_p = 0 (module K dropped).
gpls selects groups of variables: either all the variables within a group are selected or none of them are selected... but it does not achieve sparsity within each group...
Big Data PLS Methods JSTAR 2016, Rennes 8/54

Sparse Group PLS
Aim: combine sparsity of groups and sparsity within each group.
Example: X matrix = genes. We might be interested in identifying particularly important genes within pathways of interest.
sparse PLS components (spls): individual weights are set to zero (e.g., u_2 = u_3 = 0), without using the group structure.
group PLS components (gpls): whole modules are switched on or off (e.g., modules 1 and K dropped, module 2 kept with all of its weights non-zero).
sparse group PLS components (sgpls): whole modules are switched off and, within the selected modules, only some variables keep a non-zero weight (e.g., module 2 is kept but only u_3 ≠ 0, with u_4 = u_5 = 0).
Big Data PLS Methods JSTAR 2016, Rennes 9/54

Aims in a regression setting
Select groups of variables taking into account the data structure: either all the variables within a group are selected, or none of them are.
Combine sparsity of groups and sparsity within each group: only the relevant variables within a selected group are retained.
Big Data PLS Methods JSTAR 2016, Rennes 10/54

Illustration: Dendritic Cells in Addition to Antiretroviral Treatment (DALIA) trial Evaluation of the safety and the immunogenicity of a vaccine on n = 19 HIV-1 infected patients. The vaccine was injected on weeks 0, 4, 8 and 12 while patients received an antiretroviral therapy. An interruption of the antiretrovirals was performed at week 24. After vaccination, a deep evaluation of the immune response was performed at week 16. Repeated measurements of the main immune markers and gene expression were performed every 4 weeks until the end of the trials. Big Data PLS Methods JSTAR 2016, Rennes 11/54

DALIA trial: Question? First results obtained using group of genes Significant change of gene expression among 69 modules over time before antiretroviral treatment interruption. How does the gene abundance of these 69 modules as measured at week 16 correlate with immune markers measured at week 16? Big Data PLS Methods JSTAR 2016, Rennes 12/54

spls, gpls and sgpls Response variables Y= immune markers composed of q = 7 cytokines (IL21, IL2, IL13, IFNg, Luminex score, TH1 score, CD4). Predictor variables X= expression of p = 5399 genes extracted from the 69 modules. Use the structure of the data (modules) for gpls and sgpls. Each gene belongs to one of the 69 modules. Asymmetric situation. Big Data PLS Methods JSTAR 2016, Rennes 13/54

Results: Modules and number of genes selected p = 5399 ; 24 modules selected by gpls or sgpls on 3 scores Big Data PLS Methods JSTAR 2016, Rennes 14/54

Results: Modules and number of genes selected Big Data PLS Methods JSTAR 2016, Rennes 15/54

Results: Venn diagram
sgpls selects slightly more genes than spls (487 vs. 420 genes selected), but fewer modules (21 vs. 64 groups of genes). Note: all 21 groups of genes selected by sgpls were included in those selected by spls.
sgpls selects slightly more modules than gpls (4 more, 14/21 in common). However, gpls leads to more genes selected than sgpls (944).
In this application, the sgpls approach led to a parsimonious selection of modules and genes that appears biologically very relevant.
Chaussabel's functional modules: http://www.biir.net/public wikis/module annotation/v2 Trial 8 Modules
Big Data PLS Methods JSTAR 2016, Rennes 16/54

Stability of the variable selection (100 bootstrap samples)
[Figure: selection frequency of each module over 100 bootstrap samples, one barplot per procedure (gpls, sgpls and spls, component 1), with modules coloured by functional annotation (apoptosis/survival, cell cycle, cytotoxic/NK cell, inflammation, T cells, etc.) and split into "Selected" / "Not selected".]
Stability of the variable selection assessed on 100 bootstrap samples on DALIA-1 trial data, for the gpls, sgpls and spls procedures respectively. For each procedure, the modules selected on the original sample are separated from those that were not.
Big Data PLS Methods JSTAR 2016, Rennes 17/54

Now some mathematics... Big Data PLS Methods JSTAR 2016, Rennes 18/54

PLS family PLS = Partial Least Squares or Projection to Latent Structures Four main methods coexist in the literature: (i) Partial Least Squares Correlation (PLSC) also called PLS-SVD; (ii) PLS in mode A (PLS-W2A, for Wold s Two-Block, Mode A PLS); (iii) PLS in mode B (PLS-W2B) also called Canonical Correlation Analysis (CCA); (iv) Partial Least Squares Regression (PLSR, or PLS2). (i),(ii) and (iii) are symmetric while (iv) is asymmetric. Different objective functions to optimise. Good news: all use the singular value decomposition (SVD). Big Data PLS Methods JSTAR 2016, Rennes 19/54

Singular Value Decomposition (SVD)
Definition 1. Let M be a p × q matrix of rank r:
  M = U Δ V^T = Σ_{l=1}^r δ_l u_l v_l^T,   (1)
where U = (u_l) (p × p) and V = (v_l) (q × q) are two orthogonal matrices containing the normalised left (resp. right) singular vectors, and Δ = diag(δ_1, ..., δ_r, 0, ..., 0) contains the ordered singular values δ_1 ≥ δ_2 ≥ ... ≥ δ_r > 0.
Note: fast and efficient algorithms exist to solve the SVD.
Big Data PLS Methods JSTAR 2016, Rennes 20/54
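As a quick illustration (a minimal sketch, not part of the slides), the first singular triplet and the rank-one approximation can be obtained in R with the base svd() function:

    set.seed(1)
    M <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)   # toy p x q matrix
    s <- svd(M)                                     # s$d: singular values, s$u / s$v: singular vectors
    u1 <- s$u[, 1]; v1 <- s$v[, 1]; delta1 <- s$d[1]
    M1 <- delta1 * tcrossprod(u1, v1)               # best rank-1 approximation delta_1 u_1 v_1^T
    sum((M - M1)^2)                                 # equals sum(s$d[-1]^2), cf. Eckart-Young below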

Connexion between SVD and maximum covariance
We were able to describe the optimization problem of the four PLS methods as:
  (u*, v*) = argmax_{||u||_2 = ||v||_2 = 1} Cov(X_{h-1} u, Y_{h-1} v),   h = 1, ..., H.
Matrices X_h and Y_h are obtained recursively from X_{h-1} and Y_{h-1}. The four methods differ by the deflation process, chosen so that the above scores or weight vectors satisfy given constraints.
The solution at step h is obtained by computing only the first triplet (δ_1, u_1, v_1) of singular elements of the SVD of M_{h-1} = X_{h-1}^T Y_{h-1}:
  (u*, v*) = (u_1, v_1).
Why is this useful?
Big Data PLS Methods JSTAR 2016, Rennes 21/54

SVD properties
Theorem 2 (Eckart-Young, 1936). The (truncated) SVD of a given matrix M (of rank r) provides the best reconstitution (in a least-squares sense) of M by a matrix of lower rank k:
  min_{A of rank k} ||M - A||_F^2 = ||M - Σ_{l=1}^k δ_l u_l v_l^T||_F^2 = Σ_{l=k+1}^r δ_l^2.
If the minimum is searched over matrices A of rank 1, i.e. of the form ũṽ^T where ũ and ṽ are non-zero vectors, we obtain
  min_{ũ,ṽ} ||M - ũṽ^T||_F^2 = Σ_{l=2}^r δ_l^2 = ||M - δ_1 u_1 v_1^T||_F^2.
Big Data PLS Methods JSTAR 2016, Rennes 22/54

SVD properties
Thus, solving
  argmin_{ũ,ṽ} ||M_{h-1} - ũṽ^T||_F^2   (2)
and norming the resulting vectors gives us u_1 and v_1. This is another approach to solving the PLS optimization problem.
Big Data PLS Methods JSTAR 2016, Rennes 23/54

Towards sparse PLS
Shen and Huang (2008) connected (2) (in a PCA context) to least-squares minimisation in regression:
  ||M_{h-1} - ũṽ^T||_F^2 = ||vec(M_{h-1}) - (I_q ⊗ ũ)ṽ||_2^2 = ||vec(M_{h-1}) - (ṽ ⊗ I_p)ũ||_2^2,
each term being of the form ||y - Xβ||_2^2. This makes it possible to use many existing variable selection techniques based on regularization penalties.
We propose iterative alternating algorithms to find normed vectors ũ/||ũ|| and ṽ/||ṽ|| that minimise the following penalised sum-of-squares criterion
  ||M_{h-1} - ũṽ^T||_F^2 + P_λ(ũ, ṽ),
for various penalization terms P_λ(ũ, ṽ). We obtain several sparse versions (in terms of the weights u and v) of the four methods (i)-(iv).
Big Data PLS Methods JSTAR 2016, Rennes 24/54
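This vectorisation identity is easy to check numerically in base R (a minimal sketch with arbitrary M, u, v; kronecker() is the Kronecker product):

    p <- 5; q <- 4
    M <- matrix(rnorm(p * q), p, q)
    u <- matrix(rnorm(p), ncol = 1); v <- matrix(rnorm(q), ncol = 1)
    f1 <- sum((M - u %*% t(v))^2)                       # ||M - u v^T||_F^2
    f2 <- sum((c(M) - kronecker(diag(q), u) %*% v)^2)   # ||vec(M) - (I_q (x) u) v||_2^2
    f3 <- sum((c(M) - kronecker(v, diag(p)) %*% u)^2)   # ||vec(M) - (v (x) I_p) u||_2^2
    c(f1, f2, f3)                                       # all three coincide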

Sparse PLS models
For cases (i)-(iv), the aim is to obtain sparse weight vectors u_h and v_h, with associated component scores (i.e., latent variables)
  ξ_h := X_{h-1} u_h and ω_h := Y_{h-1} v_h,   h = 1, ..., H,
for a small number of components.
Recursive procedure with an objective function involving X_{h-1} and Y_{h-1}; decomposition (approximation) of the original matrices X and Y:
  X = Ξ_H C_H^T + F_{X,H},   Y = Ω_H D_H^T + F_{Y,H},   (3)
where Ξ_H = (ξ_h) and Ω_H = (ω_h). For the regression mode, we have the multivariate linear regression model Y = X B_PLS + E, with B_PLS = U_H (C_H^T U_H)^{-1} D_H^T and E a matrix of residuals.
Big Data PLS Methods JSTAR 2016, Rennes 25/54

Example case (ii): PLS-W2A
Definition 3. The objective function at step h is
  (u_h, v_h) = argmax_{||u||_2 = ||v||_2 = 1} Cov(X_{h-1} u, Y_{h-1} v),
subject to the constraints
  Cov(ξ_h, ξ_j) = Cov(ω_h, ω_j) = 0, 1 ≤ j < h.
In order to satisfy these constraints, the deflation is
  X_h = P^⊥_{ξ_h} X_{h-1} and Y_h = P^⊥_{ω_h} Y_{h-1},   (X_0 = X, Y_0 = Y),
where P^⊥_a denotes projection onto the orthogonal complement of a, and ξ_h (resp. ω_h) is the score built from the first left (resp. right) singular vector obtained by applying an SVD to M_{h-1} := X_{h-1}^T Y_{h-1}, h = 1, ..., H.
Big Data PLS Methods JSTAR 2016, Rennes 26/54

Regression mode (iv): PLSR, PLS2
The aim of this asymmetric model is prediction. PLS2 finds latent variables that model X and simultaneously predict Y. The difference with PLS-W2A is the deflation step:
  X_h = P^⊥_{ξ_h} X_{h-1} and Y_h = P^⊥_{ξ_h} Y_{h-1}.
Big Data PLS Methods JSTAR 2016, Rennes 27/54

The algorithm: main steps of the iterative algorithm
1. X_0 = X, Y_0 = Y, h = 1.
2. M_{h-1} := X_{h-1}^T Y_{h-1}.
3. SVD: extraction of the first pair of singular vectors u_h and v_h.
4. Sparsity step: produces sparse weights u_sparse and v_sparse.
5. Latent variables: ξ_h = X_{h-1} u_sparse and ω_h = Y_{h-1} v_sparse.
6. Slope coefficients:
   c_h = X_{h-1}^T ξ_h / (ξ_h^T ξ_h)  (both modes)
   d_h = Y_{h-1}^T ξ_h / (ξ_h^T ξ_h)  (PLSR, regression mode)
   e_h = Y_{h-1}^T ω_h / (ω_h^T ω_h)  (PLS mode A)
7. Deflation:
   X_h = X_{h-1} - ξ_h c_h^T  (both modes)
   Y_h = Y_{h-1} - ξ_h d_h^T  (PLSR, regression mode)
   Y_h = Y_{h-1} - ω_h e_h^T  (PLS mode A)
8. If h = H stop; else h = h + 1 and go to step 2.
Big Data PLS Methods JSTAR 2016, Rennes 28/54
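A compact R sketch of these steps for the regression mode, with no sparsity step (step 4 would shrink u and v); function and variable names are illustrative, and this is not the sgPLS or bigsgpls implementation:

    pls_regression <- function(X, Y, H = 2) {
      Xh <- scale(X, scale = FALSE)               # X_0, Y_0: centred data
      Yh <- scale(Y, scale = FALSE)
      U <- V <- Xi <- NULL
      for (h in 1:H) {
        M <- crossprod(Xh, Yh)                    # step 2: M_{h-1} = t(X_{h-1}) %*% Y_{h-1}
        s <- svd(M, nu = 1, nv = 1)               # step 3: first pair of singular vectors
        u <- s$u[, 1]; v <- s$v[, 1]              # (step 4, sparsity, would go here)
        xi <- Xh %*% u                            # step 5: latent variable xi_h
        c_h <- crossprod(Xh, xi) / sum(xi^2)      # step 6: slope coefficients
        d_h <- crossprod(Yh, xi) / sum(xi^2)
        Xh <- Xh - xi %*% t(c_h)                  # step 7: deflation (both modes)
        Yh <- Yh - xi %*% t(d_h)                  #         deflation (regression mode)
        U <- cbind(U, u); V <- cbind(V, v); Xi <- cbind(Xi, xi)
      }
      list(U = U, V = V, scores = Xi)
    }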

Introducing sparsity Sparsity implies many zeros in a vector or a matrix. (Credits: Jun Liu, Shuiwang Ji, and Jieping Ye) Big Data PLS Methods JSTAR 2016, Rennes 29/54

Introducing sparsity
Let θ be the model parameters to be estimated. A commonly employed estimation method is
  min_θ [ loss(θ) + λ penalty(θ) ].
This is equivalent to
  min_θ loss(θ) subject to the constraint penalty(θ) ≤ z (for some z).
Example: loss(θ) = 0.5 ||θ - v||_2^2 for some fixed vector v.
Big Data PLS Methods JSTAR 2016, Rennes 30/54

Why does L_1 induce sparsity? Analysis in 1D (comparison with L_2)
L_1 penalty: minimise 0.5 (θ - v)^2 + λ|θ|. Solution:
  if v ≥ λ, θ* = v - λ; if v ≤ -λ, θ* = v + λ; else θ* = 0 (sparsity!).
The objective is nondifferentiable at 0.
L_2 penalty: minimise 0.5 (θ - v)^2 + λθ^2. Solution: θ* = v / (1 + 2λ). No sparsity here; the objective is differentiable at 0.
Big Data PLS Methods JSTAR 2016, Rennes 31/54
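These closed-form 1D solutions are easy to verify numerically; a minimal R sketch (the values are arbitrary):

    soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)   # soft-thresholding operator
    v <- 0.3; lambda <- 0.5
    soft(v, lambda)        # L1 solution: 0 here, i.e. exactly sparse
    v / (1 + 2 * lambda)   # L2 (ridge) solution: shrunk towards 0 but never exactly 0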

Why does L 1 induce sparsity? Understanding from the projection Big Data PLS Methods JSTAR 2016, Rennes 32/54

Why does L 1 induce sparsity? Understanding from constrained optimization Big Data PLS Methods JSTAR 2016, Rennes 33/54

sparse PLS (spls)
In spls, the optimisation problem to solve is
  min_{u_h, v_h} ||M_h - u_h v_h^T||_F^2 + P_{λ_1,h}(u_h) + P_{λ_2,h}(v_h),
where ||M_h - u_h v_h^T||_F^2 = Σ_{i=1}^p Σ_{j=1}^q (m_ij - u_ih v_jh)^2, M_h = X_h^T Y_h for each iteration h,
  P_{λ_1,h}(u_h) = Σ_{i=1}^p 2λ_1^h |u_i| and P_{λ_2,h}(v_h) = Σ_{j=1}^q 2λ_2^h |v_j|.
Iterative solution: apply the soft-thresholding function g_soft(x, λ) = sign(x)(|x| - λ)_+ componentwise
  to the vector M_h v_h to get u_h,
  to the vector M_h^T u_h to get v_h.
Big Data PLS Methods JSTAR 2016, Rennes 34/54
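One alternating spls update can thus be written in a few lines of R (a sketch only, assuming lambda1 and lambda2 are small enough that the thresholded vectors are not entirely zero; this is not the sgPLS implementation):

    spls_update <- function(M, lambda1, lambda2, max.iter = 100, tol = 1e-06) {
      g_soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)
      s <- svd(M, nu = 1, nv = 1)
      u <- s$u[, 1]; v <- s$v[, 1]                      # SVD initialisation
      for (i in 1:max.iter) {
        u.new <- g_soft(M %*% v, lambda1)               # threshold M v ...
        u.new <- u.new / sqrt(sum(u.new^2))             # ... and norm to get u
        v.new <- g_soft(crossprod(M, u.new), lambda2)   # threshold t(M) u ...
        v.new <- v.new / sqrt(sum(v.new^2))             # ... and norm to get v
        if (sum((u.new - u)^2) + sum((v.new - v)^2) < tol) break
        u <- u.new; v <- v.new
      }
      list(u = drop(u.new), v = drop(v.new))
    }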

group PLS (gpls)
X and Y can be divided respectively into K and L sub-matrices (groups) X^(k): n × p_k and Y^(l): n × q_l.
Following the same idea as Yuan and Lin (2006), we use group lasso penalties:
  P_{λ_1}(u) = λ_1 Σ_{k=1}^K √p_k ||u^(k)||_2 and P_{λ_2}(v) = λ_2 Σ_{l=1}^L √q_l ||v^(l)||_2,
where u^(k) (resp. v^(l)) is the weight sub-vector associated with the k-th (resp. l-th) block.
In gpls, the optimisation problem to solve is
  min_{u,v} Σ_{k=1}^K Σ_{l=1}^L ||M^(k,l) - u^(k) v^(l)T||_F^2 + P_{λ_1}(u) + P_{λ_2}(v),
with M^(k,l) = X^(k)T Y^(l).
Remark: if the k-th block contains only one variable, then ||u^(k)||_2 = √((u^(k))^2) = |u^(k)|, so the group penalty reduces to a lasso penalty on that variable.
Big Data PLS Methods JSTAR 2016, Rennes 35/54

group PLS (gpls)
The previous objective function can be written as
  Σ_{k=1}^K { ||M^(k,·) - u^(k) v^T||_F^2 + λ_1 √p_k ||u^(k)||_2 } + P_{λ_2}(v),
where M^(k,·) = X^(k)T Y. For fixed v, we can therefore optimise over the groupwise components of u separately. The first term above expands as
  trace[M^(k,·) M^(k,·)T] - 2 trace[u^(k) v^T M^(k,·)T] + trace[u^(k) u^(k)T]
(using ||v||_2 = 1). The optimal u^(k) thus minimises
  trace[u^(k) u^(k)T] - 2 trace[u^(k) v^T M^(k,·)T] + λ_1 √p_k ||u^(k)||_2.
This objective function is convex, so the optimal solution is characterised by the subgradient equations (0 must belong to the subdifferential).
Big Data PLS Methods JSTAR 2016, Rennes 36/54

Subdifferential
Subderivative, subgradient, and subdifferential generalize the derivative to functions that are not differentiable (e.g., |x| is nondifferentiable at 0). The subdifferential of a function is set-valued.
Blue: convex function (nondifferentiable at x_0). Slope of each red line = a subderivative at x_0. The set [a, b] of all subderivatives is called the subdifferential of the function f at x_0. If f is convex and its subdifferential at x_0 contains exactly one subderivative, then f is differentiable at x_0.
Big Data PLS Methods JSTAR 2016, Rennes 37/54

We have
  a = lim_{x → x_0^-} (f(x) - f(x_0)) / (x - x_0) and b = lim_{x → x_0^+} (f(x) - f(x_0)) / (x - x_0).
Example: consider the convex function f(x) = |x|. The subdifferential at the origin is the interval [a, b] = [-1, 1]. The subdifferential at any point x_0 < 0 is the singleton set {-1}, while the subdifferential at any point x_0 > 0 is the singleton set {1}.
Big Data PLS Methods JSTAR 2016, Rennes 38/54

For group k, u^(k) must satisfy that the subdifferential contains zero:
  -2u^(k) + 2M^(k,·)v = λ_1 √p_k θ,   (4)
where θ is a subgradient of ||u^(k)||_2 evaluated at u^(k), so
  θ = u^(k) / ||u^(k)||_2 if u^(k) ≠ 0;  θ ∈ {θ : ||θ||_2 ≤ 1} if u^(k) = 0.
We can see that the subgradient equations (4) are satisfied with u^(k) = 0 if
  ||M^(k,·)v||_2 ≤ 2^{-1} λ_1 √p_k.   (5)
For u^(k) ≠ 0, equation (4) gives
  -2u^(k) + 2M^(k,·)v = λ_1 √p_k u^(k) / ||u^(k)||_2.   (6)
Combining equations (5) and (6), we find
  u^(k) = ( 1 - λ_1 √p_k / (2 ||M^(k,·)v||_2) )_+ M^(k,·)v,   k = 1, ..., K,   (7)
where (a)_+ = max(a, 0).
Big Data PLS Methods JSTAR 2016, Rennes 39/54

In the same vein, optimisation over v for fixed u is also obtained by optimising over the groupwise components:
  v^(l) = ( 1 - λ_2 √q_l / (2 ||M^(·,l)T u||_2) )_+ M^(·,l)T u,   l = 1, ..., L.   (8)
We thus obtain the following theorem.
Big Data PLS Methods JSTAR 2016, Rennes 40/54

group PLS (gpls)
Theorem 4. The solution of the group PLS optimisation problem is given by
  u^(k) = ( 1 - λ_1 √p_k / (2 ||M^(k,·)v||_2) )_+ M^(k,·)v   (for fixed v)
and
  v^(l) = ( 1 - λ_2 √q_l / (2 ||M^(·,l)T u||_2) )_+ M^(·,l)T u   (for fixed u).
Note: we iterate until convergence of u^(k) and v^(l), using alternately one of the above formulas.
Big Data PLS Methods JSTAR 2016, Rennes 41/54
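Theorem 4 translates directly into a groupwise update; a minimal R sketch for the u-step (ind.groups is a list giving, for each group, the indices of the corresponding X-variables; names are illustrative and the resulting vector would still be normed in the full algorithm):

    gpls_update_u <- function(M, v, ind.groups, lambda1) {
      # M = t(X) %*% Y (p x q); v = current right weight vector
      u <- numeric(nrow(M))
      for (k in seq_along(ind.groups)) {
        idx <- ind.groups[[k]]
        Mkv <- M[idx, , drop = FALSE] %*% v                               # M^(k,.) v
        shrink <- 1 - lambda1 * sqrt(length(idx)) / (2 * sqrt(sum(Mkv^2)))
        u[idx] <- max(shrink, 0) * Mkv                                    # whole group zeroed when shrink <= 0
      }
      u
    }

The v-step is symmetric, using t(M), the Y-groups and lambda2.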

sparse group PLS: sparsity within groups
Following Simon et al. (2013), we introduce sparse group lasso penalties:
  P_{λ_1}(u) = (1 - α_1) λ_1 Σ_{k=1}^K √p_k ||u^(k)||_2 + α_1 λ_1 ||u||_1,
  P_{λ_2}(v) = (1 - α_2) λ_2 Σ_{l=1}^L √q_l ||v^(l)||_2 + α_2 λ_2 ||v||_1.
Big Data PLS Methods JSTAR 2016, Rennes 42/54

sparse group PLS (sgpls)
Theorem 5. The solution of the sparse group PLS optimisation problem is given by
  u^(k) = 0 if ||g_soft(M^(k,·)v, λ_1 α_1 / 2)||_2 ≤ λ_1 (1 - α_1) √p_k, and otherwise
  u^(k) = (1/2) ( 1 - λ_1 (1 - α_1) √p_k / (2 ||g_soft(M^(k,·)v, λ_1 α_1 / 2)||_2) ) g_soft(M^(k,·)v, λ_1 α_1 / 2).
Similarly, v^(l) = 0 if ||g_soft(M^(·,l)T u, λ_2 α_2 / 2)||_2 ≤ λ_2 (1 - α_2) √q_l, and otherwise
  v^(l) = (1/2) ( 1 - λ_2 (1 - α_2) √q_l / (2 ||g_soft(M^(·,l)T u, λ_2 α_2 / 2)||_2) ) g_soft(M^(·,l)T u, λ_2 α_2 / 2).
The proof is similar (see our paper in Bioinformatics, 2016).
Big Data PLS Methods JSTAR 2016, Rennes 43/54
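The corresponding sgpls u-step simply soft-thresholds M^(k,·)v before the groupwise shrinkage; a sketch under the same assumptions as the gpls snippet above, following the constants of Theorem 5 as stated:

    sgpls_update_u <- function(M, v, ind.groups, lambda1, alpha1) {
      g_soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)
      u <- numeric(nrow(M))
      for (k in seq_along(ind.groups)) {
        idx <- ind.groups[[k]]
        w <- g_soft(M[idx, , drop = FALSE] %*% v, lambda1 * alpha1 / 2)   # within-group sparsity
        norm.w <- sqrt(sum(w^2))
        if (norm.w > lambda1 * (1 - alpha1) * sqrt(length(idx)))          # group-level test
          u[idx] <- 0.5 * (1 - lambda1 * (1 - alpha1) * sqrt(length(idx)) / (2 * norm.w)) * w
      }
      u
    }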

R package: sgpls
The sgpls package implements the spls, gpls and sgpls methods: http://cran.r-project.org/web/packages/sgpls/index.html
It also includes functions for choosing the tuning parameters related to the predictor matrix for the different sparse PLS models (regression mode).
Some simple code to perform a sgpls fit:
  model.sgPLS <- sgPLS(X, Y, ncomp = 2, mode = "regression",
                       keepX = c(4, 4), keepY = c(4, 4),
                       ind.block.x = ind.block.x, ind.block.y = ind.block.y,
                       alpha.x = c(0.5, 0.5), alpha.y = c(0.5, 0.5))
The latest version also includes sparse group Discriminant Analysis.
Big Data PLS Methods JSTAR 2016, Rennes 44/54

Regularized PLS scalable for BIG-DATA
What happens in a MASSIVE DATA SET context?
Massive datasets: the size of the data is large, and analysing it takes a significant amount of time and computer memory.
Emerson & Kane (2012): a dataset is considered large if it exceeds 20% of the RAM (Random Access Memory) on a given machine, and massive if it exceeds 50%.
Big Data PLS Methods JSTAR 2016, Rennes 45/54

Case of a lot of observations: two massive data sets
X: n × p matrix and Y: n × q matrix, massive due to a large number of observations. We suppose here that n is very large, but not p nor q.
The PLS algorithm is mainly based on the SVD of M_{h-1} = X_{h-1}^T Y_{h-1}.
Dimension of M_{h-1}: a p × q matrix!! This matrix fits into memory, but X and Y do not.
Big Data PLS Methods JSTAR 2016, Rennes 46/54

Computation of M = X^T Y by chunks
  M = X^T Y = Σ_{g=1}^G X_(g)^T Y_(g),
where X_(g) and Y_(g) denote the g-th block of rows of X and Y. All terms fit (successively) into memory!
Big Data PLS Methods JSTAR 2016, Rennes 47/54

Computation of M = X^T Y by chunks using R
No need to load the big matrices X and Y.
Use memory-mapped files ("file-backing") through the bigmemory package to allow matrices to exceed the RAM size. A big.matrix object is created, which supports the use of shared memory for efficiency in parallel computing.
foreach: package for running the computation of M by chunks in parallel.
Regularized PLS algorithm: the components ("scores") Xu (n × 1) and Yv (n × 1) are also easy to compute by chunks and to store in a big.matrix object.
Big Data PLS Methods JSTAR 2016, Rennes 48/54
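A sketch of this chunked computation of M = X^T Y with bigmemory and foreach (descriptor file names, chunk size and the doParallel backend are illustrative; this is not the forthcoming bigsgpls code):

    library(bigmemory)
    library(foreach)
    library(doParallel)
    registerDoParallel(cores = 4)

    # X and Y are assumed to be file-backed big.matrix objects already created on disk,
    # with descriptor files "X.desc" and "Y.desc"
    X.bm <- attach.big.matrix("X.desc")
    n <- nrow(X.bm)
    chunks <- split(seq_len(n), ceiling(seq_len(n) / 1e5))        # row blocks of 100,000 observations

    M <- foreach(rows = chunks, .combine = "+", .packages = "bigmemory") %dopar% {
      Xg <- attach.big.matrix("X.desc")                           # re-attach inside each worker
      Yg <- attach.big.matrix("Y.desc")
      crossprod(Xg[rows, ], Yg[rows, ])                           # t(X_(g)) %*% Y_(g), a small p x q matrix
    }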

Illustration of group PLS with Big-Data
Simulated data: X (5 GB) and Y (5 GB); n = 560,000 observations, p = 400 and q = 500.
The two blocks are linked by two latent variables, made up of sparse linear combinations of the original variables.
Both X and Y have a group structure: 20 groups of 20 variables for X and 25 groups of 20 variables for Y.
Only 4 groups in each data set are relevant, and 5 variables in each of these groups are not relevant.
Big Data PLS Methods JSTAR 2016, Rennes 49/54

Figure 1: Comparison of gpls and BIG-gPLS (for small n = 1,000) Big Data PLS Methods JSTAR 2016, Rennes 50/54

Figure 2: Use of BIG-gPLS. Left: small n. Right: Large n. Blue: truth. Red: Recovered. Big Data PLS Methods JSTAR 2016, Rennes 51/54

Regularised PLS Discriminant Analysis Categorical response variable becomes a dummy matrix in PLS algorithms: Big Data PLS Methods JSTAR 2016, Rennes 52/54

Concluding Remarks and Take Home Message
We were able to derive a simple unified algorithm that performs the standard, sparse, group and sparse group versions of the four classical PLS algorithms (i)-(iv). (And also PLSDA.)
We used big-memory objects and a simple trick that makes our procedure scalable to big data (large n). We also parallelized the code for faster computation.
This will soon be made available in our new R package: bigsgpls.
Eager to apply this to real neuroimaging data sets!
We are currently working on a batch version of this algorithm, as well as a large-n and large-p version of it.
Big Data PLS Methods JSTAR 2016, Rennes 53/54

References
Yuan M. and Lin Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.
Simon N., Friedman J., Hastie T. and Tibshirani R. (2013). A Sparse-Group Lasso. Journal of Computational and Graphical Statistics, 22(2), 231-245.
Liquet B., Lafaye de Micheaux P., Hejblum B. and Thiebaut R. (2016). Group and Sparse Group Partial Least Square Approaches Applied in Genomics Context. Bioinformatics, 32(1), 35-42.
Lafaye de Micheaux P., Liquet B. and Sutton M. A Unified Parallel Algorithm for Regularized Group PLS Scalable to Big Data (in progress).
Thank you! Questions?
Big Data PLS Methods JSTAR 2016, Rennes 54/54