Regularized Linear Models in Stacked Generalization
Sam Reid and Greg Grudic
Department of Computer Science, University of Colorado at Boulder, USA
June 11, 2009
How to combine classifiers?
- Which classifiers, and how should they be combined?
- AdaBoost and Random Forest prescribe both the classifiers and the combiner.
- We want ~1000 heterogeneous classifiers.
- Combiner options: vote, average, forward stepwise selection, linear, nonlinear?
- Our combiner: a regularized linear model.
Outline
1 Introduction: How to combine classifiers?
2 Model: Stacked Generalization; StackingC; Linear Regression and Regularization
3 Experiments: Setup; Results; Discussion
Stacked Generalization
- The combiner is produced by a classification algorithm.
- Its training set = base-classifier predictions on unseen data + true labels.
- The combiner learns to compensate for base-classifier biases.
- Combiners may be linear or nonlinear.
- What classification algorithm should be used?
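The level-1 training procedure described above can be sketched as follows. This is a minimal illustration using scikit-learn and synthetic data, not the authors' exact pipeline; the base models, dataset, and combiner are arbitrary stand-ins:

```python
# Stacked generalization sketch: base-classifier predictions on *unseen*
# (out-of-fold) data + true labels form the combiner's training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bases = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier(5)]

# Level-1 training inputs: out-of-fold posterior predictions from each base.
meta_tr = np.hstack([cross_val_predict(b, X_tr, y_tr, cv=5,
                                       method="predict_proba")
                     for b in bases])

# Refit the bases on all level-0 data, then build level-1 test inputs.
for b in bases:
    b.fit(X_tr, y_tr)
meta_te = np.hstack([b.predict_proba(X_te) for b in bases])

# The combiner is itself produced by a classification algorithm.
combiner = LogisticRegression(max_iter=1000).fit(meta_tr, y_tr)
acc = combiner.score(meta_te, y_te)
```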
Stacked Generalization - Combiners
- Wolpert, 1992: "relatively global, smooth" combiners
- Ting & Witten, 1999: linear regression combiners
- Seewald, 2002: low-dimensional combiner inputs
Problems
- Caruana et al., 2004: "Stacking performs poorly because regression overfits dramatically when there are 2000 highly correlated input models and only 1k points in the validation set."
- How can we scale stacking up to a large number of classifiers?
- Our hypothesis: a regularized linear combiner will reduce variance, prevent overfitting, and increase accuracy.
Posterior Predictions in Multiclass Classification
[Figure: posterior predictions p(y|x) as a function of x; classification with d = 4, k = 3.]
Ensemble Methods for Multiclass Classification
[Figure: multiple classifier system with 2 classifiers; the combiner computes ŷ = y(x₁, x₂) from the base outputs x₁ = y₁(x), x₂ = y₂(x).]
Stacked Generalization
[Figure: stacked generalization with 2 classifiers; base outputs y₁(x), y₂(x) feed a learned combiner that produces ŷ.]
Classification via Regression
[Figure: stacking using classification via regression; one regression output per class (y_A, y_B, y_C), each fed by all base-classifier outputs y₁(x), y₂(x).]
StackingC
[Figure: StackingC, class-conscious stacked generalization; each per-class regressor y_A(x_A), y_B(x_B), y_C(x_C) sees only the base posteriors for its own class.]
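The class-conscious restriction can be sketched as follows: one linear model per class, each trained only on the base classifiers' posterior columns for that class, with the final label taken as the argmax over class scores. The posteriors here are synthetic, and ridge regression is an arbitrary stand-in for the linear learner:

```python
# StackingC sketch: per-class linear models over class-restricted inputs.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, n_bases, k = 300, 4, 3
y = rng.integers(0, k, size=n)

# Simulated base-classifier posteriors, shape (n, n_bases * k); columns are
# ordered [base0: class0..k-1, base1: class0..k-1, ...].
probs = rng.dirichlet(np.ones(k) * 2, size=(n, n_bases)).reshape(n, -1)

scores = np.zeros((n, k))
for c in range(k):
    cols = np.arange(n_bases) * k + c          # class-c posteriors only
    model = Ridge(alpha=1.0).fit(probs[:, cols], (y == c).astype(float))
    scores[:, c] = model.predict(probs[:, cols])

pred = scores.argmax(axis=1)                   # final multiclass decision
```

Restricting each subproblem to its own class's columns is what keeps the combiner's input dimension at (number of classifiers) rather than (number of classifiers × number of classes).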
Linear Models
- Linear model for use in Stacking or StackingC:
    ŷ = β₀ + Σ_{i=1..d} βᵢ xᵢ
- Least squares: L = ||y − Xβ||²
- Problems: high variance, overfitting, ill-posed problem, poor accuracy.
Regularization
- Increase bias a little, decrease variance a lot.
- Constraining the weights reduces flexibility and prevents overfitting.
- Penalty terms in our studies:
    Ridge regression: L = ||y − Xβ||² + λ||β||₂²
    Lasso regression: L = ||y − Xβ||² + λ||β||₁
    Elastic net regression: L = ||y − Xβ||² + λ(α||β||₁ + (1 − α)||β||₂²)
- Lasso and the elastic net produce sparse models.
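The variance problem these penalties address shows up in a toy comparison of ordinary least squares against closed-form ridge on two nearly collinear inputs, mimicking highly correlated base classifiers. The data and λ below are made up for illustration:

```python
# OLS vs. ridge on nearly collinear inputs: OLS weights blow up along the
# near-null direction; ridge shrinks them toward stable values.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # nearly collinear "classifier"
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

# Ordinary least squares: unique but ill-conditioned solution.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression, closed form: (X'X + lam*I)^{-1} X'y.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The ridge coefficient vector always has smaller norm than the least-squares one, which is exactly the bias-for-variance trade the slide describes.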
Model
- About 1000 base classifiers making probabilistic predictions.
- Stacked generalization to create the combiner.
- StackingC to reduce dimensionality.
- Convert multiclass to regression; use linear regression.
- Regularization on the weights.
Datasets
Table: Datasets and their properties

Dataset               Attributes  Instances  Classes
balance-scale                  4        625        3
glass                          9        214        6
letter                        16       4000       26
mfeat-morphological            6       2000       10
optdigits                     64       5620       10
sat-image                     36       6435        6
segment                       19       2310        7
vehicle                       18        846        4
waveform-5000                 40       5000        3
yeast                          8       1484       10
Base Classifiers
About 1000 base classifiers for each problem:
1 Neural network
2 Support vector machine (C-SVM from LibSVM)
3 K-nearest neighbor
4 Decision stump
5 Decision tree
6 AdaBoost.M1
7 Bagging classifier
8 Random Forest (Weka)
9 Random Forest (R)
Results

Dataset     select-best    vote  average  sg-linear  sg-lasso  sg-ridge
balance          0.9872  0.9234   0.9265     0.9399    0.9610    0.9796
glass            0.6689  0.5887   0.6167     0.5275    0.6429    0.7271
letter           0.8747  0.8400   0.8565     0.5787    0.6410    0.9002
mfeat            0.7426  0.7390   0.7320     0.4534    0.4712    0.7670
optdigits        0.9893  0.9847   0.9858     0.9851    0.9660    0.9899
sat-image        0.9140  0.8906   0.9024     0.8597    0.8940    0.9257
segment          0.9768  0.9567   0.9654     0.9176    0.6147    0.9799
vehicle          0.7905  0.7991   0.8133     0.6312    0.7716    0.8142
waveform         0.8534  0.8584   0.8624     0.7230    0.6263    0.8599
yeast            0.6205  0.6024   0.6105     0.2892    0.4218    0.5970
Statistical Analysis
Pairwise Wilcoxon signed-rank tests:
- Ridge outperforms unregularized at p ≤ 0.002; lasso outperforms unregularized at p ≤ 0.375. This validates the hypothesis: regularization improves accuracy.
- Ridge outperforms lasso at p ≤ 0.0019: dense techniques outperform sparse techniques.
- Ridge outperforms select-best at p ≤ 0.084: a properly trained combiner beats the single best classifier.
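The ridge-vs-unregularized comparison can be reproduced from the results table with SciPy's paired Wilcoxon signed-rank test; the two lists below are the sg-ridge and sg-linear accuracy columns:

```python
# Paired Wilcoxon signed-rank test over the 10 datasets.
from scipy.stats import wilcoxon

sg_ridge = [0.9796, 0.7271, 0.9002, 0.7670, 0.9899,
            0.9257, 0.9799, 0.8142, 0.8599, 0.5970]
sg_linear = [0.9399, 0.5275, 0.5787, 0.4534, 0.9851,
             0.8597, 0.9176, 0.6312, 0.7230, 0.2892]

# Ridge wins on all 10 datasets, so the exact two-sided p-value is
# 2 / 2^10 ≈ 0.00195, matching the p <= 0.002 reported on the slide.
stat, p = wilcoxon(sg_ridge, sg_linear)
```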
Baseline Algorithms
- Average outperforms vote at p ≤ 0.014: probabilistic predictions are valuable.
- Select-best outperforms average at p ≤ 0.084: validation/training is valuable.
Subproblem/Overall Accuracy - I
[Figure: RMSE (0.06-0.24) versus ridge parameter (10⁻⁹ to 10³) on a subproblem.]
Subproblem/Overall Accuracy - II
[Figure: overall accuracy (0.85-0.93) versus ridge parameter (10⁻⁹ to 10³).]
Subproblem/Overall Accuracy - III
[Figure: overall accuracy (0.85-0.94) versus RMSE on subproblem 1 (0.06-0.24).]
Accuracy for Elastic Nets
[Figure: overall accuracy on sat-image versus penalty (10⁻⁵ to 5) for elastic net at α = 0.95, 0.5, and 0.05, with select-best shown for comparison.]
Partial Ensemble Selection
- Sparse techniques perform partial ensemble selection: they choose among individual (classifier, class) predictions, not just whole classifiers.
- This allows classifiers to focus on subproblems.
- Example: benefit from a classifier that is good at separating A from B but poor at A/C and B/C.
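The partial-selection behavior can be demonstrated with a sparse combiner on synthetic data: per-class fits keep different input columns, and uninformative columns are zeroed exactly. The data are made up, and scikit-learn's Lasso stands in for the elastic net at high α:

```python
# Sparse per-class combiners select different inputs for different classes.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, k = 500, 3
y = rng.integers(0, k, size=n)

# One synthetic "posterior" column informative only about class 0,
# plus five pure-noise columns.
good = (y == 0).astype(float) + 0.1 * rng.normal(size=n)
X = np.column_stack([good, rng.normal(size=(n, 5))])

# One sparse linear model per subproblem (class-vs-rest indicator target).
coef_per_class = [Lasso(alpha=0.05).fit(X, (y == c).astype(float)).coef_
                  for c in range(k)]

# For class 0 the informative column survives with a large positive weight
# while the noise columns are zeroed exactly; other classes may reuse the
# same column with a different (negative) weight, or drop it.
```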
Partial Selection
[Figure: Coefficient profiles for the first three subproblems in StackingC for the sat-image dataset with elastic net regression at α = 0.95. Each panel plots coefficient values against log λ; only a small fraction of the ~1000 inputs receive nonzero weight, and different classifiers enter for different classes.]
Selected Classifiers

Classifier         red     cotton  grey   damp   veg     v.damp  total
adaboost-500       0.063   0       0.014  0.000  0.0226  0       0.100
ann-0.5-32-1000    0       0       0.061  0.035  0       0.004   0.100
ann-0.5-16-500     0.039   0       0      0.018  0.009   0.034   0.101
ann-0.9-16-500     0.002   0.082   0      0      0.007   0.016   0.108
ann-0.5-32-500     0.000   0.075   0      0.100  0.027   0       0.111
knn-1              0       0       0.076  0.065  0.008   0.097   0.246

Table: Selected posterior probabilities and corresponding weights for the sat-image problem for elastic net StackingC with α = 0.95, for the 6 models with highest total weight.
Conclusions
- Regularization is essential in linear StackingC.
- A trained linear combination outperforms select-best.
- Dense combiners outperform sparse combiners.
- Sparse models allow classifiers to specialize in subproblems.
Future Work
- Examine full Bayesian solutions.
- Constrain coefficients to be positive.
- Choose a single regularizer for all subproblems.
Acknowledgments
- PhET Interactive Simulations
- Turing Institute
- UCI Repository
- University of Colorado at Boulder
Questions?