Accuracy of imputed 50k genotypes from 3k and 6k chips using FImpute version 2

Sargolzaei, M.1,2, Schenkel, F.2 and Chesnais, J.1

1 L'Alliance Boviteq, Saint-Hyacinthe, QC, Canada
2 University of Guelph, Centre for Genetic Improvement of Livestock, Guelph, ON, Canada

Introduction

The accuracy of GPA depends on many factors, and the quality of the input data is one of the most important. In genomic evaluation, two sources of data are used: 1) phenotypes (or proofs derived from them) and 2) genotypes. Traditional proofs (EBVs) used in the estimation set tend to be quite reliable. Genotypes received from the laboratory are also very accurate.

Four different bovine SNP chips are currently commercially available, namely HD, 50kV1, 50kV2 and 3k. A new 6k SNP chip will be available soon. The accuracy of HD and 50k genotypes is extremely high. However, the GoldenGate technology used in the 3k chip results in around 1% genotyping errors. The 3k panel has been used extensively for inexpensive genotyping of a large number of animals, so that more producers can afford the cost of genotyping. However, genotypes imputed from the 3k SNP chip contain more errors, which may affect GPA accuracy for certain animals.

Imputation from 3k has been very successful for animals with genotyped immediate ancestors (in Holstein, most animals are in this group). However, animals with missing pedigree or ungenotyped parents seem to have lower imputation accuracy from 3k to 50k, mainly because of genotyping errors and the difficulty of determining the gametic phase using the 3k panel. The density of the 3k panel is sufficient for capturing close linkage (family) information but not high enough to capture short-range linkage disequilibrium. Imputation accuracy is therefore mainly influenced by the density of the low-density (LD) chip in relation to the number of generations separating an animal from its genotyped ancestors.
With a denser LD panel, more information can be recovered from distant 50k or HD genotyped ancestors. Therefore, a new LD panel (6k) has been developed by Illumina with Infinium technology to reduce the genotyping error rate and increase genome coverage. In the present study, the accuracy of imputation from 6k to 50k was assessed and compared with that from the 3k SNP chip.

Materials and Methods

Holstein (HO), Jersey (JE) and Brown Swiss (BS) data sets from the Canadian Dairy Network (CDN) August genomic run were used. There were 42,503 usable SNP on the 50k panel. Table 1 shows simple statistics for each breed.

The data set was divided into a reference and a validation group. The validation group included 50k genotyped animals born after 2009 for HO and JE and after 2008 for BS. Three scenarios for validation animals were considered:

1. 2,641 SNP were kept and the rest of the genotypes were set to missing. A 1% error rate was randomly simulated on the validation animals' genotypes.
2. 2,641 SNP were kept and the rest of the genotypes were set to missing (no errors were simulated).
3. 6,701 SNP were kept and the rest of the genotypes were set to missing.

The reference group for family imputation included all other animals, including 3k animals. The reference group for population imputation consisted of 2,000 individuals for HO and JE, and 1,553 for BS.

The genotypes of validation animals (3k/6k) were imputed to 50k based on the information from the reference animals using FImpute version 2. Imputation was done in 3 steps:

a. Genotypes known with certainty were filled.
b. Family imputation was carried out.
c. Population imputation based on haplotypes from the second step was performed.

After imputation, correct, incorrect and missing call rates were computed for all originally called SNP (low density SNP were included).

Table 1 - Statistics

Breed        Total    50k     3k      Val     No. ped cnfl *  No. ref **
Holstein     106,437  67,160  39,277  20,031  893             2,000
Jersey       13,248   5,786   7,462   1,289   239             2,000
Brown Swiss  2,335    2,031   305     209     9               1,553

* No. pedigree conflicts removed (Mendelian error rate >2%).
** No. of reference individuals for population imputation

Results:

Table 2 - Overall imputation accuracy (%) - 3k vs 6k

Breed        3k+1% error  6k     Gain
Holstein     97.81        99.47  1.66
Jersey       97.07        99.12  2.05
Brown Swiss  95.91        98.97  3.06

Dairy Cattle Breeding and Genetics Committee Meeting, September 13, 2011.
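The validation design described in Materials and Methods (mask the 50k genotypes down to a low-density panel, optionally add genotyping errors, then score the imputed calls) can be illustrated with a minimal Python sketch. It is not FImpute itself: the 0/1/2 genotype coding, the `MISSING` sentinel, the function names and the way an erroneous genotype is drawn are all assumptions made for illustration, since the report does not describe how the 1% errors were generated.

```python
import random

MISSING = -1  # hypothetical sentinel for a missing genotype call


def mask_to_low_density(genotypes, kept):
    """Keep genotypes at the low-density SNP positions; set the rest to missing."""
    kept = set(kept)
    return [g if i in kept else MISSING for i, g in enumerate(genotypes)]


def add_genotyping_errors(genotypes, rate, rng):
    """Replace each called genotype by a different one with probability `rate`
    (an assumed error model, not taken from the report)."""
    noisy = []
    for g in genotypes:
        if g != MISSING and rng.random() < rate:
            g = rng.choice([alt for alt in (0, 1, 2) if alt != g])
        noisy.append(g)
    return noisy


def call_rates(true, imputed):
    """Return (correct, incorrect, missing) percentages over originally called SNP."""
    pairs = [(t, m) for t, m in zip(true, imputed) if t != MISSING]
    n = len(pairs)
    missing = sum(1 for _, m in pairs if m == MISSING)
    correct = sum(1 for t, m in pairs if m == t)
    incorrect = n - correct - missing
    return tuple(100.0 * x / n for x in (correct, incorrect, missing))


# Toy example: 10 SNP coded 0/1/2, of which positions 0 and 5 are on the LD panel.
true_50k = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
low_density = mask_to_low_density(true_50k, kept=[0, 5])
noisy = add_genotyping_errors(low_density, rate=0.01, rng=random.Random(1))

# A made-up imputation result with one wrong call and one uncalled SNP.
imputed = [0, 1, 2, 1, 0, 2, 1, 0, 0, MISSING]
rates = call_rates(true_50k, imputed)
```

In this toy case `rates` is 80% correct, 10% incorrect and 10% missing. As in the report, the low-density SNP themselves are included in the denominator.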
Table 3 - Imputation accuracy (%) for different scenarios - Holstein

                          3k (no error)                3k + 1% error                6k
Sire     Dam      No.     Correct  Incorrect  Missing  Correct  Incorrect  Missing  Correct  Incorrect  Missing
50k      50k      12,593  99.17    0.83       0.00     98.77    1.23       0.00     99.71    0.29       0.00
50k      3k       903     97.94    1.88       0.18     97.53    2.27       0.20     98.99    0.77       0.24
50k      0k       6,081   97.17    2.83       0.01     96.29    3.71       0.01     99.17    0.83       0.00
50k      Unknown  19      95.12    4.88       0.00     93.99    6.00       0.01     98.55    1.45       0.00
3k       50k      34      96.48    3.50       0.02     95.98    3.97       0.05     98.02    1.94       0.05
3k       3k       18      96.50    3.42       0.08     95.96    3.92       0.13     97.98    1.92       0.10
0k       50k      121     96.61    3.36       0.03     95.96    4.01       0.03     98.65    1.34       0.01
0k       3k       11      93.67    6.19       0.14     92.87    6.95       0.18     97.41    2.40       0.19
0k       0k       157     90.33    9.66       0.02     88.58    11.40      0.02     96.92    3.07       0.01
Unknown  50k      21      95.85    4.13       0.01     94.70    5.28       0.03     98.79    1.21       0.00
Unknown  3k       4       94.22    5.59       0.19     92.94    6.85       0.21     98.22    1.67       0.11
Unknown  0k       35      90.09    9.90       0.00     88.31    11.68      0.01     96.77    3.21       0.02
Unknown  Unknown  34      92.46    7.53       0.01     90.32    9.67       0.01     97.99    2.01       0.00
Overall           20,031  98.37    1.61       0.01     97.81    2.18       0.01     99.47    0.52       0.01
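The Overall row of Table 3 can be cross-checked against the individual sire/dam classes. The short Python sketch below takes the animal counts and the "Correct" percentages from the last column group of Table 3 (the one whose Overall value, 99.47, equals the Holstein 6k accuracy in Table 2) and computes their animal-weighted mean. It assumes every animal contributes the same number of scored SNP; the variable and function names are illustrative only.

```python
# (number of animals, % correct) per sire/dam class, last column group of Table 3.
classes_6k = [
    (12593, 99.71), (903, 98.99), (6081, 99.17), (19, 98.55),
    (34, 98.02), (18, 97.98),
    (121, 98.65), (11, 97.41), (157, 96.92),
    (21, 98.79), (4, 98.22), (35, 96.77), (34, 97.99),
]


def weighted_mean(rows):
    """Mean of the percentages, weighted by the number of animals in each class."""
    total = sum(n for n, _ in rows)
    return sum(n * pct for n, pct in rows) / total


n_animals = sum(n for n, _ in classes_6k)
overall_6k = weighted_mean(classes_6k)
print(n_animals, round(overall_6k, 2))
```

The class counts add up to the 20,031 validation animals, and the weighted mean rounds to the 99.47 shown in the Overall row. The overall figure is therefore dominated by the large class of animals with both parents genotyped on 50k.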
Table 4 - Number of animals with high error rate and missing rate for different scenarios - Holstein

                          3k (no error)              3k + 1% error              6k
Sire     Dam      No.     >5%Err   >10%Err  >5%Miss  >5%Err   >10%Err  >5%Miss  >5%Err   >10%Err  >5%Miss
50k      50k      12,593  5        1        0        5        1        0        2        0        0
50k      3k       903     7        3        0        7        3        0        3        0        0
50k      0k       6,081   470      15       0        1,574    21       0        13       3        0
50k      Unknown  19      5        0        0        15       1        0        0        0        0
3k       50k      34      8        3        0        9        3        0        5        0        0
3k       3k       18      3        1        0        3        2        0        3        0        0
0k       50k      121     24       14       0        32       15       0        11       0        0
0k       3k       11      5        1        0        7        2        0        1        0        0
0k       0k       157     144      65       0        151      103      0        15       4        0
Unknown  50k      21      4        0        0        15       0        0        0        0        0
Unknown  3k       4       2        0        0        4        0        0        0        0        0
Unknown  0k       35      31       11       0        35       19       0        4        2        0
Unknown  Unknown  34      32       2        0        34       11       0        0        0        0
Overall           20,031  740      116      0        1,891    181      0        57       9        0
Table 5 - Imputation accuracy (%) for different scenarios - Jersey

                          3k (no error)                3k + 1% error                6k
Sire     Dam      No.     Correct  Incorrect  Missing  Correct  Incorrect  Missing  Correct  Incorrect  Missing
50k      50k      477     99.14    0.86       0.00     98.80    1.20       0.00     99.67    0.33       0.00
50k      3k       298     97.57    2.20       0.23     97.18    2.57       0.25     98.75    0.97       0.29
50k      0k       463     96.65    3.33       0.02     95.80    4.18       0.02     98.94    1.06       0.00
3k       3k       2       97.08    2.54       0.38     96.78    2.82       0.40     98.43    1.22       0.35
3k       0k       3       95.85    4.13       0.02     94.79    5.19       0.03     98.73    1.22       0.05
0k       50k      8       95.55    4.45       0.00     94.33    5.66       0.00     98.38    1.62       0.00
0k       3k       4       90.27    9.56       0.17     89.30    10.49      0.20     95.59    4.25       0.16
0k       0k       25      92.04    7.95       0.01     90.50    9.47       0.04     97.44    2.55       0.01
Unknown  0k       6       92.68    7.32       0.00     91.19    8.81       0.00     97.97    2.02       0.01
Unknown  Unknown  3       92.94    6.85       0.21     91.59    8.10       0.31     97.73    2.23       0.04
Overall           1,289   97.64    2.30       0.06     97.07    2.87       0.07     99.12    0.82       0.07
Table 6 - Number of animals with high error rate and missing rate for different scenarios - Jersey

                          3k (no error)              3k + 1% error              6k
Sire     Dam      No.     >5%Err   >10%Err  >5%Miss  >5%Err   >10%Err  >5%Miss  >5%Err   >10%Err  >5%Miss
50k      50k      477     0        0        0        0        0        0        0        0        0
50k      3k       298     3        0        0        5        0        0        0        0        0
50k      0k       463     30       3        0        87       3        0        3        2        0
3k       3k       2       0        0        0        0        0        0        0        0        0
3k       0k       3       0        0        0        2        0        0        0        0        0
0k       50k      8       1        1        0        4        1        0        0        0        0
0k       3k       4       3        2        0        4        2        0        2        0        0
0k       0k       25      22       4        0        25       7        0        2        0        0
Unknown  0k       6       5        1        0        6        1        0        1        0        0
Unknown  Unknown  3       3        0        0        3        0        0        0        0        0
Overall           1,289   67       11       0        136      14       0        8        2        0
Table 7 - Imputation accuracy (%) for different scenarios - Brown Swiss

                          3k (no error)                3k + 1% error                6k
Sire     Dam      No.     Correct  Incorrect  Missing  Correct  Incorrect  Missing  Correct  Incorrect  Missing
50k      50k      25      98.97    1.03       0.00     98.61    1.39       0.00     99.57    0.43       0.00
50k      3k       5       97.46    2.42       0.12     97.09    2.79       0.12     98.71    1.00       0.29
50k      0k       164     96.75    3.24       0.01     95.89    4.11       0.00     99.00    1.00       0.00
0k       50k      1       95.85    4.13       0.02     95.41    4.57       0.02     99.00    1.00       0.00
0k       0k       14      92.67    7.33       0.00     90.94    9.04       0.03     97.71    2.28       0.01
Overall           209     96.76    3.24       0.01     95.91    4.08       0.01     98.97    1.02       0.01

Table 8 - Number of animals with high error rate and missing rate for different scenarios - Brown Swiss

                          3k (no error)              3k + 1% error              6k
Sire     Dam      No.     >5%Err   >10%Err  >5%Miss  >5%Err   >10%Err  >5%Miss  >5%Err   >10%Err  >5%Miss
50k      50k      25      0        0        0        0        0        0        0        0        0
50k      3k       5       0        0        0        0        0        0        0        0        0
50k      0k       164     5        1        0        27       1        0        1        0        0
0k       50k      1       0        0        0        0        0        0        0        0        0
0k       0k       14      13       1        0        14       4        0        0        0        0
Overall           209     18       2        0        41       5        0        1        0        0
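The counts in the Overall rows of Tables 4, 6 and 8 are easier to compare across breeds when expressed as a share of the validation animals. The Python sketch below does that for the ">5%Err" counts. It assumes the three column groups are, in order, 3k without simulated errors, 3k with 1% simulated errors, and 6k; that ordering is inferred from matching the Overall accuracies to Table 2 and is not stated explicitly in the tables. The names used are illustrative.

```python
# Overall rows of Tables 4, 6 and 8: validation animals and counts with >5% errors.
# The scenario order below is an assumption inferred from Table 2.
scenarios = ("3k", "3k + 1% error", "6k")
overall = {
    "Holstein": {"animals": 20031, "over_5pct": (740, 1891, 57)},
    "Jersey": {"animals": 1289, "over_5pct": (67, 136, 8)},
    "Brown Swiss": {"animals": 209, "over_5pct": (18, 41, 1)},
}


def share_over_threshold(breed):
    """Percentage of validation animals with more than 5% wrongly imputed genotypes."""
    row = overall[breed]
    return {s: 100.0 * c / row["animals"] for s, c in zip(scenarios, row["over_5pct"])}


shares = {breed: share_over_threshold(breed) for breed in overall}
for breed, by_scenario in shares.items():
    print(breed, {s: round(v, 2) for s, v in by_scenario.items()})
```

Under that assumption, the share of animals with more than 5% errors falls from roughly 9% (Holstein), 11% (Jersey) and 20% (Brown Swiss) with 3k plus 1% error to well under 1% in all three breeds with 6k.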
Table 9 - Percentage of correctly imputed genotypes for each chromosome

BTA    Length   No. SNP        Holstein          Jersey            Brown Swiss
       (Mbp)    3k     6k      3k+1%    6k       3k+1%    6k       3k+1%    6k
1      158.1    157    388     98.33    99.59    97.86    99.37    96.76    99.23
2      136.7    128    342     98.02    99.51    97.32    99.15    96.29    99.04
3      121.1    115    297     98.01    99.50    97.37    99.17    95.93    98.95
4      120.6    126    298     97.98    99.50    97.33    99.27    96.06    98.96
5      121.1    111    300     97.76    99.55    97.16    99.28    96.11    99.16
6      129.8    114    299     97.79    99.35    97.18    99.10    96.61    99.03
7      112.4    110    274     97.84    99.52    97.11    99.15    95.95    98.92
8      112.9    118    298     98.18    99.56    97.61    99.26    96.15    99.06
9      105.5    110    264     98.22    99.56    97.60    99.23    96.51    99.11
10     103.1    98     266     98.01    99.52    97.27    99.17    96.22    98.96
11     107      105    279     97.86    99.51    96.94    99.08    96.19    99.10
12     90.9     90     223     97.85    99.45    97.20    99.15    95.65    98.90
13     83.8     82     215     97.70    99.45    96.86    99.08    95.51    99.03
14     83.2     86     223     97.87    99.42    96.64    98.95    96.06    98.90
15     84.2     90     221     97.96    99.49    97.48    99.23    96.01    98.90
16     81.3     81     201     97.85    99.46    96.65    98.95    96.30    99.03
17     74.9     71     189     97.45    99.45    96.87    99.15    95.58    98.96
18     65.4     66     173     97.42    99.33    97.06    99.09    95.40    98.82
19     63.5     56     175     97.10    99.36    96.08    98.80    94.42    98.78
20     71.6     72     204     97.79    99.53    97.32    99.25    95.78    99.01
21     71.2     71     182     97.69    99.42    96.72    98.96    95.13    98.77
22     61.2     66     167     97.92    99.45    97.17    99.03    96.57    98.98
23     52.1     50     151     97.06    99.39    96.61    99.05    94.92    98.85
24     62.1     66     173     97.85    99.54    96.87    99.14    96.10    99.16
25     42.8     45     139     97.46    99.39    96.41    98.93    94.54    98.72
26     51       47     146     97.39    99.39    96.43    98.98    95.00    98.79
27     45.4     44     136     97.26    99.45    95.63    98.87    94.32    98.89
28     46.2     48     124     97.22    99.36    96.28    98.85    95.08    98.88
29     51.1     52     134     97.74    99.35    97.16    99.08    96.12    98.71
30 *   15.9     4      17      87.07    95.35    81.02    93.73    78.44    93.00
31 **  143.8    135    203     98.42    99.49    98.06    99.17    97.70    99.29
All    2,669.8  2,614  6,701   97.81    99.47    97.07    99.12    95.91    98.97

* Pseudo autosomal region
** Sex specific region
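The link between panel density and accuracy can be made concrete by converting the lengths and SNP counts in Table 9 into an average distance between adjacent SNP. The Python sketch below does this for a few rows copied from the table. It assumes SNP are spread evenly along each chromosome, which is a simplification, and the helper name `mean_spacing_kb` is hypothetical.

```python
# (length in Mbp, SNP on the 3k panel, SNP on the 6k panel), copied from Table 9.
chromosomes = {
    "BTA1": (158.1, 157, 388),
    "BTA19": (63.5, 56, 175),
    "BTA30": (15.9, 4, 17),
    "All": (2669.8, 2614, 6701),
}


def mean_spacing_kb(length_mbp, n_snp):
    """Average distance between adjacent SNP in kb, assuming even coverage."""
    return 1000.0 * length_mbp / n_snp


spacing = {
    name: (mean_spacing_kb(length, n_3k), mean_spacing_kb(length, n_6k))
    for name, (length, n_3k, n_6k) in chromosomes.items()
}
for name, (kb_3k, kb_6k) in spacing.items():
    print(f"{name}: 3k about {kb_3k:.0f} kb, 6k about {kb_6k:.0f} kb")
```

Genome-wide, this gives roughly one SNP per 1,021 kb for the 3k panel and one per 398 kb for the 6k panel. BTA30, with only 4 SNP on the 3k panel (about one per 4 Mbp), is also the region with the lowest accuracy in Table 9.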
Conclusions:

Imputation accuracy using the 3k panel is very high when both parents are genotyped with the 50k panel, and the gain from using the 6k panel is small in this case. The 6k panel resulted in substantially higher accuracy for animals with little family information, especially those with both parents missing or ungenotyped. The gain from the 6k panel over the 3k panel was largest in Brown Swiss and Jersey, mainly because there is less family information in these two breeds.