Linking a Statewide Assessment to the 2003 National Assessment of Educational Progress (NAEP) for 4th and 8th Grade Mathematics


Liru Zhang, Patsy Kersteter, Katia Foret
Delaware Department of Education

Shudong Wang
Harcourt Assessment, Inc.

Paper presented at the 2007 National Council on Measurement in Education Annual Conference, Chicago, IL, April 10-12, 2007

Background

The No Child Left Behind Act of 2001 (NCLB) requires each state to develop and administer annual tests in reading (or language arts) and mathematics, and it sets the goal of having all children at the proficiency level or higher by 2013-2014. The President's blueprint further declares that progress on state assessments will be confirmed by state results on NAEP for 4th and 8th grade students. To explore this new role for NAEP, the National Assessment Governing Board (NAGB) established a committee in 2001. The Committee recognized that existing statewide assessments differ widely in design and difficulty, and it identified the resulting measurement issues and technical challenges. After a careful review of NAEP's capacity and the test results of eight states, the Ad Hoc Committee Report (2002) concluded that NAEP can serve effectively as a source of confirmatory evidence for state test results. This effort reflects a broad, underlying interest among federal and state policymakers and many Americans in knowing how each state is performing in relation to high national or international benchmarks of student academic achievement.

Linking test scores from different assessments through statistical procedures, such as calibration, projection, and moderation, must satisfy certain requirements to support interpretable and valid comparisons (Mislevy, 1992; Linn, 1993; Ercikan, 1998). The accuracy of such linkages depends strongly on the context of the assessments, the groups used for calculating statistics, and the time at which the tests are administered (Linn, 1993). Kolen and Brennan (2004) suggest thinking about linking in terms of the degree of similarity of inferences, constructs, populations, and measurement characteristics and conditions. Perhaps the most distinctive feature of these four degrees of similarity is the explicit incorporation of inferences, such as the intended testing purpose, score reporting, and the stakes associated with a test. They suggest that every important topic must be addressed in conducting a linking study, from the data collection design and the statistical methods employed to the related assumptions.

The most successful example is the linking of scores on two nationally known college admission tests, the ACT and the SAT-I, in the 1990s (Dorans et al., 1997; Dorans, 1999). The study included 100,000 students who took both tests and eight circles of linkages. The authors differentiated three classes of statistical correspondence (equivalence, concordance, and prediction), depending on rational content considerations and empirical statistical relationships for the sub-tests (e.g., ACT Reading, SAT-I Verbal), and determined which type of correspondence was best suited for different scores and composite scores. In a general statement, they concluded that prediction should be used for this case. Most importantly, the two linked tests must measure similar constructs; otherwise, scaling is merely a mathematical operation applied to two sets of data to match test score distributions.

For more than a decade, researchers have investigated the feasibility of linking distinct assessments and the related psychometric issues. Linn and Kiplinger (1994) used equipercentile procedures to link the results of four states on standardized tests to NAEP 8th grade mathematics. The results suggested that the differences in linking functions across subgroups were due to substantial content discrepancies between the standardized tests and the NAEP frameworks. Similar efforts were reported by Ercikan (1998) and Johnson et al. (1998, 2002). The authors recognized the importance of overlapping content coverage for the quality of linkage and recommended using single-state data and a common group design to improve linking accuracy. A study by Zhang and Lau (2003) used a common group design and three-level content links to provide a basis for establishing the relationship between a state test and 1999 TIMSS mathematics. The results showed similar conversions for male and female students and relatively small linking errors.

Purpose of the Study

The present study investigated the linking of test scores on a statewide assessment to the 2003 National Assessment of Educational Progress (NAEP) for 4th and 8th grade mathematics, in order to facilitate interpretable comparisons of test results and to assemble external validity evidence for student performance on the state assessment. The accuracy of linking and the property of invariance between genders were examined. Measurement issues in using NAEP data to validate state test results through linking procedures were discussed.

Methods of the Study

Assessments

The Delaware Student Testing Program (DSTP) is a mandated, standards-based assessment. The DSTP is given annually in mathematics to students in grades 2 through 10. Test scores are reported on a developmental scale ranging from approximately 150 to 800 across grades. For grade 8, five performance levels were used in 2003 to determine student progress toward the standards and for high-stakes accountability: Well Below the Standard, Below the Standard, Meets the Standard, Exceeds the Standard, and Distinguished. For grade 4, three levels were reported for instructional improvement: Unsatisfactory and Satisfactory, with a narrow error band for Warning (Table 1).

The 2003 NAEP state assessment was the first administration under NCLB with all 50 states participating. To allow maximum coverage of mathematics abilities while minimizing the time burden for individual students, NAEP used matrix sampling of items and a procedure for distributing blocks across test booklets that controlled for position and context effects. Five plausible values (PV) were used to estimate student performance on a scale ranging from 0 to 500, and four achievement levels were reported for each grade (Table 1): Below Basic, Basic, Proficient, and Advanced.
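Because each student has five plausible values rather than a single score, there is more than one way to summarize them. The sketch below uses synthetic data only (the means, spreads, and sample size are invented for illustration, and `latent`, `pvs`, and `mpl` are hypothetical names). It contrasts collapsing the five values into one mean per student with computing a statistic once per plausible value and then averaging the five results. It is a minimal illustration of the general idea, not the procedure used to produce NAEP scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data only: 2,000 hypothetical students, five plausible values each.
n_students, n_pv = 2000, 5
latent = rng.normal(235.0, 28.0, size=n_students)  # assumed proficiency
pvs = latent[:, None] + rng.normal(0.0, 10.0, size=(n_students, n_pv))

# Option 1: collapse to one score per student (mean of plausible values).
mpl = pvs.mean(axis=1)

# Option 2: compute the statistic for each plausible value, then average.
ranks = [5, 50, 95]
per_pv = np.percentile(pvs, ranks, axis=0)   # shape (3, 5): one column per PV
avg_of_pv_percentiles = per_pv.mean(axis=1)  # shape (3,)
mpl_percentiles = np.percentile(mpl, ranks)  # shape (3,)

# Averaging within students removes part of the spread, so the
# distribution of mean plausible values is narrower than that of any one PV.
sd_mpl = mpl.std()
sd_single_pv = pvs.std(axis=0).mean()

print("percentile ranks:      ", ranks)
print("average over five PVs: ", np.round(avg_of_pv_percentiles, 1))
print("mean plausible value:  ", np.round(mpl_percentiles, 1))
print("SD of mean PV vs. single PV:", round(sd_mpl, 2), round(sd_single_pv, 2))
```

In this synthetic setting the two summaries agree closely near the middle of the distribution and can drift apart toward the tails, which is why it is worth checking both when plausible values feed a distribution-matching procedure.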

Test Administration

NAEP mathematics was administered during a 6-week testing window from late January until early March. A random sample of Delaware students was selected to participate in the 2003 NAEP mathematics assessment at grades 4 and 8. Test scores were weighted to reflect the student demographics accurately for reporting at the state level. NAEP was administered in two 25-minute blocks of cognitive testing in mathematics and one 20-minute background section with content-related questions and questions on family background.

The DSTP mathematics assessment was administered in April 2003 to all public school students in grades 4 and 8. The assessment had three sessions, including one calculator session, over two days of testing. Students had one hour for each session, and extended time was allowed as needed.

Samples of Students

Two NAEP data files for Delaware public schools were received from the National Center for Education Statistics (NCES): 3,140 students for grade 4 and 2,455 students for grade 8 (called the NAEP sample). In the NAEP sampling process, schools were stratified first on the location of the school and second on the minority characteristics of the student population. To inspect possible effects of sampling variation on student achievement, two DSTP data files were generated for each grade. The first data file included the population of 7,731 students in grade 4 and 9,467 students in grade 8 (called the DE population). Because individual identification was not available from the NAEP data file, the second data file included all students from the NAEP sampling schools: 7,118 for grade 4 and 7,491 for grade 8 (called the DE sample). These students participated in the 2003 state assessment and received a valid score in mathematics for aggregation.

Table 2 shows that 84% of elementary schools and 48% of middle schools participated in the 2003 NAEP. Compared with the student population, the NAEP sample had 1% more female students in both grades, 3% fewer White students in grade 4, and 1% fewer in grade 8.
The variations may be due to differences in enrollment and assessment participation, accommodations, and aggregation rules.

Methods and Process

This study employed the common group design to link the DSTP scale scores to the NAEP plausible values by using equipercentile procedures. Since student identification was not available from NAEP, the Delaware sample was matched to the NAEP sample at the school level, under the assumption that the students who took NAEP and those who took the DSTP were equivalent groups. The analysis was composed of two phases:

(1) The three-level content link between the DSTP and NAEP was conducted in general terms, involving comparisons of the State Content Standards and the NAEP Frameworks in mathematics, the test specifications, and a brief analysis of sample items. As indicated in the Ad Hoc Committee Report (2002), NAEP can be relied upon as a solid source of confirmatory evidence because there is sufficient correspondence in content coverage between NAEP and state tests and a high degree of overlap is possible.

(2) The statistical linkage was performed using the unsmoothed equipercentile procedures described by Kolen and Brennan (1995, 2004). To examine the property of invariance, independent linking was performed for male and female students separately.

The process for the statistical linkage is described below:

- Link the distribution of the DSTP scale scores to the distribution of the mean plausible values (MPL) of NAEP for grade 4 and grade 8.
- Compare the independent linking functions obtained from the two DSTP data files, the Delaware grade population and the Delaware sample that matched the NAEP sampling schools, for grade 4 and grade 8.
- Compare the average of the linking functions obtained from each of the five NAEP plausible values with the linking function obtained from the mean of the five plausible values for grade 4.
- Compare the independent linking functions for male and female students with the linking functions obtained from the total sample to examine the invariance of linking for grade 4 and grade 8.

The standard errors of linking were estimated with the following formula from Petersen, Kolen & Hoover (1993, p. 251):

$$SE[e_y(x)] = \sqrt{\sigma_y^2 \, \frac{pq}{\phi^2} \left( \frac{1}{n_x} + \frac{1}{n_y} \right)}$$

where p is the proportion of cases falling below the score, q = 1 - p, φ is the ordinate of the standard normal density at the unit-normal score z below which p of the cases fall, σ_y² is the variance of test Y, and n_x and n_y are the sample sizes.

Two statistics from Dorans and Holland (2000), the standardized Root Mean Square Difference (RMSD) and the standardized Root Expected Mean Square Difference (REMSD), were calculated between the transformation functions obtained from the subgroups and from the population to examine the invariance of linking across subgroups.
$$RMSD(y) = \frac{\sqrt{\sum_j w_j \left[ e_{p_j}(y) - e_p(y) \right]^2}}{\sigma_{xp}} \qquad (1)$$

$$REMSD = \frac{\sqrt{\sum_j w_j \, E_p\left\{ \left[ e_{p_j}(y) - e_p(y) \right]^2 \right\}}}{\sigma_{xp}} \qquad (2)$$

where
e_p(y) represents the scores of test Y converted to the scale of test X for the total group p,
e_{p_j}(y) represents the scores of test Y converted to the scale of test X for subgroup p_j,
w_j is the ratio of the subgroup to the total group,
E_p denotes the expectation over the score distribution of the total group, and
σ_{xp} is the standard deviation of test X in the total group.

Results and Discussion

Content Link

The five content areas that constituted the 2003 NAEP mathematics assessment at both grades 4 and 8 were: Number Sense, Properties, and Operations; Measurement; Geometry and Spatial Sense; Data Analysis, Statistics, and Probability; and Algebra and Functions. There were three mathematical dimensions: Conceptual Understanding; Procedural Knowledge; and Problem Solving. The six content domains measured by the DSTP mathematics assessment were: Estimation, Measurement, and Computation; Number Sense; Algebra; Spatial Sense and Geometry; Patterns, Relationship, and Functions at three performance levels: Conceptual Knowledge; Procedural Knowledge; and Mathematical Process (Problem Solving).

The results of the preliminary content link suggested considerable overlap between the State Content Standards and the NAEP Frameworks. For the 2003 assessments, both were founded primarily on the National Council of Teachers of Mathematics Curriculum and Evaluation Standards for School Mathematics (1989). Tables 3a and 3b display the target percentage distribution of items by content category and grade and the three types of items used in both assessments: multiple-choice, short-answer, and extended constructed-response. One of the major differences seen in Table 3a was due to the grouping of the objectives measured and the title used for each category.
For example, if we group Estimation, Measurement, and Computation with Number Sense for the Delaware Standards, combining these first two standards yields almost the same percentage for the DSTP (41%) as for NAEP (40%) in grade 8. Similarly, grouping the standard of Algebra with the standard of Patterns, Relationships, and Functions for the DSTP in grade 4 (16%) gave a percentage nearly equivalent to that of the single NAEP standard of Algebra and Functions (15%). In general, NAEP and the DSTP appeared very similar in emphasis and proportion once standards were combined. For example, in grade 4, over one half of the test content came from two standards, Estimation, Measurement, and Computation; and Number Sense, for both NAEP (60%) and the DSTP (53%). In grade 8, the majority of the test content came from the combination of Estimation, Measurement, and Computation; and Number Sense (40% for NAEP; 41% for the DSTP) and the combination of Algebra with Patterns, Relationships, and Functions (40% for NAEP; 31% for the DSTP).

In contrast, the proportion of each measured standard seemed to differ between the two assessments. For example, NAEP had 7% more items than the DSTP on Estimation, Measurement, and Computation; and Number Sense in grade 4, and 10% more items than the DSTP on Spatial Sense and Geometry in grade 8. Moreover, NAEP and the DSTP had the same labels and descriptions for the three cognitive categories. The DSTP had 40% Conceptual Knowledge, 40% Procedural Knowledge, and 20% Problem Solving. However, the NAEP framework did not specify the percentage of items in the three mathematical abilities in practice.

The item-level analysis offered detailed information about the features of the test items and scoring rubrics. Both NAEP and the DSTP used the same item types and similar contexts to measure mathematical concepts and skills (Table 3b). Since the number of items of each type by test booklet was not available for NAEP, the percentages of item formats in the two assessments could not be compared. A discrepancy in performance expectations, however, was observed in the process of item review, which seemed to stem from the different objectives for the target student populations. At the item level, we found that the majority of the DSTP test items measured on-grade mathematical knowledge and skills for students in grade 4 and in grade 8, while some NAEP questions are given to students at more than one grade or age level. These questions are referred to as cross-grade questions, for example, questions shared between grade 4 and grade 8 (NAEP Cross Grade Questions Information, NAEP Questions Tool Help, 2007).

Appendix A-1 and A-2 include released sample items from the DSTP and NAEP for grades 4 and 8. The sample items selected for each grade measured the same content standards in the same item format. Two MC items for grade 4 measure the content standard of Spatial Sense; two measure Probability. The two MC items for grade 8 measure statistics and interpretation of graphs; the two constructed-response items measure algebra.

Statistical Link

The statistical linkage was performed by using equipercentile procedures. A linking function is an equipercentile linking function if the score distribution on the X-test converted to the Y-test scale is equal to the score distribution on the Y-test in the population (Kolen & Brennan, 1994). In this study, the equipercentile linking function was developed by identifying the scale score on the DSTP that had the same percentile rank as the plausible value on the NAEP scale. When no student earned a particular score, so that the corresponding percentile rank was not unique, the median was chosen. Descriptive statistics were calculated by grade for every subgroup. The Standardized Mean Difference (SMD) by gender provided a scale-independent way to quantify the difference in group means. Two statistics were used to examine the population invariance of linking for males and females: the standardized Root Mean Square Difference (RMSD), which is associated with a particular score, and the standardized Root Expected Mean Square Difference (REMSD), which summarizes the overall difference for the entire group.
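The equipercentile idea and the two invariance statistics can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the percentile rank uses the common "percent below plus half the percent at the score" convention, the inverse step relies on NumPy's default linear interpolation rather than the exact unsmoothed procedure, subgroup weights are taken from the sizes of the score samples being converted, and the demonstration data are synthetic.

```python
import numpy as np

def percentile_rank(scores, x):
    """Percent of scores below x plus half the percent exactly at x."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * (np.mean(scores < x) + 0.5 * np.mean(scores == x))

def equipercentile_link(from_scores, to_scores, x):
    """Score on the 'to' scale that has the same percentile rank as x."""
    pr = percentile_rank(from_scores, x)
    return float(np.percentile(np.asarray(to_scores, dtype=float), pr))

def rmsd(groups, total_from, total_to, x):
    """Standardized RMSD at score x.

    groups is a list of (from_scores, to_scores) pairs, one per subgroup.
    """
    n_total = sum(len(f) for f, _ in groups)
    e_total = equipercentile_link(total_from, total_to, x)
    weighted_sq = 0.0
    for f, t in groups:
        w = len(f) / n_total  # assumed weight: subgroup share of the sample
        weighted_sq += w * (equipercentile_link(f, t, x) - e_total) ** 2
    return float(np.sqrt(weighted_sq) / np.std(total_to))

def remsd(groups, total_from, total_to):
    """Standardized REMSD: RMSD squared, averaged over the total-group scores."""
    values, counts = np.unique(np.asarray(total_from, dtype=float),
                               return_counts=True)
    probs = counts / counts.sum()
    squared = np.array([rmsd(groups, total_from, total_to, x) ** 2
                        for x in values])
    return float(np.sqrt(np.sum(probs * squared)))

# Synthetic demonstration: two hypothetical subgroups on two invented scales.
rng = np.random.default_rng(1)
g1_from = np.round(rng.normal(450, 40, 500))
g2_from = np.round(rng.normal(448, 38, 500))
g1_to = rng.normal(236, 28, 500)
g2_to = rng.normal(234, 27, 500)
total_from = np.concatenate([g1_from, g2_from])
total_to = np.concatenate([g1_to, g2_to])
groups = [(g1_from, g1_to), (g2_from, g2_to)]

rmsd_demo = rmsd(groups, total_from, total_to, 450.0)
remsd_demo = remsd(groups, total_from, total_to)
print("RMSD at 450:", round(rmsd_demo, 4), " REMSD:", round(remsd_demo, 4))
```

When a single subgroup coincides with the total group, both statistics are zero by construction, which is a quick sanity check on any implementation.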

(1) Link for Grade 4

Descriptive statistics for the DSTP, for both the Delaware sample that matched the NAEP sampling schools and the Delaware population, and for the NAEP plausible values are presented in Table 4. The relative frequency distributions displayed in Figures 1 and 2 suggest that the DSTP scores are slightly negatively skewed, while the mean plausible values are nearly normally distributed. The plots in Figures 3 and 4 illustrate a slightly flat but straight line between DSTP and NAEP scores, indicating a strong relationship.

Table 5 shows the linking functions of the DSTP scale scores obtained from the sampling schools and from the grade population to the mean of plausible values (MPL) from NAEP at selected percentile ranks (Appendix B: Table 1. Conversion Table for Grade 4). The three NAEP cut scores were located at the percentile ranks of 19, 69, and 98, respectively, and the corresponding DSTP scale scores derived from the Delaware sample were 425 for Basic, 470 for Proficient, and 533.5 for Advanced. The equivalents obtained from the sample and from the population were generally consistent, with slight discrepancies of no more than one score point between the percentile ranks of 10 and 75. The largest difference between the sample and the population was 2 score points, at the percentile ranks of 80 and 90.

These results were also compared with the linking functions and the equipercentile equivalents on the NAEP scale obtained from the average of five independent linkages (AIL) using the plausible values. The scores at selected percentile ranks presented in Table 6 suggested that, at the same locations on the score distribution, the equivalent NAEP scores were identical for MPL and AIL between the percentile ranks of 10 and 75. The equivalent NAEP scores were larger for MPL than for AIL at the lower end of the score distribution, by a maximum of 3 score points. In contrast, the equivalent NAEP scores derived from MPL and AIL were relatively consistent at the higher end of the score distribution, with discrepancies of up to 2 score points except at the 99th percentile rank. Using the NAEP cut scores as the reference for comparison, the equivalent NAEP scores at the percentile ranks of 19 for Basic and 69 for Proficient were identical for MPL and AIL. For Advanced (cut score = 282), the linked DSTP score from AIL was 524.5 at the 97th percentile rank, with an equivalent of 282 (rounded) on the NAEP scale, while the linked DSTP score from MPL was 533.5 at the 98th percentile rank, with an equivalent of 283 (the nearest) on the NAEP scale.

Descriptive statistics for the DSTP sample and total group and for NAEP are displayed by gender in Table 7. The positive values of the Standardized Mean Difference (SMD) suggest that grade 4 male students scored higher than female students in mathematics, but only by a trivial amount.

Independent linking was conducted for male students, and the results at selected percentile ranks are presented in Table 8a (Appendix B: Table 2. Conversion Table for Grade 4). The linking functions for males were found to be consistent with the linking functions for the total sample. The maximum difference was 2.5 score points between the percentile ranks of 10 and 69. Larger discrepancies were observed at and below the percentile rank of 5 (4 score points), and an unexplained one was found at the percentile rank of 95 (5 score points). Using the NAEP cut scores as the reference for comparison, the linking function for males was 1.5 score points lower than that for the total group at the percentile rank of 19 for Basic (cut score = 214), two score points higher at the percentile rank of 69 for Proficient (cut score = 249), and 1.5 score points higher at the percentile rank of 98 for Advanced (cut score = 282). In other words, according to the relationship between NAEP and the DSTP in this study, male students need to score at the 20th percentile rank on the DSTP scale rather than at the 19th to achieve NAEP Basic, and at the 66th percentile rank, with an equivalent DSTP score of 469 instead of 470, to be Proficient. The equivalent DSTP score for the NAEP Advanced level was ambiguous for males, since the cut score (282) fell between the percentile ranks of 97 and 98 and a span of DSTP scores (525.5 to 534) was equivalent to the same region. Moreover, the linking functions obtained from the Delaware sample and the Delaware population were relatively consistent, with discrepancies of up to 1.5 score points; however, at the higher percentile ranks of 97, 98, and 99, the linking functions for the Delaware population differed from those for the sample by 7.5, 8.5, and 6.5 score points, respectively, which remains unexplained.

Independent linking was conducted for female students, and the results at selected percentile ranks are displayed in Table 8b (Appendix B: Table 3. Conversion Table for Grade 4). The linking functions for females were consistent with those for the total group between the percentile ranks of 10 and 69, with discrepancies of up to 1.5 score points.
At the lower end of the distribution, the linking functions were higher for females than for the total sample, by up to 3 score points; in the low-middle region of the distribution (PR = 25 to 45), the linking functions were identical; at the percentile rank of 50 and above, the linking functions for females were lower than those for the total sample, with discrepancies of up to 3 score points. At the top percentile ranks (PR = 97 to 99), however, the discrepancies between females and the total group became unexpectedly large. Using the NAEP cut scores as the reference for comparison, the linking function for female students was one score point higher than that for the total group for Basic (PR = 19), 2.5 score points lower for Proficient (PR = 69), and 3 score points lower for Advanced (PR = 98). In other words, according to the relationship between NAEP and the DSTP in this study, female students need to score at the 72nd percentile rank, with an equivalent DSTP score of 471 instead of 470, to be Proficient. Moreover, the linking functions derived from the Delaware sample and the grade population were relatively consistent, with a difference of one score point between the percentile ranks of 1 and 85; the differences rose to 7.5 score points at the 98th percentile rank.

In Table 9, the standard error of linking was estimated for grade 4 under the assumption that the test scores from the DSTP and NAEP were approximately normally distributed. The standardized REMSD was 0.4137 (41% of a standard deviation) for males and 0.5751 (58% of a standard deviation) for females, which suggests that, according to the commonly used criterion of 0.50, the overall linking results were invariant for male students in grade 4 but slightly different for female students. The standardized RMSD ranged from 0 to 0.0965 for males and from 0 to 0.2270 for females. At the percentile rank of 19, for NAEP Basic, the standardized RMSD was 0.03895 for males and 0.018924 for females; at the percentile rank of 69, for Proficient, it was 0.03860 for males and 0.03784 for females. The statistics, however, increased to 0.0097 for males and 0.1513 for females at the percentile rank of 98 for NAEP Advanced.

(2) Link for Grade 8

Descriptive statistics for the DSTP, for both the Delaware sample that matched the NAEP sampling schools and the Delaware population, and for the NAEP plausible values are presented in Table 10. The relative frequency distributions displayed in Figures 5 and 6 suggest that the DSTP scores are slightly positively skewed, while the mean plausible values are nearly normally distributed. The plots in Figures 7 and 8 illustrate a straight line with a slight curve between DSTP and NAEP scores, indicating a strong relationship.

Table 11 shows the linking functions of the DSTP scale scores obtained from the sampling schools and from the grade population to the mean of plausible values from NAEP at selected percentile ranks (Appendix C: Table 1. Conversion Table for Grade 8). The three NAEP cut scores were located at the percentile ranks of 30, 74, and 96, respectively, and the corresponding DSTP scale scores based on the Delaware sample were 473 for Basic, 519.5 for Proficient, and 572 for Advanced. The equivalents obtained from the Delaware sample and from the population were consistent at and below the percentile rank of 40, with trivial discrepancies. The discrepancy increased to 3 score points above the 40th percentile rank and reached 4 score points at the percentile rank of 95.

Descriptive statistics for the DSTP sample and total group and for NAEP are displayed by gender in Table 12. The positive values of the Standardized Mean Difference (SMD) suggest that grade 8 male students scored higher than their peers in mathematics by a trivial amount.

Independent linking was conducted for male students, and the results at selected percentile ranks are displayed in Table 13a (Appendix C: Table 2.
Conversion Table for Grade 8). The linking functions for male students were consistently aligned with those for the total sample except at the top end of the distribution (percentile ranks 95 to 99). Using the NAEP cut scores as reference points, the linking function for males was identical to that for the total group at the percentile rank of 30 for Basic (cut score = 262) and one score point higher at the percentile rank of 74 for Proficient (cut score = 299). For the Advanced level, the linking function for males was 3 score points higher than that for the total sample. In other words, male students need to score at the 70th percentile rank, with an equivalent DSTP score of 514 instead of 519.5, to be Proficient according to the relationship between NAEP and DSTP in this study. Identifying the equivalent DSTP score for the NAEP Advanced level was ambiguous for males because the cut score (333) fell between the percentile ranks of 95 and 96, and a span of DSTP scores (568.5 to 575) corresponded to the same region. Moreover, the linking functions for male students obtained from the Delaware sample and from the population were consistent but showed larger discrepancies than those for the total sample. For example, at the percentile rank of 74, the difference was 3 score points; at the percentile rank of 96, the difference increased to 4.5 score points.

Independent linking was conducted for female students and the results at selected percentile ranks are displayed in Table 13b (Appendix C: Table 3. Conversion Table for Grade 8). The linking functions for females were identical or nearly identical to those for the total group between the percentile ranks of 25 and 50. At the lower end of the distribution, the linking function was up to 2.5 score points higher for females than for the total sample, while it was lower for females above the middle of the score distribution. At the percentile rank of 97 and above, the discrepancies between females and the total group became unexpectedly large. Using the NAEP cut scores as reference points, the linking function was identical for females and the total sample at the Basic level (PR = 30; cut score = 262), two score points lower for females at the percentile rank of 74 (cut score = 299) for Proficient, and 4 score points lower for females at Advanced (PR = 96; cut score = 333). In other words, female students need to perform at the 75th percentile rank to be Proficient and at the 97th percentile rank to reach the Advanced level according to the relationship between NAEP and DSTP in this study. Moreover, the linking functions derived from the Delaware sample and from the grade population were relatively consistent, with discrepancies ranging from 0.5 up to 3.5 score points at the 74th and 85th percentile ranks.

In Table 14, the standard error of linking was estimated for grade 8 based on the assumption that normality of test scores from DSTP and NAEP was approximately achieved.
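The subgroup comparisons above can be condensed into standardized difference indices of the kind reported next. The sketch below is only an illustration: this section does not state the exact formulas used, so it assumes a per-subgroup form in the spirit of Dorans and Holland (2000), in which the difference between a subgroup linking function and the total-group linking function is divided by a standard deviation. The function names, the weights, and the input numbers are hypothetical and are not taken from the study's tables.

```python
import math

def standardized_rmsd(subgroup_equiv, total_equiv, sd):
    """Absolute subgroup-vs-total difference at one score point, in SD units."""
    return abs(subgroup_equiv - total_equiv) / sd

def standardized_remsd(subgroup_equivs, total_equivs, rel_freqs, sd):
    """Frequency-weighted root mean square difference over all score points, in SD units."""
    total_weight = sum(rel_freqs)
    mean_sq = sum(
        w * (s - t) ** 2
        for s, t, w in zip(subgroup_equivs, total_equivs, rel_freqs)
    ) / total_weight
    return math.sqrt(mean_sq) / sd

# Hypothetical equivalents at three score points (NOT values from the study).
subgroup = [470.0, 517.5, 568.0]
total = [470.0, 519.5, 572.0]
weights = [0.5, 0.3, 0.2]   # hypothetical relative frequencies
sd = 40.0                   # hypothetical standard deviation

point_index = standardized_rmsd(subgroup[2], total[2], sd)
overall_index = standardized_remsd(subgroup, total, weights, sd)
```

Under this reading, the point index describes invariance at a single percentile rank (for example, at a cut score), while the overall index summarizes the whole score range and is the quantity compared against a criterion such as 0.50.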
The standardized REMSD was .3189 (32% of a standard deviation) for males and .4779 (48% of a standard deviation) for females, which suggests that the overall linking results were invariant for male and female students according to the commonly used criterion of 0.50. The standardized RMSD ranged from 0 to 1.044 for males and from 0 to 1.613 for females. At the percentile rank of 30, the standardized RMSD was 0.0 for NAEP Basic; at the percentile rank of 74, the standardized RMSD was 0.00144 for males and 0.0533 for females for Proficient. The statistics, however, increased to 0.13 (13%) for males and 0.2133 (21%) for females at the percentile rank of 96 for NAEP Advanced.

(3) Validation

Did the conversion tables developed from the 2003 DSTP scores and the 2003 NAEP mean plausible values reflect the relationship between the statewide assessment and NAEP in mathematics for grades 4 and 8? To what extent can the conversion tables be used to estimate Delaware student performance on NAEP? The validation took two approaches.

First, we used the DSTP scores to estimate the NAEP average scores by using the corresponding conversion tables (Appendix B for grade 4; Appendix C for grade 8). Table 15 contains the actual state average scores in mathematics, the estimated NAEP average scores, the actual NAEP average scores from the NAEP report, and the residuals between estimated and actual NAEP scores for the total group and by gender for 2003 and 2005. The results show that the residual ranged from 0.5 to 2 score points for grade 4 and from 2 to 3 score points for grade 8. The maximum size of the residual was about 1% to 8% of a standard deviation.

Second, we used the DSTP equivalents of the NAEP cut scores from the conversion table to estimate the percentage of Delaware students at the NAEP proficiency levels. Table 16 contains the DSTP scores corresponding to the three NAEP cut scores for Basic, Proficient, and Advanced; the estimated locations of the DSTP equivalents on the 2005 distributions of scores (Appendix D) and the actual locations on the NAEP scale from the NAEP report; the estimated and the actual percentages of Delaware students at each proficiency level; and the residuals between the estimated and the actual percentages for the total group and by gender. The discrepancies in the percentage of students at each level ranged from -3% to +3% for grade 4 and from -2% to +3% for grade 8.

Summary of Findings

There has been considerable interest among policy makers, educators, and the general public in confirming student progress on state assessments with NAEP's results, particularly because of the high-stakes accountability requirements under NCLB. The present study investigated the feasibility and accuracy of linking state test scores to NAEP in mathematics for grades 4 and 8 to facilitate interpretable comparisons of the test results. As the Standards for Educational and Psychological Testing indicate, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence (1999, p. 13).

(1) Many researchers have pointed out that scores on one test can provide a meaningful source of confirmatory evidence for another test only if the two linked tests measure similar constructs.
The content links revealed that there was considerable overlap between the State Content Standards and the NAEP Frameworks, as both were primarily founded on the NCTM Curriculum and Evaluation Standards for School Mathematics (1989). Similarities between DSTP and NAEP were also observed in test specifications, types of items, and item context. Both DSTP and NAEP were administered under standardized conditions within a testing window from January to April of 2003.

(2) The results of the analysis suggested that the property of population invariance was reasonably attained for male and female students. The standardized REMSD was smaller than 0.50 for grade 4 males and for grade 8 males and females. The standardized REMSD of 0.5751 (58% of a standard deviation) indicated that the linking results were slightly biased for grade 4 females.

(3) The validation study provided encouraging evidence to support the use of the conversion tables, which were based on the relationships established in this study between DSTP and NAEP scores through a linking procedure. The differences between estimated and actual average NAEP scores were in an acceptable range of 0 to 3 score points; the discrepancies between the estimated and actual percentages of students falling into each NAEP category were up to 3%.

(4) A major challenge in the linking process of the current study was that, when no student earned a particular score on a distribution, the corresponding percentile rank was not unique. The use of the middle score was a subjective choice, which might contribute to inaccuracy in the linking results, especially at the extreme ends of the distribution. Smoothing procedures seem worth exploring in future studies to improve the linking quality.

(5) Factors that might contribute to inaccuracy in the linking results were recognized in the current analysis and could possibly be improved in the future. They include dissimilarities between the two assessments (test length and scale of test scores), variations between the NAEP sample and the grade population (sampling procedures, accommodations, and aggregation rules), and student motivation.

(6) Most importantly, the linking results strongly support the academic achievement of Delaware students from 2003 to 2005 in mathematics, with confirmatory evidence provided by the results of NAEP for grades 4 and 8. The findings of this study indicate substantial possibilities for states that wish to validate student performance on local assessments against high national benchmarks like NAEP.
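The non-uniqueness described in finding (4) can be illustrated with a small sketch. It assumes a simplified empirical rule: the score at percentile rank p is the lowest observed score whose cumulative percentage reaches p, and when the cumulative percentage equals p exactly, every value up to the next observed score qualifies, so the midpoint is taken. This is an assumption made for illustration, not the study's actual procedure or software; the function names and the four-student data sets are hypothetical.

```python
import numpy as np

def score_at_percentile(scores, p, tol=1e-9):
    """Score at percentile rank p, using the midpoint when p falls in a gap."""
    values, counts = np.unique(np.asarray(scores, dtype=float), return_counts=True)
    cum_pct = 100.0 * np.cumsum(counts) / counts.sum()  # percent at or below each observed score
    i = int(np.argmax(cum_pct >= p - tol))              # first observed score reaching p
    lo = values[i]
    if abs(cum_pct[i] - p) <= tol and i + 1 < len(values):
        # No student scored strictly between values[i] and values[i + 1],
        # so the percentile is not unique; take the middle score.
        return float((lo + values[i + 1]) / 2.0)
    return float(lo)

def link_cut_score(from_scores, to_scores, cut):
    """Percentile rank of `cut` on one test and the equipercentile equivalent on the other."""
    from_scores = np.asarray(from_scores, dtype=float)
    p = float(100.0 * np.mean(from_scores < cut))       # percent scoring below the cut
    return p, score_at_percentile(to_scores, p)

# Hypothetical toy data (NOT from the study).
naep_like = [200, 210, 220, 230]
dstp_like = [400, 420, 440, 460]
rank, equivalent = link_cut_score(naep_like, dstp_like, cut=220)
```

In the toy data, half of the students fall below the cut, and any value between 420 and 440 leaves exactly half at or below it, so the sketch returns the midpoint. Presmoothing the score distributions, as the finding suggests, would reduce the dependence on this kind of tie rule.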

References

Brennan, R.L. (2004). Manual for LEGS version 2.0. Center for Advanced Studies in Measurement and Assessment. CASMA Research Report No. 3.
Dorans, N.J., Lyu, C.F., Pommerich, M., & Houston, W.M. (1997). Concordance between ACT Assessment and Recentered SAT I Sum Scores. College and University, 73(2), 24-34.
Dorans, N.J. (1999). Correspondences between ACT and SAT I Scores. College Board Report No. 99-1; ETS RR No. 99-2.
Dorans, N.J. and Holland, P.W. (2000). Population Invariance and the Equatability of Tests: Basic Theory and the Linear Case. Research Report, ETS.
Ercikan, K. (1997). Linking statewide tests to the National Assessment of Educational Progress: Accuracy of combining test results across states. Applied Measurement in Education, 10, 145-160.
Johnson, E.G. and Owen, E. (1998). Linking the National Assessment of Educational Progress (NAEP) and the Third International Mathematics and Science Study (TIMSS): A technical report (Publication No. NCES 98-499). Washington, DC: National Center for Education Statistics.
Johnson, E.G., Siegendorf, A., and Phillips, G.W. (1998). Linking the National Assessment of Educational Progress and the Third International Mathematics and Science Study: Eighth grade results (Publication No. NCES 98-500). Washington, DC: National Center for Education Statistics.
Johnson, E.G. (2002). Linking NAEP 2000 to TIMSS 1999. Paper presented at the 2002 AERA/NCME Annual Conference, New Orleans, LA.
Kolen, M.J. and Brennan, R.L. (1995). Test equating: Methods and practices. Springer.
Kolen, M.J. and Brennan, R.L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). Springer.
Linn, R.L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83-102.
Linn, R.L. and Kiplinger, V.L. (1994). Linking statewide tests to the National Assessment of Educational Progress: Stability of results (CSE Technical Report 375). National Center for Research on Evaluation, Standards, and Student Testing.
Mislevy, R.J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Policy Information Center, Educational Testing Service.
NAEP 2003 Mathematics Report Card. NCES.
National Council of Teachers of Mathematics (1989). Curriculum and Evaluation Standards for School Mathematics.
Petersen, N.S., Kolen, M.J., and Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational Measurement (3rd ed.). New York: Macmillan.
Standards for Educational and Psychological Testing (1999). American Educational Research Association, American Psychological Association, National Council on Measurement in Education.
Using the National Assessment of Educational Progress to confirm state test results. A Report of the Ad Hoc Committee on Confirming Test Results, Attachment B. March 1, 2002.
Zhang, L.R. and Lau, A. (2003). Linking a statewide assessment to the 1999 TIMSS for eighth grade mathematics. Paper presented at the 2003 AERA/NCME Annual Conference, Chicago, IL, April 21-25, 2003.

Table 1. Achievement Levels and Cut Scores by Grade

Assessment   Achievement Level (Grade 4)   Grade 4 Cut Score   Achievement Level (Grade 8)   Grade 8 Cut Score
DSTP         Warning*                      425                 Below the Standard            469
             Satisfactory                  433                 Meets the Standard            493
                                                               Exceeds the Standard          531
                                                               Distinguished                 549
NAEP         Basic                         214                 Basic                         262
             Proficient                    249                 Proficient                    299
             Advanced                      282                 Advanced                      333

* Test scores for grade 4 were used for instructional improvement in 2003. The achievement levels were primarily set as Satisfactory and Unsatisfactory, with a narrow error band for Warning.
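As a small illustration of how the NAEP cut scores in Table 1 partition the scale, the sketch below maps a NAEP scale score to an achievement level. It assumes that a score equal to a cut score reaches that level and uses the label "Below Basic" for scores under the Basic cut; both are assumptions, since Table 1 lists only the three cut scores. The function name is hypothetical.

```python
from bisect import bisect_right

# NAEP mathematics cut scores from Table 1, keyed by grade.
NAEP_CUTS = {4: [214, 249, 282], 8: [262, 299, 333]}
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]  # "Below Basic" label is an assumption

def naep_level(score, grade):
    """Return the achievement level for a NAEP scale score (a score at the cut reaches the level)."""
    return LEVELS[bisect_right(NAEP_CUTS[grade], score)]

grade4_level = naep_level(249, 4)
grade8_level = naep_level(261.5, 8)
```

The same lookup applies on the DSTP scale once the DSTP equivalents of the NAEP cut scores are substituted for the cut scores.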

Table 2. Percentage of Schools and Subgroups of Students

Subgroup        DSTP Population   DSTP Sample*   NAEP** Sample
Grade 4
  School        106 (n)           0.84           0.84
  Female        0.49              0.49           0.50
  Male          0.51              0.51           0.50
  Black         0.32              0.32           0.34
  White         0.59              0.59           0.56
Grade 8
  School        75 (n)            0.48           0.48
  Female        0.48              0.48           0.49
  Male          0.52              0.52           0.51
  Black         0.34              0.33           0.30
  White         0.60              0.58           0.61

* The DSTP sample includes all students from the NAEP sampling schools.
** The NAEP sample includes students who were randomly selected using stratified sampling procedures that are briefly described in the proposal.

Table 3a. Target Percentage Distribution of Items by Content Category and Grade

DSTP
Content Category                              Grade 4 N.   Grade 4 %   Grade 8 N.   Grade 8 %
Estimation, Measurement, and Computation      18           32          14           23
Number Sense                                  12           21          11           18
Spatial Sense and Geometry                    10           18          6            10
Statistics and Probability                    8            14          11           18
Algebra                                       4            7           12           20
Patterns, Relationships, and Functions        5            9           7            11
Total                                         57           100         61           100

NAEP
Content Category                              Grade 4 N.*  Grade 4 %   Grade 8 N.   Grade 8 %
Number Sense, Properties, and Operations                   40                       25
Measurement                                                20                       15
Geometry and Spatial Sense                                 15                       20
Data Analysis, Statistics, and Probability                 10                       15
Algebra and Functions                                      15                       25
Total                                         181          100         197          100

* The number of items in each content category by test booklet is not available for NAEP. The total number is for multiple test booklets.
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 1990, 1992, 1996, 2000, and 2003 Mathematics Assessments.

Table 3b. Distribution of Items Administered by Item Type and Grade

Question Format                   DSTP Grade 4   DSTP Grade 8   NAEP* Grade 4   NAEP* Grade 8
Multiple-Choice                   46             50             114             129
Short-Answer                      8              8              59              58
Extended Constructed-Response     3              3              8               10
Total                             57             61             181             197

* The number of items for each item type by test booklet is not available for NAEP. The total number of items is for multiple test booklets of NAEP.
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 1990, 1992, 1996, 2000, and 2003 Mathematics Assessments.

Appendix A-1 Sample Items for Grade 4

NAEP Grade 4 Sample Item 1
This item measures Geometry and Spatial Sense
Which of these shapes are cylinders?
a. 1 and 2
b. 1 and 3
c. 2 and 4
d. 3 and 4

DSTP Grade 4 Sample Item 1
This item measures Geometry and Spatial Sense

NAEP Grade 4 Sample Item 2
This item measures Probability
In a gumball machine there are 100 red, 75 blue, 50 green, and 125 yellow gumballs. These 350 gumballs are mixed up. Sam puts money in and one gumball comes out. Which color is most likely to come out?
a. Red
b. Blue
c. Green
d. Yellow

DSTP Grade 4 Sample Item 2
This item measures Probability

Appendix A-2 Sample Items for Grade 8

NAEP Grade 8 - Sample Item 1
The pie chart above shows the portion of time Pat spent on homework in each subject last week. If Pat spent 2 hours on mathematics, about how many hours did Pat spend on homework altogether?
a. 4
b. 8
c. 12
d. 16

Mathematical Content Area: Data analysis, statistics, and probability
This question measures data analysis, statistics, and probability. This content area focuses on the skills of collecting, organizing, reading, representing, and interpreting data. These are assessed in a variety of contexts to reflect the use of these skills in dealing with information. Students are expected to use statistics and statistical concepts to analyze and communicate interpretations of data. Students are also expected to understand the meaning of basic probability concepts and applications of these concepts in problem-solving and decision-making situations.

Mathematical Ability: Problem solving
This question measures students' problem solving ability. Students demonstrate problem solving in mathematics when they recognize and formulate problems; determine the consistency of data; use strategies, data, models; generate, extend, and modify procedures; use reasoning in new settings; and judge the reasonableness and correctness of solutions. Problem solving situations require students to connect all of their mathematical knowledge of concepts, procedures, reasoning, and communication skills to solve problems.

DSTP Grade 8 Sample Item 1
The graph below shows the results of the election for class president. Which statement is supported by the information in the graph?
F Sara received half of the votes.
G The ratio of Chad's votes to Mary's votes is 2 to 5.
H Mary received about 37.5% of the votes.
J Mary received ½ as many votes as Sara.

NAEP Grade 8 - Sample Item 2:
While she was on vacation, Tara sent 14 friends either a letter or a postcard. She spent $3.84 on postage. If it costs $0.20 to mail a postcard and $0.33 to mail a letter, how many letters did Tara send? Show what you did to get your answer.
Did you use the calculator on this question?

Mathematical Content Area: Algebra and functions
This question measures algebra and functions. This content area extends from work with simple patterns, to basic algebra concepts, to sophisticated analysis. Students are expected to use algebraic notation and thinking in meaningful contexts to solve mathematical and real-world problems, addressing an increasing understanding of the use of functions. Other topics assessed include using open sentences and equations as representational tools and using the notion of equivalent representations to transform and solve number sentences and equations of increasing complexity.

Mathematical Ability: Problem solving
This question measures students' problem solving ability. Students demonstrate problem solving in mathematics when they recognize and formulate problems; determine the consistency of data; use strategies, data, models; generate, extend, and modify procedures; use reasoning in new settings; and judge the reasonableness and correctness of solutions. Problem solving situations require students to connect all of their mathematical knowledge of concepts, procedures, reasoning, and communication skills to solve problems.

Solution: 8 letters
0.20(6) + 0.33(8) = $3.84
Students may use a variety of strategies to solve this, including guess and check, formal algebra, or others. For example,

# of Postcards   # of Letters   Total Cost
1                13             4.49
2                12             4.36
3                11             4.23
4                10             4.10
5                9              3.97
6                8              3.84
7                7              3.71
8                6              3.58

OR

x + y = 14
0.20x + 0.33y = 3.84
Therefore, 0.20x + 0.33(14 - x) = 3.84, so x = 6 and y = 8

Score & Description:
Extended: Correct response
Satisfactory: Correct, complete process is indicated, but answer is not 8 and only has a minor computational error. OR Shows correct, complete process but does not indicate answer
Partial: Correct, complete process is indicated, but answer is not 8 and there are several computational errors (Process must clearly illustrate a correct strategy, such as a table or equations.) OR Correct response of 8 but shows no work or incomplete work
Minimal: Correct, complete process is indicated, but answer is not 8 and there are several computational errors (Process must clearly illustrate a correct strategy, such as a table or equations.) OR Correct response of 8 but shows no work or incomplete work
Incorrect: Incorrect response

This question was a word problem that asked the student to consider two values (the number of letters and the number of postcards) even though the student was only asked for the number of letters. This question could be solved in several ways. A student could reason numerically to find the number of letters and the number of postcards, possibly by using a guess-and-check strategy or by creating a table. Another possibility was to set up and solve a system of two linear equations in two unknowns. To earn full credit, students needed to show how they obtained the answer. Students were permitted to use a calculator.
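The table strategy described for this item can be expressed as a short enumeration. This is only an illustrative sketch: it works in cents to avoid floating-point rounding, and the function and parameter names are hypothetical.

```python
def letters_and_postcards(total_items=14, total_cents=384,
                          postcard_cents=20, letter_cents=33):
    """Try every split of the 14 mailings and keep those that cost exactly $3.84."""
    solutions = []
    for letters in range(total_items + 1):
        postcards = total_items - letters
        cost = postcard_cents * postcards + letter_cents * letters
        if cost == total_cents:
            solutions.append((postcards, letters))
    return solutions

# Each tuple is (number of postcards, number of letters).
answers = letters_and_postcards()
```

Because the total cost rises by 13 cents for each postcard replaced by a letter, only one split can match the postage, which agrees with the algebraic solution x = 6 and y = 8.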

DSTP Grade 8 Sample Item 2:
Val and Susan went shopping with $108.00. They wanted to buy sweatshirts and t-shirts for student council awards. T-shirts cost $9.00 and sweatshirts cost $12.00. Produce a chart or list to show all the different combinations of sweatshirts and t-shirts they can buy for exactly $108.00.

Score points   Response Attributes
2              Response contains all correct combinations*
1              Response contains at least two of the correct combinations and an unsuccessful attempt is made to find the remaining ones. OR Response indicates that a valid strategy was used but with minor computation errors.
0              Response contains insufficient evidence of appropriate skills/knowledge to successfully accomplish the task.

Sample Solution:
T-Shirt   Sweatshirt   Total Cost
0         9            $108
12        0            $108
4         6            $108
8         3            $108

Note: Responses will vary in methods and they should be examined carefully case by case. For example, if a student sets up a correctly labeled equation but uses trial and error to determine some of the combinations, the response deserves a score point of 1. OR Students might realize that 3 is a common factor of 9 and 12 and proceed to decompose 108 as nine 12's, because 12's can be broken down to obtain 9's.
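The full list of combinations asked for in this item can be checked with a brief enumeration. The sketch below is illustrative only, and the function and parameter names are hypothetical.

```python
def shirt_combinations(budget=108, tshirt_price=9, sweatshirt_price=12):
    """List every (t-shirts, sweatshirts) pair that spends the budget exactly."""
    combos = []
    for tshirts in range(budget // tshirt_price + 1):
        remaining = budget - tshirt_price * tshirts
        if remaining % sweatshirt_price == 0:
            combos.append((tshirts, remaining // sweatshirt_price))
    return combos

# Each tuple is (number of t-shirts, number of sweatshirts).
all_combos = shirt_combinations()
```

The enumeration returns the same four combinations shown in the sample solution, ordered by the number of t-shirts.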

Table 4. Descriptive Statistics for Grade 4

                   N.     Range    Min.     Max      Mean          SD          Variance     Skewness     Kurtosis
DSTP
Population         7731   303      319      622      454.5607295   36.370222   1322.79304   0.1522057    0.71118896
Sample*            7118   303      319      622      453.9335487   35.989737   1295.26114   0.1403643    0.74727557
NAEP
Plausible Value1   3124   172.5    134.38   306.88   235.8415845   25.706302   660.813945   -0.1867944   0.00808769
Plausible Value2   3124   180.39   128.64   309.03   236.1626376   25.294143   639.793651   -0.2616932   0.16656375
Plausible Value3   3124   180.48   135.40   315.88   235.8524232   25.504838   650.496738   -0.1487831   -0.06847261
Plausible Value4   3124   169.90   140.13   310.03   235.9314565   25.435945   646.98732    -0.2124309   0.00752819
Plausible Value5   3124   180.05   128.43   308.48   236.1806466   25.321775   641.192283   -0.2184781   0.07515276
Mean PL            3124   162.92   139.93   302.85   235.9937497   24.285345   589.777964   -0.1794628   -0.0542342

* The DSTP sample includes all students from the NAEP sampling schools.

Table 5. Scores Corresponding to Selected Percentile Ranks for Grade 4

PR    NAEP     DSTP Sample   DSTP Population
1     174      366           364.5
5     196      394           395
10    204.5    409           409.5
15    210      418.5         419
18    213      423.5         423.5
19    214      425           425
20    215      426           426
25    220      432           432
30    223.5    437           437
35    227      441           441
40    230      445.5         446
45    233.5    449           449.5
50    236      453           454
55    239.5    457.5         458
60    243      461           462
65    246      466           467
66    247      467.5         468
69    249      470           471
70    249.5    471           472
73    251.5    474           475
75    253      476           477
80    257      482           484
85    262      489           490
90    267      497           499
95    275.5    512           514
97    280      524.5         525.5
98    283      533.5         534
99    287.5    552           552
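A table like Table 5 can serve as a lookup from a DSTP score to a NAEP equivalent, which is the operation the validation step relies on. The sketch below is a minimal illustration that assumes linear interpolation between the selected percentile ranks of Table 5 (NAEP and DSTP sample columns). The study itself refers to full conversion tables in the appendices, which are not reproduced here, so values between the listed rows are approximations. The function name is hypothetical.

```python
import numpy as np

# Table 5, grade 4: DSTP sample scores and NAEP scores at the same percentile ranks.
DSTP_SAMPLE = np.array([
    366, 394, 409, 418.5, 423.5, 425, 426, 432, 437, 441, 445.5, 449, 453, 457.5,
    461, 466, 467.5, 470, 471, 474, 476, 482, 489, 497, 512, 524.5, 533.5, 552,
])
NAEP = np.array([
    174, 196, 204.5, 210, 213, 214, 215, 220, 223.5, 227, 230, 233.5, 236, 239.5,
    243, 246, 247, 249, 249.5, 251.5, 253, 257, 262, 267, 275.5, 280, 283, 287.5,
])

def dstp_to_naep(dstp_score):
    """Interpolate a NAEP equivalent for a grade 4 DSTP score.

    Scores outside the tabled range (366 to 552) are clamped to the end values,
    which is how np.interp treats out-of-range inputs.
    """
    return float(np.interp(dstp_score, DSTP_SAMPLE, NAEP))

proficient_equivalent = dstp_to_naep(470)   # tabled row: DSTP 470 pairs with NAEP 249
between_rows = dstp_to_naep(454)            # falls between the rows for PR 50 and PR 55
```

Averaging `dstp_to_naep` over a set of student scores would give one rough way to estimate a NAEP average from DSTP results; whether the study averaged converted scores or converted the average is not stated in this section.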

Figure 1. Relative Frequency for DSTP (Grade 4). [x-axis: DSTP Scale Score; y-axis: Relative Frequency]
Figure 2. Relative Frequency for NAEP (Grade 4). [x-axis: NAEP Mean PL; y-axis: Relative Frequency]
Figure 3. NAEP PL vs. DSTP Scale Score (GR 4). [x-axis: DSTP Scale Score; y-axis: NAEP Mean PL]
Figure 4. DSTP Scale Score vs. NAEP PL (GR 4). [x-axis: NAEP Mean PL; y-axis: DSTP Scale Score]

Table 6. Scores Corresponding to Selected Percentile Ranks for Grade 4

      NAEP Plausible Value                                         DSTP
PR    PL1     PL2     PL3     PL4     PL5     AIL     MPL     Sample   Population
1     170.5   168.5   172.5   171.5   171     170.8   174     366      364.5
3     186     185.5   187.5   187     187     186.6   189.5   385      385
4     190     189     191     191     190     190.2   193     390      390
5     193     192.5   193.5   193.5   192.5   193     196     394      395
6     195.5   196     195.5   195.5   194.5   195.4   198     397      399
10    202.5   204.5   202     203     203     203     204.5   409      409.5
15    209     210     209     209     210     209.4   210     418.5    419
19    213     214     214     213     214     213.6   214     425      425
20    214     215     215     214     215     214.6   215     426      426
25    219     219     219     219     220     219.2   220     432      432
30    223     223     223     223     224     223.2   223.5   437      437
35    226     227     226     227     227     226.6   227     441      441
40    230     230     230     230     231     230.2   230     445.5    446
45    233     234     233     233     234     233.4   233.5   449      449.5
50    236     237     236     237     237     236.6   236     453      454
55    240     240     240     240     240     240     239.5   457.5    458
60    243     243     243     243     244     243.2   243     461      462
65    246     247     246     247     247     246.6   246     466      467
69    249     249     249     249     249     249     249     470      471
70    250     250     250     250     250     250     249.5   471      472
75    254     254     254     254     254     254     253     476      477
80    258     258     258     258     258     258     257     482      484
85    263     263     263     263     263     263     262     489      490
90    268.5   268     268.5   268.5   268     268.3   267     497      499
95    277     276     277.5   276.5   277     276.8   275.5   512      514
96    279.5   279     279.5   278.5   279.5   279.2   277.5   517.5    518
97    282     282     282     281     282     281.8   280     524.5    525.5
98    285     285.5   286     285     285.5   285.4   283     533.5    534
99    292     290.5   291.5   291     291.5   291.3   287.5   552      552

AIL represents the average of five independent linkages. MPL represents the mean of five plausible values.
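The AIL column can be checked directly from the five plausible-value columns. The sketch below recomputes it for a few rows copied from Table 6, assuming that AIL is the simple arithmetic mean of the PL1 to PL5 entries at a given percentile rank, rounded to one decimal place. The function name is hypothetical.

```python
# Selected rows of Table 6: percentile rank -> (PL1..PL5 scores, tabled AIL).
TABLE6_ROWS = {
    1: ([170.5, 168.5, 172.5, 171.5, 171.0], 170.8),
    5: ([193.0, 192.5, 193.5, 193.5, 192.5], 193.0),
    50: ([236.0, 237.0, 236.0, 237.0, 237.0], 236.6),
    90: ([268.5, 268.0, 268.5, 268.5, 268.0], 268.3),
    99: ([292.0, 290.5, 291.5, 291.0, 291.5], 291.3),
}

def average_of_linkages(pl_scores):
    """Average the five scores obtained by linking each plausible value separately."""
    return sum(pl_scores) / len(pl_scores)

# True for each row where the recomputed average matches the tabled AIL.
matches = {
    pr: round(average_of_linkages(scores), 1) == ail
    for pr, (scores, ail) in TABLE6_ROWS.items()
}
```

MPL is a different quantity: the five plausible values are averaged per student first and the percentile is then taken on that single distribution. Because averaging reduces variance, the MPL scores sit closer to the center than the AIL scores at both extremes of the table.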