3 Assessment Design and Development

Chapter 3 of the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017) describes assessment design and development procedures. This chapter provides an overview of updates to assessment design and development for 2022–2023. The chapter first describes the design of science testlets. The chapter then provides an overview of 2022–2023 item writers’ characteristics and the 2022–2023 external review of items and testlets based on criteria for content, bias, and accessibility. The chapter concludes by presenting evidence of item quality, including a summary of field-test data analysis and associated reviews, a summary of the pool of operational testlets available for administration, and an evaluation of differential item functioning (DIF).

3.1 Assessment Structure

The DLM Alternate Assessment System uses learning maps as the basis for assessment, which are highly connected representations of how academic skills are acquired as reflected in the research literature. Nodes in the maps represent specific knowledge, skills, and understandings in science, as well as important foundational skills that provide an understructure for academic skills. The maps go beyond traditional learning progressions to include multiple pathways by which students develop content knowledge and skills.

To organize the highly complex learning maps, three broad claims were developed for science and then subdivided into 16 conceptual areas. For a complete description, see Chapter 2 of the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017). Claims are overt statements of what students are expected to learn and be able to demonstrate as a result of mastering skills within a very large neighborhood of the map. Conceptual areas are nested within claims and comprise multiple conceptually related content standards and the nodes that support and extend beyond the standards. The claims and conceptual areas apply to all grades in the DLM system.

Essential Elements (EEs) are specific statements of knowledge and skills, analogous to alternate or extended content standards. The EEs were developed by linking to the grade-level expectations identified in the Common Core State Standards. The purpose of the EEs is to build a bridge from the Common Core State Standards to academic expectations for students with the most significant cognitive disabilities.

Testlets are the basic units of measurement in the DLM system. Testlets are short, instructionally relevant measures of student knowledge, skills, and understandings. Each testlet is made up of three to five assessment items. Assessment items are developed based on nodes at the three linkage levels for each EE. Each testlet measures an EE and linkage level. The Target linkage level reflects the grade-level expectation aligned directly to the EE. For each EE, small collections of nodes are identified earlier in the map that represent critical junctures on the path toward the grade-level expectation.

There are two linkage levels below the Target, resulting in three linkage levels for each EE:

  1. Initial
  2. Precursor
  3. Target

3.2 Testlet and Item Writing

This section describes information pertaining to item writing and item writer demographics for the 2022–2023 year. For a complete summary of item and testlet development procedures, see Chapter 3 of the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017).

3.2.1 2023 Testlet and Item Writing

Item development for 2022–2023 focused on replenishing and increasing the pool of testlets in all content areas. A total of 100 science testlets were produced by 21 item writers.

3.2.1.1 Item Writers

Item writers were selected from the Accessible Teaching, Learning, and Assessment Systems (ATLAS) MemberClicks database. The database is a profile-based recruitment tool hosted in MemberClicks and includes individuals recruited via the DLM governance board and social media, individuals who have previously participated in item writing, and individuals who created profiles via the “sign up to participate in DLM events” link on the DLM homepage. Interested individuals create and update their participant profile. Participant profiles include demographic, education, and work experience data.

A total of 695 individuals with profiles in the ATLAS MemberClicks database were initially invited to participate in 2023 item writing. Minimum eligibility criteria included at least 1 year of teaching experience, teaching in a DLM state, and experience with the DLM alternate assessment. Prior DLM event participation, subject matter expertise, population expertise, and distribution of experience in each grade band were also considered in selection and assignment to a subject area. Of the 695 individuals initially invited to participate, 21 registered, completed advance training, and committed to attend the workshop. All 21 registered item writers attended both days of the training event and completed at least rounds 1 and 2 of item writing.

The demographics for the item writers are presented in Table 3.1. The median and range of years of item writers’ teaching experience are shown in Table 3.2. Of the item writers who responded to the question, the median years of pre-K–12 teaching experience was 16. Item writers represented Grades 3–8 about equally, with greater representation of high school. See Table 3.3 for a summary of item writers’ grade-level teaching experience.

Table 3.1: Demographics of the Item Writers
n %
Gender
Female 18 85.7
Male   3 14.3
Race
White 20 95.2
African American   1   4.8
Hispanic ethnicity
Non-Hispanic 20 95.2
Chose not to disclose   1   4.8
Table 3.2: Item Writers’ Years of Teaching Experience
Teaching Experience n Median Range
Pre-K–12 20 16.0 3–30
Science 19 12.0 1–30
* The n column indicates the number of nonmissing responses to the survey question.
Table 3.3: Item Writers’ Grade-Level Teaching Experience
Grade level n %
Grade 3   8 38.1
Grade 4   8 38.1
Grade 5   6 28.6
Grade 6   7 33.3
Grade 7   9 42.9
Grade 8   8 38.1
High school 12 57.1
Note. Item writers could indicate multiple grade levels.

The 21 item writers represented a highly qualified group of professionals with both content and special education perspectives. The degrees held by item writers are shown in Table 3.4. All item writers held at least a bachelor’s degree. The vast majority of the item writers (n = 19; 90%) also held a master’s degree, for which the most common field of study was special education (n = 11; 58%).

Table 3.4: Item Writers’ Degree Type (N = 21)
Degree n %
Bachelor’s degree 21 100.0
Education   4   19.0
Special education 10   47.6
Other   6   28.6
Missing   1     4.8
Master’s degree 19 90.5
Education   3 15.8
Special education 11 57.9
Other   5 26.3

Item writers reported a range of experience working with students with different disabilities, as summarized in Table 3.5. Item writers collectively had the most experience working with students with a significant cognitive disability (n = 16; 76%), a mild cognitive disability (n = 15; 71%), or multiple disabilities (n = 14; 67%).

Table 3.5: Item Writers’ Experience With Disability Categories
Disability category n %
Blind/low vision   6 28.6
Deaf/hard of hearing   6 28.6
Emotional disability 11 52.4
Mild cognitive disability 15 71.4
Multiple disabilities 14 66.7
Orthopedic impairment   6 28.6
Other health impairment 14 66.7
Significant cognitive disability 16 76.2
Specific learning disability 11 52.4
Speech impairment 10 47.6
Traumatic brain injury   4 19.0
No disability   2   9.5
Note. Item writers could select multiple categories.

The professional roles reported by the 2022–2023 item writers are shown in Table 3.6. While item writers had a range of professional roles, they were primarily classroom educators.

Table 3.6: Professional Roles of Item Writers
Role n %
Classroom educator 16 76.2
District staff   1   4.8
Instructional coach   1   4.8
State education agency   3 14.3

Item writers came from 11 different states. The geographic areas of the institutions in which item writers taught or held a position are reported in Table 3.7. Within the survey, rural was defined as a population of fewer than 2,000 inhabitants, suburban was defined as a city of 2,000–50,000 inhabitants, and urban was defined as a city of more than 50,000 inhabitants.

Table 3.7: Institution Geographic Areas for Item Writers
Geographic area n %
Rural 10 47.6
Suburban   6 28.6
Urban   5 23.8

3.2.1.2 Item-Writing Process

The item-writing process for 2022–2023 began with item writers completing three advance training modules: a module providing an overview of DLM and two content-specific modules. In January 2023, item writers and staff gathered in Kansas City for an on-site item-writing workshop. During this workshop, item writers received additional training and worked on producing and peer reviewing two testlets. Following the on-site workshop, item writers continued producing and peer reviewing testlets virtually via a secure online platform through April 2023. A total of 100 testlets were written for science.

3.2.2 External Reviews

3.2.2.1 Items and Testlets

The purpose of external reviews of items and testlets is to evaluate whether items and testlets measure the intended content, are accessible, and are free of bias or sensitive content. Panelists use external review criteria established for DLM alternate assessments to rate each item and testlet as “accept,” “revise,” or “reject.” Panelists provide recommended changes for “revise” ratings and explanations for “reject” ratings. The test development team uses the collective feedback from the panelists to inform decisions about items and testlets prior to field-testing.

The content reviews of items and testlets for 2022–2023 were conducted during 2-day virtual meetings. The accessibility and bias/sensitivity reviews for 2022–2023 were conducted during 3-day virtual meetings.

3.2.2.1.1 Overview of Review Process

Panelists were selected from the ATLAS MemberClicks database based on predetermined qualifications for each panel type. Panelists were assigned to content, accessibility, or bias and sensitivity panels based on their qualifications.

In 2023, there were 36 panelists. Of those, 16 were science panelists. There were also 16 accessibility panelists and four bias and sensitivity panelists who reviewed items and testlets from all subjects.

Prior to participating in the virtual panel meetings, panelists completed an advance training course that included an External Review Procedures module and a module specifically aligned to their assigned panel type. The content modules were subject-specific, while the accessibility and bias and sensitivity modules applied to all subjects. After each module, panelists completed a quiz and were required to score 80% or higher to continue to the next stage of training.

Following the completion of advance training, panelists met with their panels in a virtual environment. Each panel was led by an ATLAS facilitator and co-facilitator. Facilitators provided additional training on the review platform and the criteria used to review items and testlets. Panelists began their reviews with a calibration collection (two testlets) to calibrate their expectations for the review. Following the calibration collection, panelists reviewed collections of items and testlets independently. Once all panelists completed the review, facilitators used a discussion protocol known as the Rigorous Item Feedback protocol to discuss any items or testlets that were rated either revise or reject by a panelist and to obtain collective feedback about those items and testlets. The Rigorous Item Feedback protocol helps facilitators elicit detailed, substantive feedback from panelists and record feedback in a uniform fashion. Following the discussion, panelists were given another collection of items and testlets to review. This process was repeated until all collections of items and testlets were reviewed. Collections ranged from 8 to 17 testlets, depending on the panel type; content panels had fewer testlets per collection, and bias and sensitivity and accessibility panels had more testlets per collection.

3.2.2.1.2 External Reviewers

The demographics for the external reviewers are presented in Table 3.8. The median and range of years of teaching experience are shown in Table 3.9. The median years of experience for external reviewers was 13 years in pre-K–12 and 12 years in science. External reviewers represented all grade levels, with slightly greater representation for Grades 6–8 and high school. See Table 3.10 for a summary.

Table 3.8: Demographics of the External Reviewers
n %
Gender
Female 29 80.6
Male   7 19.4
Race
White 33 91.7
African American   3   8.3
Hispanic ethnicity
Non-Hispanic 35 97.2
Hispanic   1   2.8
Table 3.9: External Reviewers’ Years of Teaching Experience
Teaching experience Median Range
Pre-K–12 13.0 7–30
Science 12.0 3–25
Table 3.10: External Reviewers’ Grade-Level Teaching Experience
Grade level n %
Grade 3 13 36.1
Grade 4 14 38.9
Grade 5 17 47.2
Grade 6 25 69.4
Grade 7 26 72.2
Grade 8 25 69.4
High school 22 61.1
Note. Reviewers could indicate multiple grade levels.

The 36 external reviewers represented a highly qualified group of professionals. The level of degree and most common types of degrees held by external reviewers are shown in Table 3.11. All external reviewers held at least a bachelor’s degree, and a majority (n = 28; 78%) also held a master’s degree, for which the most common field of study was special education (n = 11; 39%).

Table 3.11: External Reviewers’ Degree Type
Degree n %
Bachelor’s degree 36 100.0
Education 11   30.6
Special education 10   27.8
Other 12   33.3
Missing   3     8.3
Master’s degree 28 77.8
Education   9 32.1
Special education 11 39.3
Other   8 28.6

Most external reviewers had experience working with students with disabilities (92%), and 83% had experience with the administration of alternate assessments. The variation in percentages suggests some external reviewers may have had experience with the administration of alternate assessments but perhaps did not regularly work with students with disabilities.

External reviewers reported a range of experience working with students with different disabilities, as summarized in Table 3.12. External reviewers collectively had the most experience working with students with a specific learning disability (n = 29; 81%), an emotional disability (n = 28; 78%), or multiple disabilities (n = 28; 78%).

Table 3.12: External Reviewers’ Experience With Disability Categories
Disability category n %
Blind/low vision 18 50.0
Deaf/hard of hearing 13 36.1
Emotional disability 28 77.8
Mild cognitive disability 27 75.0
Multiple disabilities 28 77.8
Orthopedic impairment 10 27.8
Other health impairment 23 63.9
Significant cognitive disability 24 66.7
Specific learning disability 29 80.6
Speech impairment 21 58.3
Traumatic brain injury   9 25.0
Note. Reviewers could select multiple categories.

Panelists had varying experience teaching students with the most significant cognitive disabilities. Science panelists had a median of 11 years of experience teaching students with the most significant cognitive disabilities, with a minimum of 2 years and a maximum of 20 years of experience.

The professional roles reported by the 2022–2023 reviewers are shown in Table 3.13. While the reviewers had a range of professional roles, they were primarily classroom educators.

Table 3.13: Professional Roles of External Reviewers
Role n %
Classroom educator 32 88.9
Instructional coach   3   8.3
Other   1   2.8

Science panelists were from nine different states. The geographic areas of the institutions in which reviewers taught or held a position are reported in Table 3.14. Within the survey, rural was defined as a population of fewer than 2,000 inhabitants, suburban was defined as a city of 2,000–50,000 inhabitants, and urban was defined as a city of more than 50,000 inhabitants.

Table 3.14: Institution Geographic Areas for External Reviewers
Geographic area n %
Rural 20 55.6
Suburban   8 22.2
Urban   8 22.2
3.2.2.1.3 Results of External Reviews

The percentage of items and testlets rated as “accept” across panels and rounds of review ranged from 41% to 96% and from 64% to 85%, respectively. The percentage of items and testlets rated as “revise” across panels and rounds of review ranged from 4% to 50% and from 15% to 24%, respectively. The percentage of items and testlets recommended for rejection across panels and rounds of review ranged from 0% to 9% and from 0% to 12%, respectively.

3.2.2.1.4 Test Development Team Decisions

Because each item and testlet is examined by three distinct panels, ratings were compiled across panels, following the process described in Chapter 3 of the 2021–2022 Technical Manual Update—Science (Dynamic Learning Maps Consortium, 2022). The test development team reviews the collective feedback provided by panelists for each item and testlet. Once the test development team views each item and testlet and considers the feedback provided by the panelists, it assigns one of the following decisions to each one: (a) accept as is; (b) minor revision, pattern of minor concerns, will be addressed; (c) major revision needed; (d) reject; and (e) more information needed.

The science test development team accepted 57% of testlets and 42% of items as is. Of the items and testlets that were revised, most required major changes (e.g., stem or response option replaced) as opposed to minor changes (e.g., minor rewording but concept remained unchanged). The science test development team made 34 minor revisions and 184 major revisions to items and rejected 14 testlets.

Most of the items and testlets reviewed will be field-tested during the spring 2024 window.

3.3 Evidence of Item Quality

Each year, testlets are added to and removed from the operational pool to maintain a pool of high-quality testlets. The following sections describe evidence of item quality, including evidence supporting field-test testlets available for administration, a summary of the operational pool, and evidence of DIF.

3.3.1 Field-Testing

During 2022–2023, DLM field-test testlets were administered to evaluate item quality before promoting testlets to the operational pool. Adding testlets to the operational pool allows multiple testlets to be available in the instructionally embedded and spring assessment windows. This means that teachers can assess the same EE and linkage level multiple times in the instructionally embedded window, if desired, and item exposure is reduced for the EEs and linkage levels that are assessed most frequently. Additionally, deepening the operational pool allows testlets to be evaluated for retirement in instances in which other testlets show better performance.

In this section, we describe the field-test testlets administered in 2022–2023 and the associated review activities. A summary of prior field-test events can be found in Chapter 3 of the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017).

3.3.1.1 Description of Field Tests Administered in 2022–2023

Testlets were made available for field-testing based on the availability of field-test content for each EE and linkage level.

During the spring assessment window, field-test testlets were administered to each student after completion of the operational assessment. A field-test testlet was assigned for an EE that was assessed during the operational assessment, at a linkage level equal to or adjacent to the linkage level of the operational testlet.
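
The sketch below illustrates this assignment rule under stated assumptions: the linkage-level ordering, the mapping of available field-test testlets, and the random tie-breaking are hypothetical details added for the example and do not describe the operational DLM assignment engine.

```python
# Illustrative sketch of the field-test assignment rule described above: for an EE
# assessed operationally, select an available field-test testlet at the same or an
# adjacent linkage level. Data structures and tie-breaking are assumed for the example.
import random

LINKAGE_LEVELS = ["Initial", "Precursor", "Target"]

def eligible_levels(operational_level: str) -> list[str]:
    """Return the operational linkage level plus any adjacent linkage levels."""
    i = LINKAGE_LEVELS.index(operational_level)
    return LINKAGE_LEVELS[max(i - 1, 0):i + 2]

def assign_field_test(ee: str, operational_level: str, available: dict) -> str | None:
    """Pick a field-test testlet ID for the EE at an eligible linkage level, if any exist.

    `available` maps (ee, linkage_level) pairs to lists of field-test testlet IDs.
    """
    candidates = [
        testlet_id
        for level in eligible_levels(operational_level)
        for testlet_id in available.get((ee, level), [])
    ]
    return random.choice(candidates) if candidates else None
```

For example, a student whose operational testlet for an EE was at the Precursor level could receive a field-test testlet for that EE at the Initial, Precursor, or Target level, depending on availability.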

Table 3.15 summarizes the number of field-test testlets available during spring 2023. A total of 84 testlets were available across grade bands.

Table 3.15: Spring 2023 Field-Test Testlets
Grade band n
Elementary 31
Middle school 18
High school 19
Biology 16

Table 3.16 presents the demographic breakdown of students completing field-test testlets in science in 2022–2023. Consistent with the DLM population, approximately 67% of students completing field-test testlets were male, approximately 60% were White, and approximately 78% were non-Hispanic. The vast majority of students completing field-test testlets were not English-learner eligible or monitored. The students completing field-test testlets were split across the four complexity bands, with most students assigned to Band 1 or Band 2. See Chapter 4 of this manual for a description of the student complexity bands.

Table 3.16: Demographic Summary of Students Participating in Field Tests
Demographic group n %
Gender
Female   8,611 33.4
Male 17,177 66.5
Nonbinary/undesignated       26   0.1
Race
White 15,539 60.2
African American   5,159 20.0
Asian   1,278   5.0
American Indian      723   2.8
Alaska Native       57   0.2
Two or More Races   2,884 11.2
Native Hawaiian or Pacific Islander      174   0.7
Hispanic ethnicity
Non-Hispanic 20,202 78.3
Hispanic   5,612 21.7
English learning (EL) participation
Not EL eligible or monitored 24,285 94.1
EL eligible or monitored   1,529   5.9
Science complexity band
Foundational   4,061 15.7
Band 1 11,036 42.8
Band 2   7,716 29.9
Band 3   3,001 11.6
Note. See Chapter 4 of this manual for a description of student complexity bands.

Participation in field-testing was not required, but educators were encouraged to administer all available testlets to their students. In total, 25,814 students (62%) completed at least one field-test testlet (Table 3.17). In the spring assessment window, >99% of field-test testlets had a sample size of at least 20 students (i.e., the threshold for item review).

Table 3.17: 2022–2023 Field-Test Participation (Spring Assessment Window)
Subject n %
Science 25,814 61.7

3.3.1.2 Results of Item Analysis

All flagged items are reviewed by test development teams following field-testing. Items are flagged if they meet either of the following statistical criteria, which are illustrated in the sketch after the list:

  • The item is too challenging, as indicated by a p-value less than .35. This value was selected as the threshold for flagging because most DLM assessment items offer three response options, so a value less than .35 may indicate less than chance selection of the correct response option.

  • The item is significantly easier or harder than other items assessing the same EE and linkage level, as indicated by a weighted standardized difference greater than two standard deviations from the mean p-value for that EE and linkage level combination.
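
The sketch below shows how these two checks might be applied to item-level response data. It is a minimal illustration: the data frame, its column names, and the exact form of the weighted standardized difference are assumptions made for the example rather than the operational DLM computation.

```python
# Minimal sketch of the two flagging rules, assuming item-level response data with
# hypothetical columns: ee, linkage_level, item_id, and score (0 = incorrect, 1 = correct).
import numpy as np
import pandas as pd

def flag_field_test_items(responses: pd.DataFrame, min_n: int = 20) -> pd.DataFrame:
    # Proportion correct (p-value) and sample size for each item
    items = (
        responses.groupby(["ee", "linkage_level", "item_id"])["score"]
        .agg(p_value="mean", n="count")
        .reset_index()
    )
    items = items[items["n"] >= min_n]  # minimum sample size for review

    def add_std_diff(group: pd.DataFrame) -> pd.DataFrame:
        # Weighted mean and SD of p-values for the EE/linkage level (weighted by n);
        # the operational weighting may differ, so this is an assumed form.
        weights = group["n"]
        mean = np.average(group["p_value"], weights=weights)
        sd = np.sqrt(np.average((group["p_value"] - mean) ** 2, weights=weights))
        z = (group["p_value"] - mean) / sd if sd > 0 else 0.0
        return group.assign(std_diff=z)

    items = items.groupby(["ee", "linkage_level"], group_keys=False).apply(add_std_diff)

    # Flag items that are too difficult or inconsistent with the other items
    # measuring the same EE and linkage level
    items["flag_p_value"] = items["p_value"] < 0.35
    items["flag_std_diff"] = items["std_diff"].abs() > 2
    return items
```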

Figure 3.1 summarizes the p-values for items that met the minimum sample size threshold of 20. Most items fell above the .35 flagging threshold: 1,800 items (98%) were above the threshold, and test development teams reviewed the 30 items (2%) that fell below it.

Figure 3.1: p-values for Science 2022–2023 Field-Test Items

This figure contains a histogram displaying the number of science field-test items within each p-value bin.

Note. Items with a sample size less than 20 were omitted.

Items in the DLM assessments are designed and developed to be fungible (i.e., interchangeable) within each EE and linkage level, meaning field-test items should perform consistently with the operational items measuring the same EE and linkage level. To evaluate whether field-test items perform similarly to operational items measuring the same EE and linkage level, standardized difference values were also calculated for the field-test items. Figure 3.2 summarizes the standardized difference values for items field tested for science. Most items fell within two standard deviations of the mean for the EE and linkage level. Items beyond the threshold were reviewed by test development teams for each subject.

Figure 3.2: Standardized Difference Z-Scores for Science 2022–2023 Field-Test Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of science field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

A total of 12 science testlets (11%) had at least one item flagged due to their p-value and/or standardized difference value. Test development teams reviewed all flagged items and their context within the testlet to identify possible reasons for the flag and to determine whether an edit was likely to resolve the issue.

Of the 97 science testlets that were not flagged, 92 (95%) were promoted as is to the operational pool and five (5%) were rejected and retired. All 12 of the flagged science testlets (100%) were rejected and retired.

In addition to these reviews, field-test items were reviewed for DIF following the same procedures for items in the operational pool (see Section 3.3.3 of this manual). No field-test items in science were flagged for nonnegligible DIF.

3.3.1.3 Field-Test Data Review

Data collected during each field test are compiled, and statistical flags are implemented ahead of test development team review. Flagging criteria serve as a source of evidence for test development teams in evaluating item quality; however, final judgments are content based, taking into account the testlet as a whole, the linkage level the items were written to assess, and pool depth.

Review of field-test data occurs annually during February and March. This review includes data from the previous spring assessment window. That is, the review in February and March of 2023 includes field-test data collected during the 2022 spring assessment window. Data that were collected during the 2023 spring assessment window will be reviewed in February and March of 2024, with results included in the 2023–2024 technical manual update.

Test development teams for each subject classified each reviewed item into one of four categories:

  1. No changes made to item. Test development team decided item can go forward to operational assessment.
  2. Test development team identified concerns that required modifications. Modifications were clearly identifiable and were likely to improve item performance.
  3. Test development team identified concerns that required modifications. The content was worth preserving rather than rejecting. Item review may not have clearly pointed to specific modifications that were likely to improve the item.
  4. Rejected item. Test development team determined the item was not worth revising.

For an item to be accepted as is, the test development team had to determine that the item was consistent with DLM item-writing guidelines and that the item was aligned to the linkage level. An item or testlet was rejected completely if it was inconsistent with DLM item-writing guidelines, if the EE and linkage level were covered by other testlets that had better-performing items, or if there was no clear content-based revision to improve the item. In some instances, a decision to reject an item resulted in the rejection of the testlet, as well.

Common reasons for flagging an item for modification included items that were misaligned to the linkage level, distractors that could be argued as partially correct, or unnecessary complexity in the language of the stem. After reviewing flagged items, the test development team looked at all items classified into Category 3 or Category 4 within the testlet to help determine whether to retain or reject the testlet. Here, the test development team could elect to keep the testlet (with or without revision) or reject it. If a revision was needed, it was assumed the testlet needed field-testing again. The entire testlet was rejected if the test development team determined the flagged items could not be adequately revised.

3.3.2 Operational Assessment Items for 2022–2023

The pool of operational items was updated for 2022–2023: 34 science testlets that were field-tested in 2021–2022 were promoted to the operational pool, and no testlets were retired due to model fit. For a discussion of the model-based retirement process, see Chapter 5 of this manual.

Testlets were made available for operational testing in 2022–2023 based on the 2021–2022 operational pool and the promotion of testlets field-tested during 2021–2022 to the operational pool following their review. Table 3.18 summarizes the total number of operational testlets for 2022–2023. In total, there were 174 operational testlets available. This total included 36 EE/linkage level combinations for which both a general version and a version for students who are blind or visually impaired or read braille were available.

Operational assessments were administered during the instructionally embedded and spring assessment windows. A total of 367,386 test sessions were administered across both assessment windows. One test session is one testlet taken by one student. Only test sessions that were complete at the close of each assessment window counted toward the total.

Table 3.18: 2022–2023 Operational Testlets, by Grade Band (N = 174)
Grade band n
Elementary 44
Middle school 45
High school 48
Biology 37
Note. Three Essential Elements are shared across the high school and Biology assessments.

3.3.2.1 Educator Perception of Assessment Content

Each year, test administrators are asked two questions about their perceptions of the assessment content: whether the DLM assessments measured important academic skills and whether they reflected high expectations for their students. Participation in the test administrator survey is described in Chapter 4 of this manual. Table 3.19 summarizes test administrators’ responses in 2022–2023.

Test administrators generally responded that content reflected high expectations for their students (87% agreed or strongly agreed) and measured important academic skills (78% agreed or strongly agreed). While the majority of test administrators agreed with these statements, 13%–22% disagreed. DLM assessments represent a departure from the breadth of academic skills assessed by many states’ previous alternate assessments. Given the short history of general curriculum access for this population and the tendency to prioritize the instruction of functional academic skills (Karvonen et al., 2011), test administrators’ responses may reflect awareness that DLM assessments contain challenging content. However, test administrators were divided on the importance of that content in the educational programs of students with the most significant cognitive disabilities. Feedback from focus groups with educators about score reports reflected similar variability in educator perceptions of assessment content (Clark et al., 2018, 2022).

Table 3.19: Educator Perceptions of Assessment Content
Statement   Strongly disagree n (%)   Disagree n (%)   Agree n (%)   Strongly agree n (%)
Content measured important academic skills and knowledge for this student.   1,807 (8.3)   2,975 (13.7)   13,033 (59.9)   3,958 (18.2)
Content reflected high expectations for this student.   942 (4.4)   1,945 (9.0)   12,894 (59.7)   5,811 (26.9)

3.3.2.2 Psychometric Properties of Operational Assessment Items for 2022–2023

The proportion correct (p-value) was calculated for all operational items to summarize information about item difficulty.

Figure 3.3 shows the distribution of p-values for operational items in science. To prevent items with small sample sizes from potentially skewing the results, the sample size cutoff for inclusion in the p-value plots was 20. No items were excluded due to small sample size. The p-values for most science items were between .4 and .7.

Figure 3.3: p-values for Science 2023 Operational Items

A histogram displaying p-value on the x-axis and the number of science operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Items in the DLM assessments are designed and developed to be fungible (i.e., interchangeable) within each EE and linkage level, meaning that the items are expected to function identically to the other items measuring the same EE and linkage level. To evaluate the fungibility assumption, standardized difference values were calculated for all operational items with a student sample size of at least 20, comparing the p-value for each item to the p-values of all other items measuring the same EE and linkage level. If an item is fungible with the other items measuring the same EE and linkage level, the item is expected to have a nonsignificant standardized difference value. The standardized difference values provide one source of evidence of internal consistency.

Figure 3.4 summarizes the distribution of standardized difference values for operational items in science. More than 99% of items fell within two standard deviations of the mean for their EE and linkage level. As additional data are collected and decisions are made regarding item pool replenishment, test development teams will consider item standardized difference values, along with item misfit analyses, when determining which items and testlets are recommended for retirement.

Figure 3.4: Standardized Difference Z-Scores for Science 2022–2023 Operational Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of science operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.5 summarizes the distributions of standardized difference values for operational items by linkage level. Most items fell within two standard deviations of the mean of all items measuring the respective EE and linkage level, and the distributions are consistent across linkage levels.

Figure 3.5: Standardized Difference Z-Scores for 2022–2023 Operational Items by Linkage Level

This figure contains a histogram displaying standardized difference on the x-axis and the number of science operational items on the y-axis. The histogram has a separate row for each linkage level.

Note. Items with a sample size less than 20 were omitted.

3.3.3 Evaluation of Item-Level Bias

DIF identifies instances where test items are more difficult for some groups of examinees despite these examinees having similar knowledge and understanding of the assessed concepts (Camilli & Shepard, 1994). DIF analyses can uncover internal inconsistency if particular items are functioning differently in a systematic way for identifiable subgroups of students (American Educational Research Association et al., 2014). While identification of DIF does not always indicate a weakness in the test item, it can point to construct-irrelevant variance, posing considerations for validity and fairness.

3.3.3.1 Method

DIF analyses followed the same procedure used in previous years and examined race in addition to gender. Analyses included data from 2015–2016 through 2021–2022 to flag items for evidence of DIF (DIF analyses are conducted on the sample of data used to update the model calibration, which uses data through the previous operational assessment; see Chapter 5 of this manual for more information). Items were selected for inclusion in the DIF analyses based on minimum sample-size requirements for the three gender subgroups (female, male, and nonbinary/undesignated) and for race subgroups: African American, Alaska Native, American Indian, Asian, multiple races, Native Hawaiian or Pacific Islander, and White.

The DLM student population is unbalanced in both gender and race. The number of female students responding to items is smaller than the number of male students by a ratio of approximately 1:2, and nonbinary/undesignated students make up less than 0.1% of the DLM student population. Similarly, the number of non-White students responding to items is smaller than the number of White students by a ratio of approximately 1:2. Therefore, on advice from the DLM Technical Advisory Committee, the threshold for including an item in the DIF analysis requires that the focal group (i.e., the historically disadvantaged group) have at least 100 students responding to the item. The threshold of 100 was selected to balance the need for a sufficient sample size in the focal group with the relatively low number of students responding to many DLM items.

Consistent with previous years, additional criteria were included to prevent estimation errors. Items with an overall proportion correct (p-value) greater than .95 or less than .05 were removed from the analyses. Items for which the p-value for one gender or racial group was greater than .97 or less than .03 were also removed from the analyses.
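
As a rough illustration, these inclusion rules can be expressed as simple filters on an item-by-focal-group summary table; the data frame and its column names (focal_n, item_p_value, p_value_focal, p_value_reference) are hypothetical and are used only to make the thresholds concrete.

```python
# Sketch of the DIF inclusion criteria, assuming one row per item and focal group
# comparison, with hypothetical summary columns.
import pandas as pd

def select_comparisons_for_dif(comparisons: pd.DataFrame) -> pd.DataFrame:
    enough_focal = comparisons["focal_n"] >= 100                  # focal group sample size
    item_p_ok = comparisons["item_p_value"].between(0.05, 0.95)   # overall p-value bounds
    subgroup_p_ok = (
        comparisons["p_value_focal"].between(0.03, 0.97)          # subgroup p-value bounds
        & comparisons["p_value_reference"].between(0.03, 0.97)
    )
    return comparisons[enough_focal & item_p_ok & subgroup_p_ok]
```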

For each item, logistic regression was used to predict the probability of a correct response, given group membership and performance in the subject. Specifically, the logistic regression equation for each item included a matching variable comprising the student’s total linkage levels mastered in the subject of the item and a group membership variable, with the reference group (i.e., males for gender, White students for race) coded as 0 and the focal group (i.e., females or nonbinary/undesignated students for gender; African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, or two or more races for race) coded as 1. An interaction term was included to evaluate whether nonuniform DIF was present for each item (Swaminathan & Rogers, 1990); the presence of nonuniform DIF indicates that the item functions differently because of the interaction between total linkage levels mastered and the student’s group (i.e., gender or racial group). When nonuniform DIF is present, the group with the highest probability of a correct response to the item differs along the range of total linkage levels mastered; thus, one group is favored at the low end of the spectrum and the other group is favored at the high end.

Three logistic regression models were fitted for each item:

\[\begin{align} \text{M}_0\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} \tag{3.1} \\ \text{M}_1\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G \tag{3.2} \\ \text{M}_2\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G + \beta_3\text{X}G\tag{3.3} \end{align}\]

where \(\pi_i\) is the probability of a correct response to item i, \(\text{X}\) is the matching criterion, \(G\) is a dummy coded grouping variable (0 = reference group, 1 = focal group), \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(\beta_2\) is the group-specific parameter, and \(\beta_3\) is the interaction term.
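
To make the model comparison concrete, the sketch below fits the three models for a single item using the statsmodels formula interface and computes likelihood-ratio tests for the added terms. The data frame and its column names (correct, X, G) are hypothetical, the use of statsmodels is an assumption for illustration, and the specific model comparisons shown for the likelihood-ratio tests are assumptions chosen to be consistent with the effect-size comparisons described below; this is a sketch of the general approach, not the operational DLM code.

```python
# Sketch of fitting M0, M1, and M2 for one item. Assumes a hypothetical data frame
# with columns: correct (0/1 item score), X (total linkage levels mastered in the
# subject), and G (0 = reference group, 1 = focal group).
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def fit_dif_models(item_data: pd.DataFrame):
    m0 = smf.logit("correct ~ X", item_data).fit(disp=False)            # matching variable only
    m1 = smf.logit("correct ~ X + G", item_data).fit(disp=False)        # adds group term
    m2 = smf.logit("correct ~ X + G + X:G", item_data).fit(disp=False)  # adds interaction term

    # Likelihood-ratio chi-square tests for the added terms
    uniform_chi2 = 2 * (m1.llf - m0.llf)     # M1 vs. M0: group term (1 df)
    nonuniform_chi2 = 2 * (m2.llf - m0.llf)  # M2 vs. M0: group + interaction terms (2 df)
    uniform_p = stats.chi2.sf(uniform_chi2, df=1)
    nonuniform_p = stats.chi2.sf(nonuniform_chi2, df=2)
    return m0, m1, m2, uniform_p, nonuniform_p
```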

Because of the number of items evaluated for DIF, Type I error rates were susceptible to inflation. Incorporating an effect-size measure helps distinguish practical significance from statistical significance by providing a metric of the magnitude of the effect of adding the group and interaction terms to the regression model.

For each item, the change in the Nagelkerke pseudo \(R^2\) measure of effect size was captured, from \(M_0\) to \(M_1\) or \(M_2\), to account for the effect of the addition of the group and interaction terms to the equation. All effect-size values were reported using both the Zumbo and Thomas (1997) and Jodoin and Gierl (2001) indices for reflecting a negligible (also called A-level DIF), moderate (B-level DIF), or large effect (C-level DIF). The Zumbo and Thomas thresholds for classifying DIF effect size are based on Cohen’s (1992) guidelines for identifying a small, medium, or large effect. The thresholds for each level are .13 and .26; values less than .13 have a negligible effect, values between .13 and .26 have a moderate effect, and values of .26 or greater have a large effect. The Jodoin and Gierl thresholds are more stringent, with lower threshold values of .035 and .07 to distinguish between negligible, moderate, and large effects.
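
Continuing the sketch above, the change in the Nagelkerke pseudo \(R^2\) and its classification under the two sets of thresholds might be computed as follows. The Nagelkerke formula is the standard rescaling of the Cox–Snell measure; computing it from statsmodels result attributes and wrapping the thresholds in a helper function are illustrative choices, not a description of the operational code.

```python
# Sketch of the effect-size computation and classification described above.
import numpy as np

def nagelkerke_r2(result) -> float:
    # Nagelkerke pseudo R^2 from a fitted statsmodels Logit result:
    # the Cox-Snell R^2 rescaled by its maximum attainable value.
    n = result.nobs
    cox_snell = 1 - np.exp(2 * (result.llnull - result.llf) / n)
    max_cox_snell = 1 - np.exp(2 * result.llnull / n)
    return cox_snell / max_cox_snell

def classify_dif_effect(delta_r2: float, criteria: str = "jodoin_gierl") -> str:
    # A = negligible, B = moderate, C = large, using the thresholds described above
    low, high = (0.035, 0.07) if criteria == "jodoin_gierl" else (0.13, 0.26)
    if delta_r2 < low:
        return "A"
    return "B" if delta_r2 < high else "C"

# Example (using the models from the previous sketch): the effect size for the uniform
# model is the change from M0 to M1; for the nonuniform model, from M0 to M2.
# delta_uniform = nagelkerke_r2(m1) - nagelkerke_r2(m0)
# delta_nonuniform = nagelkerke_r2(m2) - nagelkerke_r2(m0)
# level = classify_dif_effect(delta_nonuniform, criteria="zumbo_thomas")
```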

3.3.3.2 DIF Results

Using the above criteria for inclusion, 573 items (86%) were selected for at least one gender group comparison, and 504 items (75%) were selected for at least one racial group comparison.

Because students taking DLM assessments represent three possible gender groups, up to two comparisons can be made for each item, with the male group as the reference group and each of the other two groups (i.e., female, nonbinary/undesignated) as the focal group. Across all items, this results in 1,338 possible comparisons. Using the inclusion criteria specified above, 573 item and focal group comparisons (43%) were selected for analysis, all of which were evaluated for the female focal group. The number of items evaluated by grade band in science for gender ranged from 149 in grades 3–5 to 250 in grades 9–12.

Because students taking DLM assessments represent seven possible racial groups (see Chapter 7 of this manual for a summary of participation by race and other demographic variables), up to six comparisons can be made for each item, with the White group as the reference group and each of the other six groups (i.e., African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, two or more races) as the focal group. Across all items, this results in 4,014 possible comparisons. Using the inclusion criteria specified above, 1,849 item and focal group comparisons (46%) were selected for analysis. The number of items evaluated by grade band in science for race ranged from 129 in grades 3–5 to 221 in grades 9–12. Overall, 75 items were evaluated for one racial focal group, two items were evaluated for three racial focal groups, 367 items were evaluated for four racial focal groups, and 60 items were evaluated for five racial focal groups; one racial focal group and the White reference group were used in each comparison. Table 3.20 shows the number of items that were evaluated for each racial focal group.

Across all gender and race comparisons, sample sizes ranged from 259 to 28,416 for gender and from 224 to 23,579 for race.

Table 3.20: Number of Items Evaluated for Differential Item Functioning for Each Race
Focal group Items (n)
African American 504
American Indian 427
Asian 429
Native Hawaiian or Pacific Islander   60
Two or more races 429

Table 3.21 and Table 3.22 show the number and percentage of subgroup combinations that did not meet each of the inclusion criteria for gender and race, respectively, by the linkage level the items assess. A total of 96 items were not included in the DIF analysis for gender for any of the subgroups. Of the 765 item and focal group comparisons that were not included in the DIF analysis for gender, all 765 (100%) had a focal group sample size of less than 100. A total of 165 items were not included in the DIF analysis for race for any of the subgroups. Of the 2,165 item and focal group comparisons that were not included in the DIF analysis for race, 2,157 (>99%) had a focal group sample size of less than 100, 1 (<1%) had an item p-value greater than .95, and 7 (<1%) had a subgroup p-value greater than .97.

Table 3.21: Comparisons Not Included in Differential Item Functioning Analysis for Gender, by Linkage Level
Linkage level   Sample size n (%)   Item proportion correct n (%)   Subgroup proportion correct n (%)
Initial   268 (35.0)   0 (0.0)   0 (0.0)
Precursor   277 (36.2)   0 (0.0)   0 (0.0)
Target   220 (28.8)   0 (0.0)   0 (0.0)
Table 3.22: Comparisons Not Included in Differential Item Functioning Analysis for Race, by Linkage Level
Linkage level   Sample size n (%)   Item proportion correct n (%)   Subgroup proportion correct n (%)
Initial   675 (31.3)   0 (0.0)   1 (14.3)
Precursor   843 (39.1)   0 (0.0)   1 (14.3)
Target   639 (29.6)   1 (100.0)   5 (71.4)

3.3.3.2.1 Uniform Differential Item Functioning Model

A total of 115 items were flagged for evidence of uniform DIF for gender. Additionally, 287 item and focal group combinations across 206 items were flagged for evidence of uniform DIF for race. Table 3.23 and Table 3.24 summarize the total number of combinations flagged for evidence of uniform DIF by grade for gender and race, respectively. The percentage of combinations flagged for uniform DIF ranged from 15% to 25% for gender and from 14% to 17% for race.

Table 3.23: Combinations Flagged for Evidence of Uniform Differential Item Functioning for Gender
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
3–5 23 149 15.4 0
6–8 43 174 24.7 0
9–12 49 250 19.6 0
Table 3.24: Combinations Flagged for Evidence of Uniform Differential Item Functioning for Race
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
3–5   75 523 14.3 2
6–8   90 608 14.8 1
9–12 122 718 17.0 0

For gender, using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender term was added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender term was added to the regression equation.

The results of the DIF analyses for race were similar to those for gender. When using the Zumbo and Thomas (1997) effect-size classification criteria, all but three combinations were found to have a negligible effect-size change after the race term was added to the regression equation. Similarly, when using the Jodoin and Gierl (2001) effect-size classification criteria, all but three combinations were found to have a negligible effect-size change after the race term was added to the regression equation.

Table 3.25 provides information about the flagged items with a nonnegligible effect-size change after the addition of the group term, as represented by a value of B (moderate) or C (large). The test development team reviews all items flagged with a moderate or large effect size. The \(\beta_2G\) values (i.e., the coefficients for the group term) in Table 3.25 indicate which group was favored on the item after accounting for total linkage levels mastered, with positive values indicating that the focal group had a higher probability of success on the item and negative values indicating that the focal group had a lower probability of success on the item. The focal group was favored on one combination.

Table 3.25: Combinations Flagged for Uniform Differential Item Functioning (DIF) With Moderate or Large Effect Size
Item ID Focal Grade EE \(\chi^2\) \(p\)-value \(\beta_2G\) \(R^2\) Z&T* J&G*
55919 African American 3–5 SCI.EE.5.PS1-2   6.88 .009 0.15 .848 C C
49044 African American 3–5 SCI.EE.5.LS1-1 11.38 <.001    −0.23   .771 C C
50121 African American 6–8 SCI.EE.MS.ESS2-6 22.19 <.001    −0.25   .848 C C
Note. EE = Essential Element; \(\beta_2G\) = the coefficient for the group term (Equation (3.3)); Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
* Effect-size measure: A indicates evidence of negligible DIF, B indicates evidence of moderate DIF, and C indicates evidence of large DIF.
3.3.3.2.2 Nonuniform Differential Item Functioning Model

A total of 157 items were flagged for evidence of nonuniform DIF for gender when both the gender and interaction terms were included in the regression equation. Additionally, 324 item and focal group combinations across 228 items were flagged for evidence of nonuniform DIF when both the race and interaction terms were included in the regression equation. Table 3.26 and Table 3.27 summarize the number of combinations flagged by grade. The percentage of combinations flagged ranged from 22% to 36% for gender and from 16% to 19% for race.

Table 3.26: Items Flagged for Evidence of Nonuniform Differential Item Functioning for Gender
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
3–5 39 149 26.2 0
6–8 63 174 36.2 1
9–12 55 250 22.0 1
Table 3.27: Items Flagged for Evidence of Nonuniform Differential Item Functioning for Race
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
3–5   90 523 17.2 2
6–8   96 608 15.8 1
9–12 138 718 19.2 0

Using the Zumbo and Thomas (1997) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but two combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation.

The results of the DIF analyses for race were similar to those for gender. When using the Zumbo and Thomas (1997) effect-size classification criteria, all but three combinations were found to have a negligible effect-size change after the race and interaction terms were added to the regression equation. Similarly, when using the Jodoin and Gierl (2001) effect-size classification criteria, all but three combinations were found to have a negligible effect-size change after the race and interaction terms were added to the regression equation.

Information about the flagged items with a nonnegligible change in effect size after adding both the group and interaction terms is summarized in Table 3.28, where B indicates a moderate effect size and C a large effect size. The test development team reviews all items flagged with a moderate or large effect size. In total, one combination had a moderate effect size and four combinations had a large effect size. Three of the combinations flagged for nonuniform DIF were also flagged under the uniform DIF model. The \(\beta_3\text{X}G\) values in Table 3.28 indicate which group was favored at lower and higher numbers of total linkage levels mastered. A total of three combinations favored the focal group at higher numbers of total linkage levels mastered and the reference group at lower numbers of total linkage levels mastered.

Table 3.28: Combinations Flagged for Nonuniform Differential Item Functioning (DIF) With Moderate or Large Effect Size
Item ID Focal Grade EE \(\chi^2\) \(p\)-value \(\beta_2G\) \(\beta_3\text{X}G\) \(R^2\) Z&T* J&G*
49044 African American 3–5 SCI.EE.5.LS1-1 25.86 <.001    0.10 −0.10   .771 C C
55919 African American 3–5 SCI.EE.5.PS1-2 12.21 .002 −0.36   0.03 .848 C C
50095 Female 6–8 SCI.EE.MS.ESS2-2 57.79 <.001    0.00 0.02 .928 C C
50121 African American 6–8 SCI.EE.MS.ESS2-6 31.75 <.001    0.15 −0.05   .848 C C
70791 Female 9–12 SCI.EE.HS.PS3-4 12.81 .002 −0.83   0.31 .050 A B
Note. EE = Essential Element; \(\beta_2G\) = the coefficient for the group term (Equation (3.3)); Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
* Effect-size measure: A indicates evidence of negligible DIF, B indicates evidence of moderate DIF, and C indicates evidence of large DIF.

3.4 Conclusion

During 2022–2023, the test development teams conducted an on-site item-writing workshop with continued virtual item writing, along with virtual external review meetings. Overall, 100 testlets were written for science. Following external review, the test development team accepted 57% of science testlets as is. Of the content already in the operational pool, most items had standardized difference values within two standard deviations of the mean for their EE and linkage level, three items were flagged for nonnegligible uniform DIF, and five items were flagged for nonnegligible nonuniform DIF. Field-testing in 2022–2023 focused on collecting data to refresh the operational pool of testlets.