School Information System: JUDGING THE QUALITY OF K-12 MATHEMATICS EVALUATIONS

November 30, 2004

JUDGING THE QUALITY OF K-12 MATHEMATICS EVALUATIONS

On Evaluating Curricular Effectiveness: Judging the Quality of K-12 Mathematics Evaluations (2004)
Executive Summary
Curricula play a vital role in educational practice. They provide a crucial link between standards and accountability measures. They shape and are shaped by the professionals who teach with them. Typically, they also determine the content of the subjects being taught. Furthermore, because decisions about curricula are typically made at the local level in the United States, a wide variety of curricula are available for any given subject area. Clearly, knowing how effective a particular curriculum is, and for whom and under what conditions it is effective, represents a valuable and irreplaceable source of information to decision makers, whether they are classroom teachers, parents, district curriculum specialists, school boards, state adoption boards, curriculum writers and evaluators, or national policy makers. Evaluation studies can provide that information but only if those evaluations meet standards of quality.

Under the auspices of the National Research Council, this committee’s charge was to evaluate the quality of the evaluations of the 13 mathematics curriculum materials supported by the National Science Foundation (NSF) at an estimated $93 million and 6 of the commercially generated mathematics curriculum materials (listed in Chapter 2).
The committee was charged to determine whether the currently available data are sufficient for evaluating the effectiveness of these materials and, if these data are not sufficiently robust, the committee was asked to develop recommendations about the design of a subsequent project that could result in the generation of more reliable and valid data for evaluating these materials.

The committee emphasizes that it was not charged with and therefore did not:
• Evaluate the curriculum materials directly; or
• Rate or rank specific curricular programs.
In addressing its charge, the committee held fast to a single commitment: that our greatest contribution would be to clarify the proper elements of an array of evaluation studies designed to judge the effectiveness of mathematics curricula and clarify what standards of evidence would need to be met to draw conclusions on effectiveness.

ASSESSMENT OF EXISTING STUDIES
The committee began by systematically identifying and examining the large array of evaluation studies available on these 19 curricula. In all, 698 studies were found. The first step in our process was to eliminate studies that were clearly not evaluations of effectiveness: those lacking relevance or adequacy for the task (e.g., product descriptions, editorials) (n=281), and those classified as providing background information, historical perspective, or a project update (n=225). We then categorized the remaining 192 studies into the four major evaluation methodologies: content analyses (n=36), comparative studies (n=95), case studies (n=45), and syntheses (n=16). Criteria for judging methodological adequacy, specific to each methodology, were then used to decide whether studies should be retained for further examination by the committee.

Content analyses focus almost exclusively on examining the content of curriculum materials; these analyses usually rely on expert review and judgments about such things as accuracy, depth of coverage, or the logical sequencing of topics. For the 36 studies classified as content analyses, the committee drew on the perspectives of eight prominent mathematicians and mathematics educators, in addition to applying the criterion that a study fully review at least one year of curricular material. All 36 studies of this type were retained for further analysis by the committee.

Comparative studies involve the selection of pertinent variables on which to compare two or more curricula and their effects on student learning over significant time periods. For the 95 comparative studies, the committee stipulated that they had to be “at least minimally methodologically adequate,” which required that a study:

• Include quantifiably measurable outcomes such as test scores, responses to specified cognitive tasks of mathematical reasoning, performance evaluations, grades, and subsequent course taking; and
• Provide adequate information to judge the comparability of samples.
In addition, a study must have included at least one of the following additional design elements:
• A report of implementation fidelity or professional development activity;
• Results disaggregated by content strands or by performance by student subgroups; or
• Multiple outcome measures or precise theoretical analysis of a measured construct, such as number sense, proof, or proportional reasoning.
The application of these criteria led to the elimination of 32 comparative studies.

Case studies focus on documenting how program theories and components of a particular curriculum play out in a particular real-life situation. These studies usually describe in detail the large number of factors that influence implementation of that curriculum in classrooms or schools. Of the 45 case studies, 13 were eliminated, leaving 32 that met our standards of methodological rigor.

Synthesis studies summarize several evaluation studies across a particular curriculum, discuss the results, and draw conclusions based on the data and discussion. All of the 16 synthesis studies were retained for further examination by the committee.

The committee then had a total of 147 studies that met our minimal criteria for consideration of effectiveness, barely more than 20 percent of the total number of submissions with which we began our work. Seventy-five percent of these studies were related to the curricula supported by the National Science Foundation. The remaining studies concerned commercially supported curricular materials.
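The screening arithmetic described above can be checked with a short tally. This is only a sketch: the counts come from the summary itself, while the variable names and the grouping into dictionaries are assumptions made for illustration.

```python
# Counts are taken from the summary above; names are illustrative only.
total_found = 698
not_evaluations = 281   # e.g., product descriptions, editorials
background_only = 225   # background, historical perspective, project updates

# Studies sorted into the four methodologies, and how many of each were
# eliminated by the methodology-specific adequacy criteria.
categorized = {
    "content analyses": 36,
    "comparative studies": 95,
    "case studies": 45,
    "syntheses": 16,
}
eliminated = {
    "content analyses": 0,
    "comparative studies": 32,
    "case studies": 13,
    "syntheses": 0,
}

remaining = total_found - not_evaluations - background_only
assert remaining == sum(categorized.values())  # the four categories cover all 192

retained = {kind: categorized[kind] - eliminated[kind] for kind in categorized}
total_retained = sum(retained.values())
share_retained = total_retained / total_found

print(remaining)                  # 192
print(retained)
print(total_retained)             # 147
print(f"{share_retained:.1%}")    # 21.1%
```

The 21.1 percent share is what the text describes as "barely more than 20 percent" of the 698 submissions.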
On the basis of the committee’s analysis of these 147 studies, we concluded that the corpus of evaluation studies as a whole across the 19 programs studied does not permit one to determine the effectiveness of individual programs with a high degree of certainty, due to the restricted number of studies for any particular curriculum, limitations in the array of methods used, and the uneven quality of the studies.

This inconclusive finding should not be interpreted to mean that these curricula are not effective, but rather that problems with the data and/or study designs prevent confident judgments about their effectiveness. Inconclusive findings such as these do not permit one to determine conclusively whether the programs overall are effective or ineffective.

A FRAMEWORK FOR FUTURE EVALUATIONS
Given this conclusion, the committee turned to the second part of its charge, developing recommendations for future evaluation studies. To do so, the committee developed a framework for evaluating curricular effectiveness. It permitted the committee to compare evaluations and consider how to identify and distinguish among the variety of methodologies employed.

The committee recommends that individuals or teams charged with curriculum evaluations make use of this framework. The framework has three major components that should be examined in each curriculum evaluation: (1) the program materials and design principles; (2) the quality, extent, and means of curricular implementation; and (3) the quality, breadth, type, and distribution of outcomes of student learning over time.

The quality of an evaluation depends on how well it connects these components into a research design and measurement of constructs and carries out a chain of reasoning, evidence, and argument to show the effects of curricular use.

ESTABLISHING CURRICULAR EFFECTIVENESS
The committee distinguished two different aspects of determining curricular effectiveness. First, each individual study should demonstrate that it has obtained a level of scientific validity. In the committee’s view, for a study to be scientifically valid, it should address the components identified in the framework and it should conform to the methodological expectations of the appropriate category of evaluation as discussed in the report (content analysis, comparative study, or case study).

Defining scientific validity for individual studies is an essential element of assuring valid data about curricular effectiveness. However, curricular effectiveness cannot be established by a single scientifically valid study; instead a body of studies is needed, which is the second key aspect of determining effectiveness.

Curricular effectiveness is an integrated judgment based on interpretation of a number of scientifically valid evaluations that combine social values, empirical evidence, and theoretical rationales.
Furthermore, a single methodology, even replications and variations of a study, is inadequate to establish curricular effectiveness, because some types of critical information will be lacking. For example, a content analysis is important because, through expert review of the curriculum content, it provides evidence about such things as the quality of the learning goals or topics that might be missing in a particular curriculum. But it cannot determine whether that curriculum, when actually implemented in classrooms, achieves better outcomes for students. In contrast, a comparative study can provide evidence of improvement in student learning in real classrooms across different curricula. Yet without the kind of complementary evidence provided in a content analysis, nothing will be known about the quality or comprehensiveness of the content in the curriculum that produced better outcomes. Furthermore, neither content analyses nor comparative studies typically provide information about the quality of the implementation of a particular curriculum. A case study provides deep insight into issues of implementation; by itself, though, it cannot establish representativeness or causality.

This conclusion—that multiple methods of evaluation strengthen the determination of effectiveness—led the committee to recommend that a curricular program’s effectiveness should be ascertained through the use of multiple methods of evaluation, each of which is a scientifically valid study. Periodic synthesis of the results across evaluation studies should also be conducted.

This is a general principle for the conduct of evaluations in recognition that curricular effectiveness is an integrated judgment, continually evolving, and based on scientifically valid evaluations. The committee further recognized, however, that agencies, curriculum developers, and evaluators need an explicit standard by which to decide when federally funded curricula (or curricula from other sources whose adoption and use may be supported by federal monies) can be considered effective enough to adopt. The committee proposes a rigorous standard to which programs should be held to be scientifically established as effective.

In this standard, the committee recommends that a curricular program be designated as scientifically established as effective only when it includes a collection of scientifically valid evaluation studies addressing its effectiveness that establish that an implemented curricular program produces valid improvements in learning for students, and when it can convincingly demonstrate that these improvements are due to the curricular intervention. The collection of studies should use a combination of methodologies that meet these specified criteria: (1) content analyses by at least two qualified experts (a Ph.D.-level mathematical scientist and a Ph.D.-level mathematics educator) (required); (2) comparative studies using experimental or quasiexperimental designs, identifying the comparative curriculum (required); (3) one or more case studies to investigate the relationships among the implementation of the curricular program and the program components (highly desirable); and (4) a final report, to be made publicly available, should link the analyses, specify what they convey about the effectiveness of the curriculum, and stipulate the extent to which the program’s effectiveness can be generalized (required). This standard relies on the primary methodologies identified in our review, but we acknowledge the possibility of other configurations, provided they draw on the framework and the definition of scientifically valid studies and include careful review and synthesis of existing evaluations.
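As a rough illustration, the four elements of this standard can be written as a checklist. The sketch below is hypothetical: the `EvidencePortfolio` record and `check_standard` function are invented names, and the check covers only whether each element is present, not whether each study is scientifically valid, which the standard also requires.

```python
from dataclasses import dataclass


@dataclass
class EvidencePortfolio:
    """Hypothetical record of the evidence assembled for one curricular program."""
    content_analysis_experts: int       # qualified experts who produced content analyses
    comparative_studies: int            # experimental or quasi-experimental studies
    comparison_curriculum_named: bool   # comparative curriculum identified
    case_studies: int
    public_final_report: bool


def check_standard(p: EvidencePortfolio):
    """Return (missing_required, missing_desirable) for a portfolio."""
    missing_required = []
    if p.content_analysis_experts < 2:
        missing_required.append("content analyses by at least two qualified experts")
    if p.comparative_studies < 1 or not p.comparison_curriculum_named:
        missing_required.append("comparative study identifying the comparative curriculum")
    if not p.public_final_report:
        missing_required.append("publicly available final report")

    missing_desirable = []
    if p.case_studies < 1:
        missing_desirable.append("one or more case studies")
    return missing_required, missing_desirable


# Example: a portfolio lacking a case study meets every required element
# but is flagged on the "highly desirable" one.
portfolio = EvidencePortfolio(2, 3, True, 0, True)
required_gaps, desirable_gaps = check_standard(portfolio)
print(required_gaps)    # []
print(desirable_gaps)   # ['one or more case studies']
```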

In its review, the committee became concerned about the lack of independence of some of the evaluators conducting the studies; in too many cases, individuals who developed a particular curriculum were also members of the evaluation team, which raised questions about the credibility of the evaluation results. Thus, to ensure the independence and impartiality of evaluations of effectiveness, the committee also recommends that summative evaluations be conducted by independent evaluation teams with no membership by authors of the curriculum materials or persons under their supervision.
In the body of this report, the committee offers additional recommended practices for evaluators, which include:

Representativeness. Evaluations of curricular effectiveness should be conducted with students who represent an appropriate sampling of all intended audiences.

Documentation of implementation. Evaluations should present evidence that provides reliable and valid indicators of the extent, quality, and type of the implementation of the materials. At a minimum, there should be documentation of the extent of coverage of curricular material (what some investigators referred to as “opportunity to learn”) and of the extent and type of professional development provided.

Curricular validity of measures. A minimum of one of the outcome measures used to determine curricular effectiveness should possess demonstrated curricular validity. It should comprehensively sample the curricular objectives in the course, validly measure the content within those objectives, ensure that teaching to the test (rather than the curriculum) is not feasible or likely to confound the results, and be sensitive to curricular changes.

Multiple student outcome measures. Multiple forms of student outcomes should be used to assess the effectiveness of a curricular program. Measures should consider persistence in course taking, drop-out or failure rates, as well as multiple measures of a variety of the cognitive skills and concepts associated with mathematics learning.
Furthermore, the committee offers recommendations about how to strengthen each of the three major curriculum evaluation methodologies.

Content analyses. A content analysis should clearly indicate the extent to which it addresses the following three dimensions:
1. Clarity, comprehensiveness, accuracy, depth of mathematical inquiry and mathematical reasoning, organization, and balance (disciplinary perspectives).
2. Engagement, timeliness and support for diversity, and assessment (learner-oriented perspectives).
3. Pedagogy, resources, and professional development (teacher- and resource-oriented perspectives).

In considering these dimensions, evaluators should provide specific evidence of each to support their judgments. A content analysis should be acknowledged as a connoisseurial assessment and should include identified credentials and statements of preference and bias of the evaluators.

Comparative analyses. As a result of our study of the set of 63 at least minimally methodologically adequate comparative analyses, the committee recommends that in the conduct of all comparative studies, explicit attention be given to the following criteria:

• Identify comparative curricula by name;
• Employ random assignment, or otherwise establish adequate comparability;
• Select the appropriate unit of analysis;
• Document extent of implementation fidelity;
• Select outcome measures that can be disaggregated by content strand;
• Conduct appropriate statistical tests and report effect size;
• Disaggregate data by gender, race/ethnicity, socioeconomic status (SES), and performance levels, and express constraints as to the generalizability of study.
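To make the "report effect size" criterion concrete, here is a minimal sketch of one common effect-size measure, Cohen's d with a pooled standard deviation. The summary does not say which measure evaluators should use, so the choice of Cohen's d is an assumption, and the scores below are invented for illustration. In a real study the inputs would be values at the chosen unit of analysis (for example, class means rather than individual scores if classes were the unit).

```python
import math
import statistics


def cohens_d(treatment, comparison):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(comparison)
    v1 = statistics.variance(treatment)    # sample variance (n - 1 denominator)
    v2 = statistics.variance(comparison)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(comparison)) / pooled_sd


# Hypothetical scores, not data from any study in the report.
experimental_scores = [72, 75, 78, 81, 84]
comparison_scores = [70, 73, 76, 79, 82]

d = cohens_d(experimental_scores, comparison_scores)
print(round(d, 3))   # 0.422
```

Reporting a value such as d alongside the significance test lets readers judge the size of a difference, not only whether one was detected.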

The committee recognized the need to strengthen the conduct of comparative studies in relation to the criteria listed above. It also recognized that much could be learned from the subgroup (n=63) identified as “at least minimally methodologically adequate.” In fields in their infancy, evaluators and researchers must pry apart issues of method from patterns of results. Such a process requires one to subject the studies to alternative interpretation; to test results for sensitivity or robustness to changes in design; to tease out, among the myriad variables, the ones most likely to produce, interfere with, suppress, modify, and interact with the outcomes; and to build on results of previous studies. To fulfill the charge to inform the conduct of future studies, in Chapter 5 the committee designed and conducted methods to test the patterns of results under varying conditions, and to determine which patterns were persistent or ephemeral. We used these analyses as a baseline to investigate the question, Does the application of increasing standards of rigor have a systemic effect on the results?

In doing so, we report the patterns of results separately for evaluations of NSF-supported and commercially generated programs because NSF-supported programs had a common set of design specifications including consistency with the National Council of Teachers of Mathematics (NCTM) Standards, reliance on manipulatives, drawing topics from statistics, geometry, algebra and functions, and discrete mathematics at each grade level, and strong use of calculators and computers. The commercially supported curricula sampled in our studies varied in their use of these curricular approaches; further subdivisions of these evaluations are also presented in the report. The differences in the specifications of the two groups of programs make their evaluative procedures, and hence the validation of those procedures, so unlike each other that combining them into a single category could be misleading.

One approach taken was to filter studies by separating those that met a particular criterion of rigor from those that did not, and to study the effects of that filter on the pattern of results as quantified across outcome measures into the proportion of findings that were positive, negative, or indeterminate (no significant difference). First, we found that on average the evaluations of the NSF-supported curricula (n=46) in this subgroup had reported stronger patterns of outcomes in favor of the experimental curricula than had the evaluations of commercially generated curricula (n=17). Again we emphasize that due to our call for increased methodological rigor and the use of multiple methods, this result is not sufficient to establish the curricular effectiveness of these programs as a whole with adequate certainty. However, this result does provide a testable hypothesis, a starting point for others to examine, critique, and undertake further studies to confirm or disconfirm. Then, after applying the criteria listed above, we found that the comparative studies of both NSF-supported and commercially generated curricula that had used the more rigorous criteria never produced contrary conclusions about curricular effectiveness (compared with less rigorous methods). Furthermore, when the use of more rigorous criteria did lead to significantly different results, these results tended to show weaker findings about curricular effects on student learning. Hence, this investigation reinforced the importance of methodological rigor in drawing appropriate inferences of curricular effectiveness.
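The filtering approach described above can be sketched in a few lines. This is a hedged illustration only: the study records, the `reports_fidelity` criterion name, and the findings are all hypothetical and are not drawn from the report. The sketch simply shows how findings might be pooled by whether a study met one criterion of rigor and then expressed as proportions.

```python
from collections import Counter

LABELS = ("positive", "negative", "indeterminate")


def finding_proportions(findings):
    """Share of findings that are positive, negative, or indeterminate."""
    counts = Counter(findings)
    total = len(findings)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


def compare_by_filter(studies, criterion):
    """Pool findings from studies that met `criterion` and from those that did not."""
    met = [f for s in studies if s[criterion] for f in s["findings"]]
    unmet = [f for s in studies if not s[criterion] for f in s["findings"]]
    return finding_proportions(met), finding_proportions(unmet)


# Hypothetical studies, invented for illustration.
studies = [
    {"reports_fidelity": True,
     "findings": ["positive", "indeterminate", "indeterminate", "negative"]},
    {"reports_fidelity": False,
     "findings": ["positive", "positive", "indeterminate", "positive"]},
    {"reports_fidelity": True,
     "findings": ["positive", "positive", "indeterminate", "indeterminate"]},
]

met, unmet = compare_by_filter(studies, "reports_fidelity")
print(met)     # {'positive': 0.375, 'negative': 0.125, 'indeterminate': 0.5}
print(unmet)   # {'positive': 0.75, 'negative': 0.0, 'indeterminate': 0.25}
```

Comparing the two dictionaries shows how a single filter shifts the pattern of results; repeating the comparison across several criteria is one way to see which patterns persist.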

Case studies. Case studies should meet the following criteria:
• Stipulate clearly what they are cases of, how claims are produced and backed by evidence, and what events are related or left out and why; and
• Identify explicit underlying mechanisms to explain a rich variety of research evidence.
The case studies should provide documentation that the implementation and outcomes of the program are closely aligned and consistent with the curricular program components and add to the trustworthiness of implementation and to the comprehensiveness and validity of the outcome measures.

The committee recognizes the value of diverse curricular options and finds continuing experimentation in curriculum development to be essential, especially in light of changes in the conduct and use of mathematics and technology. However, it should be accompanied by rigorous efforts to improve our conduct of evaluation studies, strengthening the results by learning from previous efforts.

RECOMMENDATIONS TO FEDERAL AGENCIES, STATE AND LOCAL DISTRICTS AND SCHOOLS, AND PUBLISHERS
Responsibility for curricular evaluation is shared among three primary bodies: the federal agencies that develop curricula, publishers, and state and local districts and schools. All three bodies can and should use the framework and guidelines in designing evaluation programs, sponsoring appropriate data collections, reviewing evaluation proposals, and assessing evaluation studies. The committee has identified several short- and long-term actions that these bodies can take to do so.

At the federal level, such actions include:
• Specifying more explicit expectations in requests for proposals for evaluation of curricular initiatives and increasing sophistication in methodological choices and quality;
• Denying continued funding for major curricular programs that fail to present evaluation data from well-designed, scientifically valid studies;
• Charging a federal agency with responsibility for collecting and maintaining district- and school-level data on curricula; and
• Providing training, in concert with state agencies, to district and local agencies on conducting and interpreting studies of curricular effectiveness.
For publishers, such actions include:
• Differentiating market research from scientifically valid evaluation studies; and
• Making evaluation data available to potential clients who use federal funds to purchase curriculum materials.
At the state level, such actions include:
• Developing resources for district- and state-level collection and maintenance of data on issues of curricular implementation; and
• Providing districts with training on how to conduct feasible, cost-efficient, and scientifically valid studies of curricular effectiveness.
At the district and local levels, such actions include:
• Improving methods of documenting curricular use and linking it to student outcomes;
• Maintaining careful records of teachers’ professional development activities related to curricula and content learning; and
• Systematically ensuring that all study participants have had fair opportunities to learn sufficient curricular units, especially under conditions of student mobility.
Finally, the committee believes there is a need for multidisciplinary basic empirical research studies on curricular effectiveness. The federal government and publishers should support such studies on topics including, but not limited to:
• The development of outcome measures at the upper level of secondary education and at the elementary level in non-numeration topics that are valid and precise at the topic level;
• The interplay among curricular implementation, professional development, and the forms of support and professional interaction among teachers and administrators at the school level;
• Methods of observing and documenting the type and quality of instruction;
• Methods of parent and community education and involvement; and
• Targets of curricular controversy such as the appropriate uses of technology; the relative use of analytic, visual, and numeric approaches; or the integration or segregation of the treatment of subfields, such as algebra, geometry, statistics, and others.
The committee recognizes the complexity and urgency of the challenge the nation faces in establishing effectiveness of mathematics curricula, and argues that we should avoid seemingly attractive, but oversimplified, solutions. Although the corpus of evaluation studies is not sufficient to directly resolve the debates on curricular effectiveness, we believe that in the controversy surrounding mathematics curriculum evaluation, there is an opportunity. This opportunity should not be missed to forge solutions through negotiation of perspective, to base our arguments on empirical data informed by theoretical clarity and careful articulation of values, and to build in an often-missing measure of coherence to curricular choice, and feedback from careful, valid, and rigorous study. Our intention in presenting this report is to help take advantage of that opportunity.

On Evaluating Curricular Effectiveness
JUDGING THE QUALITY OF K-12 MATHEMATICS EVALUATIONS
Committee for a Review of the Evaluation Data on the Effectiveness of NSF-Supported and Commercially Generated Mathematics Curriculum Materials
Mathematical Sciences Education Board
Center for Education
Division of Behavioral and Social Sciences and Education
NATIONAL RESEARCH COUNCIL OF THE NATIONAL ACADEMIES
THE NATIONAL ACADEMIES PRESS
Washington, D.C. www.nap.edu

Posted by Barb Schrank at November 30, 2004 03:37 PM