Standards, Accountability, and School Reform

This is very long, and the link may require a password so I’ve posted the entire article on the continued page.
TJM
http://www.tcrecord.org/PrintContent.asp?ContentID=11566
Standards, Accountability, and School Reform
by Linda Darling-Hammond — 2004
The standards-based reform movement has led to increased emphasis on tests, coupled with rewards and sanctions, as the basis for “accountability” systems. These strategies have often had unintended consequences that undermine access to education for low-achieving students rather than enhancing it. This article argues that testing is information for an accountability system; it is not the system itself. More successful outcomes have been secured in states and districts, described here, that have focused on broader notions of accountability, including investments in teacher knowledge and skill, organization of schools to support teacher and student learning, and systems of assessment that drive curriculum reform and teaching improvements.

The education reform movement in the United States has focused increasingly on the development of new standards for students: Virtually all states have begun the process of creating standards for student learning, new curriculum frameworks to guide instruction, and new assessments to test students’ knowledge. School districts across the country have weighed in with their own versions of standards-based reform, including new curricula, testing systems, accountability schemes, and promotion or graduation requirements.
The rhetoric of these reforms is appealing. Students cannot succeed in meeting the demands of the new economy if they do not encounter much more challenging work in school, many argue, and schools cannot be stimulated to improve unless the real accomplishments, or deficits, of their students are raised to public attention. There is certainly merit to these arguments. But will standards and tests improve schools or create educational opportunities where they do not now exist? What evidence do we have about the success of standards-based reform strategies, especially for the students in America’s urban school systems where educational needs are greatest? In this paper I review evidence about the outcomes of different approaches to standards-based reform in states and districts across the country with an eye toward evaluating whether and how they improve educational opportunities and student learning.
ALTERNATIVE VIEWS OF STANDARDS-BASED REFORM
Some proponents of standards-based reforms have envisioned that standards that express what students should know and be able to do would spur other reforms that mobilize more resources for student learning, including high quality curriculum frameworks, materials, and assessments tied to the standards; more widely available course offerings that reflect this high quality curriculum; more intensive teacher preparation and professional development guided by related standards for teaching; more equalized resources for schools; and more readily available safety nets for educationally needy students (O’Day & Smith, 1993). For others, the notions of standards and accountability have become synonymous with mandates for student testing which may have little connection to policy initiatives that directly address the quality of teaching, the allocation of resources, or the nature of schooling (see, e.g., Educate America, 1991).
In addition to these differences, distinct change theories have emerged around the idea of standards-based reform. Some argue that standards for learning and teaching should be used primarily to inform investments and curricular changes that will strengthen schools. They see the major problem as a need for teacher, school, and system learning about more effective practice combined with more equal and better-targeted resource allocation. Others argue that standards can motivate change only if they are used to apply sanctions to those who fail to meet them. They see the major problem as a lack of effort and focus on the part of educators and students.
Policy makers who endorse the latter view have emphasized high-stakes testing (that is, the use of scores on achievement tests to make decisions that have important consequences for examinees and others) as a primary strategy to promote accountability. Some high-stakes decisions affect students, such as the use of test scores for promotion, tracking, and graduation. Others affect teachers and principals when scores are used to determine merit pay or potential dismissal. Still others affect schools, as when schools are awarded recognition or extra funds when scores increase or are put into intervention status or threatened with loss of registration when scores are low. Some policies take into account differences in the initial performance of students and in the many nonschool factors that can affect achievement. Some do not, holding schools to similar standards despite dissimilar student populations and resources.
Many questions arise from this policy strategy. Will investments in better teaching, curriculum, and schooling follow the press for new standards? Or will standards and tests built upon a foundation of continued inequality simply certify student failure more visibly and reduce access to future education and employment? In states where standards accompanied by high-stakes tests have been imposed without addressing inequalities in access to qualified teachers and appropriate curriculum, a new generation of equity lawsuits has emerged. Litigation in California, Florida, New York, and elsewhere has followed on the heels of recently successful “adequacy” lawsuits in Alabama and New Jersey.
A growing body of research has found unintended consequences of high-stakes tests. Some studies have found that high-stakes tests can narrow the curriculum, pushing instruction toward lower order cognitive skills, and can distort scores (Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998; Koretz, Linn, Dunbar, & Shepard, 1991; Linn, 2000; Linn, Graue, & Sanders, 1990; Stecher, Barron, Kaganoff, & Goodwin, 1998). In addition, grade retention as a response to low test scores appears not to improve educational achievement for those who are held back and increases their likelihood of dropping out (Hauser, 1999). Finally, there is evidence that high-stakes tests that reward or sanction schools based on average student scores can create incentives for pushing low-scorers into special education, holding them back in the grades, and encouraging them to drop out so that schools’ average scores will look better (Allington & McGill-Franzen, 1992; Darling-Hammond, 1991, 1992; Figlio & Getzler, 2002; Haney, 2000; Koretz, 1988; Shepard & Smith, 1988; Smith et al., 1986). School rankings tied to test scores have sometimes punished schools for accepting and keeping students with high levels of special needs and rewarded them for keeping such students out of their programs through selective admissions, transfer, and even push out policies (Smith et al., 1986).
In a recent paper citing concerns about the negative outcomes of test-based promotion and graduation policies, Robert Hauser (1999) voiced skepticism about whether many states’ or districts’ high-stakes testing policies are likely to result in positive consequences for students:
It is possible to imagine an educational system in which test-based promotion standards are combined with effective diagnosis and remediation of learning problems, yet past experience suggests that American school systems may not have either the will or the means to enact such fair and effective practices. Such a system would include well-designed and carefully aligned curricular standards, performance standards, and assessments. Teachers would be well trained to meet high standards in their classrooms, and students would have ample notice of what they are expected to know and be able to do. Students with learning difficulties would be identified years in advance of high-stakes deadlines, and they and their parents and teachers would have ample opportunities to catch up before deadlines occur. Accountability for student performance would not rest solely or even primarily on individual students, but also, collectively, on educators and parents. There is no positive example of such a system in the United States, past or present, whose success is documented by credible research. (p. 3)
Hauser’s concerns appear apt, given the research on such policies that has been available to date. In this paper, I review additional data on the outcomes of test-based accountability systems. I also examine research on urban districts that have substantially improved their students’ performance by focusing on the improvement of teaching (by attending to professional accountability) rather than on sanctions for students (by emphasizing test-based accountability). In the course of this article, I argue for a broader conception of accountability that examines whether the actions undertaken by policymakers in fact produce better quality education and higher levels of learning for a greater share of students and whether they work to address shortcomings in children’s opportunities to learn.
TYPES OF EDUCATIONAL ACCOUNTABILITY
To expand our frame for examining accountability, it may be useful to recognize that there are many different conceptions of accountability that have influenced U.S. education policy and interact with one another in today’s systems. They include at least the following:

Political accountability: Legislators and school board members, for example, must regularly stand for election and answer for their decisions.
Legal accountability: Schools are to operate in accord with legislation, and citizens can ask the courts to hear complaints about the public schools’ violation of laws.
Bureaucratic accountability: Federal, state, and district offices promulgate rules and regulations intended to ensure that schooling takes place according to set procedures.
Professional accountability: Teachers and other staff are expected to acquire specialized knowledge, meet standards for entry, and uphold professional standards of practice in their work.
Market accountability: Parents and students may in some cases choose the courses or schools they believe are most appropriate (Darling-Hammond, 1989).

All of these accountability mechanisms have their strengths and limitations, and each is more or less appropriate for certain goals. Political mechanisms can help establish general policy directions, but they do not allow citizens to judge each decision by elected officials, and they do not necessarily secure the rights of minorities. Legal mechanisms are useful in establishing and defending rights, but not everything is subject to court action and not all citizens have access to the courts. Bureaucratic mechanisms are appropriate when standard procedures will produce desired outcomes, but they can be counterproductive when clients have unique needs that require differential responses by those who must make non-routine decisions. Professional mechanisms are important when services require complex knowledge and decision making to meet clients’ individual needs, but they do not always take competing public goals (e.g., cost containment) into account. Market mechanisms are helpful when consumer preferences vary widely and the state has no direct interest in controlling choice, but they do not ensure that all citizens will have access to services of a given quality.
Because of these limits, no single form of accountability operates alone in any major area of public life. The choices of accountability tools, and the balance among different forms of accountability, are constantly shifting as problems emerge, as social goals change, and as new circumstances arise. In most urban public school systems, legal and bureaucratic accountability strategies have predominated over the last 20 or more years. These have especially focused on attempts to manage schooling through standardized educational procedures, prescribed curriculum and texts, and test-based accountability strategies, often tied to tracking and grouping decisions that are meant to determine the programs students will receive.
Few have experimented with market accountability until very recently. Most notable among them are New York City, which launched more than 150 small schools of choice in the 1990s to add to the many dozens that existed before that time, and Cambridge, Massachusetts, which has had a system of choice-based schools for more than 15 years. Finally, a very few urban districts have launched well-developed professional accountability strategies tied to standards for teaching as well as student learning. New York City’s District #2; New Haven, California; and several cities in Connecticut, a state that launched a highly successful statewide reform focused on teaching quality, are among these and are described later.
STANDARDS AS ASSESSMENT: ATTEMPTS TO CREATE ACCOUNTABILITY THROUGH HIGH-STAKES TESTING
Since the mid-1800s, urban school systems have periodically used student test scores to allocate rewards or sanctions to schools or teachers. (For historical accounts, see Callahan, 1962; Tyack, 1974.) Many states and districts have approached standards-based reform through this familiar strategy, claiming to implement new standards even when the tests are not aligned to the standards and when students are not assured of receiving qualified teachers, curriculum aligned with the standards, or schools organized to support them. “Standards-based reform strategies” that have used test scores as the basis for promoting students from grade to grade, determining program placements (e.g., to compensatory or gifted and talented classes), and making graduation decisions have received a great deal of publicity in the mid- to late-1990s as “new” reforms; however, they replicate policies that have come and gone many times before.
In contrast to schools in most European and Asian countries, U.S. schools have a long tradition of retaining students in a grade if they seem not to be succeeding at school. It has been estimated that the United States has an overall retention rate of 15–20% of its students annually (most of them at-risk students in central cities), placing U.S. public schools on a par with countries like Haiti or Sierra Leone and in stark contrast with countries like Japan, which has less than a 1% rate of grade retention, and European nations that bar grade retention (Smith & Shepard, 1987; Hauser, 1999). During the early 1980s, grade retentions increased as school districts instituted policies that linked standardized test scores to student promotion and placement decisions. Many of these policies failed and were repealed by the late 1980s, only to be reinstated less than a decade later.
For example, New York City experienced many of the problems associated with grade retention when the Promotional Gates Program was put in place in elementary and junior high schools during the early 1980s. At that time, gateways in grades four and seven were created through which students could pass only if they demonstrated a specified level of performance on the standardized citywide reading and mathematics tests. Students who did not meet the minimum standards were retained, sometimes repeatedly, until they were able to achieve the necessary score on the tests. Instead of strengthening most students’ academic performance, however, the program created cohorts of students who had been retained repeatedly without learning gains; sometimes they had been held back for so long that their advanced age and physical size led to increased misbehavior and decreased achievement for both the retained students and others in their classrooms. The students retained had lower achievement, a greater incidence of disciplinary difficulties, and higher dropout rates than students at similar achievement levels who had previously been promoted. A district study found that 40% of the students retained in seventh grade had dropped out within 4 years, as compared to 25% of a comparison group, and that, while those who received intensive services in the Gates year improved their achievement temporarily, neither the services nor the students’ progress was sustained (New York City Division of Assessment and Accountability, 2001). Eventually, in the face of national and local evidence about the failures of this approach, the program was ended by Chancellor Fernandez in the late 1980s (Gampert & Opperman, 1988).
A decade later, with no sense of irony or institutional memory, the New York Times reported in September 1999 that 21,000 students would be held back under the City’s “new” policy to end social promotion (Wasserman, 1999). Two weeks later the newspaper reported that the social promotion policy was in disarray as two-thirds of the 35,000 students forced to take summer school still did not pass the tests and, further, that 4,500 students’ test scores had been misreported and as many as 3,000 had been forced to take summer school by mistake (Hartocollis, 1999). Similar news headlines appeared in Los Angeles, where a policy to “end social promotion” resulted in more than 10,000 students being threatened with grade retention, only to find that the schools could not accurately identify who had passed or failed and could not find qualified teachers to teach the summer school programs that were supposed, miraculously, to catch these students up. The New York City Division of Assessment and Accountability (2001) has noted that a sharp increase in dropout rates between the classes of 1998 and 2000 (from 15.6% to 19.3% of each class) is likely a function of both the “new” city promotional standards and the state’s new test-based graduation requirements.
These outcomes have been replicated in other recent test-based promotion and graduation reforms. For example, the much publicized Chicago effort, which sought to end social promotion by requiring test passage at Grades 3, 6, and 8, appears to have failed to improve the learning of the thousands of students it retained. In the first two years under the policy, more than one-third of third, sixth, and eighth graders failed to meet the promotional test cutoffs by the end of the school year. Despite the fact that there were large-scale waivers for students with limited English proficiency and special education students, more than 20,000 students were retained in grade in 1997 and 1998, during the first two years of the program. Although average test scores improved, an evaluation by the Consortium on Chicago School Research concluded that:
Retained students did not do better than previously socially promoted students. The progress among retained third graders was most troubling. Over the two years between the end of second grade and the end of the second time through third grade, the average ITBS reading scores of these students increased only 1.2 GEs (grade equivalents) compared to 1.5 GEs for students with similar test scores who had been promoted prior to the policy. Also troubling is that one-year dropout rates among eighth graders with low skills are higher under this policy. . . . In short, Chicago has not solved the problem of poor performance among those who do not meet the minimum test cutoffs and are retained. Both the history of prior attempts to redress poor performance with retention and previous research would clearly have predicted this finding. Few studies of retention have found positive impacts, and most suggest that retained students do no better than socially promoted students. The CPS policy now highlights a group of students who are facing significant barriers to learning and are falling farther and farther behind. (Roderick, Bryk, Jacob, Easton, & Allensworth, 1999, pp. 55–56)
These findings confirm those of a substantial body of research that has demonstrated that retaining students does not appear to help them catch up with peers and succeed in school; however, it does contribute to high rates of academic failure and behavioral difficulties. Studies comparing the learning gains of students who were retained with those of academically comparable students who were promoted have typically found that retained students actually achieve less than their comparable peers who move on through the grades. Students do not appear to benefit academically from grade retention regardless of the grade level or the student’s initial achievement level (for reviews, see Baenen, 1988; Holmes & Matthews, 1984; Illinois Fair Schools Coalition, 1985; Labaree, 1984; Meisels, 1992; Oakes & Lipton, 1990; Ostrowski, 1987). Shepard and Smith (1986) conclude in their review of research: “Contrary to popular beliefs, repeating a grade does not help students gain ground academically and has a negative impact on social adjustment and self-esteem” (p. 86).
When students who were retained in a grade are compared with students of equal achievement levels who were promoted, the retained students consistently suffer poorer self-concepts, have more problems of social adjustment, and express more negative attitudes toward school at the end of the period of retention than do similar students who are promoted (Eads, 1990; Holmes & Matthews, 1984; Illinois Fair Schools Coalition, 1985; Shepard & Smith, 1988; Walker & Madhere, 1987).
In addition, many studies have found that grade retention increases dropout rates (Anderson, 1998; Hess, 1986; Hess, Ells, Prindle, Liffman, & Kaplan, 1987; Safer, 1986; Smith & Shepard, 1987; Temple, Reynolds, & Miedel, 1998). Researchers have found that the odds of dropping out increase significantly for retained students, with estimates ranging from 70% (Anderson, 1998) to as much as 250% (Rumberger & Larson, 1998) above those of similar students who were not retained.
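To make the arithmetic behind an increase of "70% to 250% above" concrete, the following is a minimal sketch in Python. It assumes the figures are relative increases in odds; the baseline dropout probability of 10% and the function name are hypothetical, chosen only for illustration, and are not taken from the studies cited.

```python
def prob_after_relative_odds_increase(baseline_prob, pct_above):
    """Return the probability implied by raising the odds by pct_above percent.

    baseline_prob: hypothetical dropout probability for non-retained students.
    pct_above: relative increase in odds, e.g. 70 means odds are 1.7 times baseline.
    """
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * (1.0 + pct_above / 100.0)
    # Convert odds back to a probability.
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline: 10% of comparable, non-retained students drop out.
low = prob_after_relative_odds_increase(0.10, 70)    # odds 1.7 times baseline
high = prob_after_relative_odds_increase(0.10, 250)  # odds 3.5 times baseline
print(f"{low:.3f} to {high:.3f}")
```

Under this assumed baseline, odds that are 250% above baseline are 3.5 times the baseline odds, not 2.5 times. If the cited figures refer to probabilities rather than odds, the multiplication would apply to the probability directly.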
The notion of holding students back is a crude remedy for educational problems derived from the factory assembly line model of schooling developed during the early years of the twentieth century: The assumption was that a sequenced set of procedures would be implemented as a child moved along the conveyor belt from 1st to 12th grade. If a particular set of procedures didn’t “take,” the procedures should be repeated until the child was properly “processed.” There are a number of reasons why grade retention is not generally a productive answer to low achievement, however. First, students develop at very different rates, and in the early grades the wide range of development that produces many of the differences in achievement measures evens out by about third or fourth grade. However, students who are held back often develop a conception of themselves as incapable, which then often becomes a self-fulfilling prophecy as it affects their motivation and willingness to attempt difficult tasks. Second, if there is a real problem with a student’s learning, wholesale grade retention does not typically lead to diagnosis of special learning needs or the use of more appropriate teaching strategies targeted to those needs. Finally, grade retention does not address system problems of poor teaching; nor does it promise better teaching in the subsequent year. In fact, low-achieving students are generally assigned to the least experienced and qualified teachers, exacerbating their learning difficulties.
Generally, the premise of grade retention as a solution for poor performance is that the problem, if there is one, resides in the child, rather than in the school setting. Rather than looking carefully at classroom practices and student needs when students are not achieving, schools send students back to repeat the same experience over again. Very little is done to ensure that the experience will be either higher in quality or more appropriate for the individual needs of the child. In short, grade retention provides little accountability for the quality of the educational experience students receive.
While it is certainly true that both students and their parents bear a measure of accountability for attending school, putting forth effort, and striving to meet expectations (and policies that set standards appropriately seek to mobilize those efforts), it is important for accountability policies to fairly assess what children and parents can do and what the system must do to enable successful efforts. This is especially important given the clear evidence that children in the United States receive dramatically unequal access to high-quality curriculum and teaching, and that these differentials are strongly related to their achievement (see Darling-Hammond, 1997, for a review).
Despite the rhetoric of American equality, the school experiences of students of color in the United States continue to be substantially separate and unequal. More than two thirds of “minority” students attend predominantly minority schools, and one third of Black and Latino students attend intensely segregated schools (i.e., 90% or more minority enrollment), most of which are in central cities (Orfield & Gordon, 2001). Currently, about two thirds of all students in central city schools are Black or Hispanic (National Center for Education Statistics, 1997a). This concentration facilitates inequality. Not only do funding systems and tax policies leave most urban districts with fewer resources than their suburban neighbors, but schools with high concentrations of low-income and “minority” students receive fewer resources than other schools within these districts. And tracking systems exacerbate these inequalities by segregating many “minority” students within schools, allocating still fewer educational opportunities to them at the classroom level.
In their review of resource allocation studies, MacPhail-Wilcox and King (1986) summarized the resulting situation as follows:
School expenditure levels correlate positively with student socioeconomic status and negatively with educational need when school size and grade level are controlled statistically. . . . Teachers with higher salaries are concentrated in high income and low minority schools. Furthermore, pupil-teacher ratios are higher in schools with larger minority and low-income student populations. . . . Educational units with higher proportions of low-income and minority students are allocated fewer fiscal and educational resources than are more affluent educational units, despite the probability that these students have substantially greater need for both. (p. 425)
The situation has not improved in most states over the last decade and has grown substantially worse in some, as recent lawsuits challenging inequalities in Alabama, California, Louisiana, New Jersey, New York, and elsewhere have demonstrated. In combination, policies associated with school funding, resource allocations, and tracking leave poor and minority students with fewer and lower quality books, curriculum materials, laboratories, and computers; significantly larger class sizes; less qualified and experienced teachers; and less access to high quality curriculum. The fact that the least qualified teachers typically end up teaching the least advantaged students is particularly problematic, given recent studies that have found that teacher quality is one of the most important determinants of student achievement (for a review, see Darling-Hammond, 2000). Low-income and minority students are least likely to receive well-qualified, highly effective teachers (National Center for Education Statistics, 1997a; Sanders & Rivers, 1996). Some evidence suggests that differences in the quality of teachers available to poor and minority children may explain nearly as much of the variance in student achievement as socioeconomic status (Ferguson, 1991; Strauss & Sawyer, 1986).
Unequal access to qualified teachers exacerbates the disparate effects of test-based promotion and graduation policies. Nationally, retention rates for low-income children are at least twice those for high-income students. Students who are retained in grade are disproportionately drawn from racial and ethnic minorities and from populations whose dominant language is other than English (Illinois Fair Schools Coalition, 1985; Shepard & Smith, 1986; Walker & Madhere, 1987). Thus, the students who receive the scantiest resources, the least qualified teachers, the poorest physical facilities, and the most restricted access to quality learning opportunities are supposed to be “fixed” by being held back.
The Chicago study noted that the failure to invest in improved teaching was an unrecognized problem in the city’s reform strategy, which had tried to rely on a highly scripted centrally developed curriculum (which by design assumes, inaccurately, that students learn in the same ways and at the same pace) and grade retention as its major tools. The authors noted: “Thus the administration has worked to raise test scores among low-performing students without having to address questions regarding the adequacy of instruction during the school day or spend resources to increase teachers’ capacity to teach and to meet students’ needs more successfully” (Roderick et al., 1999, p. 57).
Where the failure to learn is a result of inadequate teaching and where the system’s primary response is to require children to experience that inadequate teaching again, it is doubtful that such a policy increases the system’s accountability to parents and students. The educational system’s accountability to the greater society is also reduced when a side-effect of the policy is that large numbers of students drop out of school, thus creating a societal burden of undereducated youth who are unable to function in the labor market and who increasingly join the welfare or criminal justice systems rather than the productive economy. Society as a whole does not benefit from school policies that claim to heighten accountability by pushing low achievers out of school to make test scores look better (a result that has been documented in several studies) or by failing to offer education that enables these students to learn.
INSTITUTIONAL RESPONSES TO TEST-BASED INCENTIVES
Unfortunately, most cities and states have used test-based reform strategies that rely on cross-sectional measures of student scores for different populations of students (e.g., average scores for eighth graders in a given year are compared to average scores for a different group of eighth graders in the prior year), rather than longitudinal assessments of student gains for students who remained in a given school over a period of time. Because schools’ average scores on any measure are sensitive to changes in the population of students taking the test, and such changes can be induced by manipulating admissions, dropouts, and pupil classifications, policies that use schools’ average scores for allocating sanctions have been found to result in several unintended negative consequences. As noted earlier, these include labeling low-scoring students for special education placements so that their scores won’t “count” in school reports, retaining students in grade so that their relative standing will look better on “grade-equivalent” scores, excluding low-scoring students from admission to “open enrollment” schools, and encouraging such students to leave schools or drop out. This occurs because the policies create incentives for schools to keep out of the testing pool, or the school itself, students who will lower the average scores. Smith and colleagues explained the widespread engineering of student populations that they found in their study of New York City’s implementation of performance standards as a basis for school level sanctions:
(S)tudent selection provides the greatest leverage in the short-term accountability game. . . . The easiest way to improve one’s chances of winning is (1) to add some highly likely students and (2) to drop some unlikely students, while simply hanging on to those in the middle. School admissions is a central thread in the accountability fabric. (Smith et al., 1986, pp. 30–31)
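The difference between a cross-sectional comparison of averages and a longitudinal measure of gains can be illustrated with a small sketch in Python. The scores, student labels, and function names below are hypothetical, invented purely for illustration, and are not drawn from Smith et al. (1986) or from any actual school data.

```python
from statistics import mean

# Hypothetical year-1 scores for five students (invented for illustration).
year1 = {"A": 40, "B": 50, "C": 60, "D": 70, "E": 80}

# Hypothetical year 2: no remaining student's score changes, but the two
# lowest scorers (A and B) are no longer in the testing pool, for example
# because they were reclassified, retained, or left the school.
year2 = {"C": 60, "D": 70, "E": 80}


def cross_sectional_change(before, after):
    """Change in the school average, ignoring who was actually tested."""
    return mean(after.values()) - mean(before.values())


def longitudinal_gain(before, after):
    """Average score change for students tested in both years."""
    stayers = before.keys() & after.keys()
    return mean(after[s] - before[s] for s in stayers)


print(cross_sectional_change(year1, year2))  # school average rises
print(longitudinal_gain(year1, year2))       # continuing students gain nothing
```

In this invented case the school average rises by 10 points although no continuing student's score changed. The longitudinal measure, as sketched, says nothing about the students who left the testing pool, so it would still need to be read alongside enrollment, classification, and dropout data.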
In some cases, policies that reward or punish schools for average test scores have created a distorted view of accountability, one in which beating the numbers by manipulating student placements overwhelms efforts to serve students’ educational needs well. These policies may also further exacerbate existing incentives for talented staff to opt for school placements where students are easy to teach and school stability is high. Capable staff are less likely to risk losing rewards or incurring sanctions by volunteering to teach where many students have special needs and performance standards will be more difficult to attain. This outcome was reported as a result of Florida’s recent use of aggregate test scores, reported as cross-sectional averages and unadjusted for student characteristics, for school rewards and sanctions. Qualified teachers were leaving the schools rated D or F “in droves” according to news reports at the start of the 1999 school year (DeVise, 1999; Fischer, 1999), to be replaced by teachers without experience and often without training. As one principal queried, “Is anybody going to want to dedicate their lives to a school that has already been labeled a failure?”
Ironically, this approach to accountability compromises even further the educational chances of disadvantaged students, who are already served by a disproportionate share of those teachers who are inexperienced, unprepared, and underqualified. This outcome will be further exacerbated by policies that plan to reduce federal funds to schools that have lower test scores. Critics have argued that applying sanctions to schools with lower test score performance penalizes already disadvantaged students twice over: having given them inadequate schools to begin with, society now punishes them again for failing to perform as well as other students who attend schools with greater resources. Such sanctions can discourage good schools from opening their doors to educationally needy students and place more emphasis on manipulating scores by eliminating or keeping out low-scoring students than on improving schools.
These outcomes have been noted of reforms in several states. For example, after the Regents Test reforms of the early 1980s in New York State, studies found evidence of schools retaining students and placing them in special education to increase average school performance in critical grade levels used as benchmarks for accountability policies (Allington & McGill-Franzen, 1992) and encouraging low-scoring secondary students to leave school entirely (Smith et al., 1986). By 1992, New York’s graduation rates had dropped to only 62%, leaving the state ranked 45th in the country on this measure (Feistritzer, 1993).
Similarly, Atlanta, Georgia, instituted a pupil progression policy in 1980 based on test score thresholds for each elementary grade. High failure rates and repeated retentions led to increased dropout rates. The high school completion rate in Atlanta dropped to 65% by 1982 and to 61% by 1988. A 1988 state policy set up additional test thresholds for promotion and graduation. This policy exacerbated the declines in graduation in Atlanta and elsewhere across the state. As Gary Orfield and Carole Ashkinaze (1991) noted:
Although most of the reforms were popular, the policymakers and educators simply ignored a large body of research showing that they would not produce academic gains and would increase dropout rates. In other words, this was a policy with no probable educational benefits and large costs. The benefits were political and the costs were borne by at-risk students. The damage was psychological as well as educational, increasing the likelihood that at-risk students would drop out before receiving their diplomas; school districts were also hurt by the diversion of resources to repetitive years of education for many students. (p. 139)
An analysis of the test-based reform strategies enacted in 1983 and 1984 in Georgia and South Carolina, both of which tied rewards and sanctions to annual tests at each grade level, found that neither state realized gains in achievement on the National Assessment of Educational Progress during the 1990s, although both experienced declines in high school graduation rates (Darling-Hammond, 2000). (See Figure 1.)
Figure 1. Student Achievement in Reading, National Assessment of Educational Progress, 1992–1998
Recent analyses of test-based reforms instituted in Texas in the 1980s have pointed to these and other problems. Although ostensible gains in scores on the TAAS tests have caused the state’s reforms to be hailed as the Texas Miracle, a number of studies have suggested that the outcomes may be less positive than they appear. First, studies by the Center for Research and Evaluation on Testing (Haney, 2000) and by the Intercultural Development Research Association (1996) have found that both retention rates in ninth grade and dropout or attrition rates for high school students increased substantially since the 1980s. Both studies found that fewer than 50% of African American and Latino ninth graders progress to graduation 4 years later, and only about 70% of White ninth graders reach graduation. Haney (2000) found evidence that a growing number of low-scoring students leave school as early as eighth or ninth grade, before their scores are factored into school accountability rankings. The effects are most pronounced for students of color:
In 1990–91, Black and Hispanic high school graduates relative to the number of Black and Hispanic students enrolled in grade 9 three years earlier fell to less than 0.50 and this ratio remained just about at or below this level from 1992 to 1999. (The corresponding ratio had been about 0.60 in the late 1970s and early 1980s). . . . From 1977 until about 1981 rates of grade 9 retention were similar for Black, Hispanic, and White students, but since about 1982, the rates at which Black and Hispanic students are denied promotion and required to repeat grade 9 have climbed steadily, such that by the late 1990s, nearly 30% of Black and Hispanic students were “failing” grade 9 and required to repeat that grade.
Haney’s report and Texas Education Agency (TEA) analyses agree that dropout rates in Texas are substantially higher for students retained in ninth grade than for any other group.
TEA data find that rates of dropping out are at least 3 times higher for this group, even though they provide a rosier picture of overall graduation rates, since they do not count as dropouts the large number of students who are transferred to GED programs and fail to finish them.
Several recent studies have produced empirical data that cast doubt on the gains noted on the state TAAS tests, observing that Texas students have not made comparable gains on national standardized tests or on the state’s own college entrance test (Haney, 2000; Gordon & Reese, 1997; Hoffman et al., in press; Klein et al., 2000; Stotsky, 1998). These studies have variously suggested that teaching to the test may be raising scores on the state high-stakes test in ways that do not generalize to other tests that examine a broader set of higher order skills; that many students are excluded from the state tests to prop up average scores; and that passing scores have been lowered and the tests have been made easier over time to give the appearance of gains.
The American Psychological Association, American Educational Research Association, and the National Council on Measurement in Education have issued standards for the use of tests that indicate that test scores are too limited and unstable a measure to be used as the sole source of information for any major decision about student placement or promotion. A recent report of the National Research Council on high stakes testing concluded:
Scores from large-scale assessments should never be the only sources of information used to make a promotion or retention decision. . . . Test scores should always be used in combination with other sources of information about student achievement. (Heubert and Hauser, 1999, p. 286).
The test-based accountability systems in dozens of states and urban school systems stand in contravention of these professional standards. However, the negative effects of grade retention and graduation sanctions should not become an argument for social promotion─ that is, the practice of moving students through the system without ensuring that they acquire the skills that they need. What are the alternatives? There are at least four complementary strategies that evidence suggests can improve student learning without grade retention:

Enhancing preparation and professional development for teachers to ensure that they have the knowledge and skills they need to teach a wider range of students to meet the standards;
Redesigning school structures to support more intensive learning─ including creating smaller school units (within an optimal size of 300–500) and schools that team teachers to work with smaller total numbers of students for longer periods of time;
Employing school-wide and classroom performance assessments that support more coherent curriculum and better inform teaching; and
Ensuring that targeted supports and services are available for students when they are needed.

Some urban districts have used these strategies to upgrade student learning and to create a more genuine accountability to parents and students. Though all of these districts continue to face difficulties and challenges, their substantial successes offer a very different model for standards-based reform, one that rests on the use of standards and assessments as a stimulus for professional development and curricular reform rather than as punishments for schools and students. Three examples are offered here: the statewide reforms in Connecticut that have supported substantial improvements in a number of cities (featured here are New Britain, Norwalk, and Middletown─ among the state’s lowest-income and once lowest-achieving districts); New York City’s School District #2; and New Haven, California.
Connecticut
Connecticut provides an especially instructive example of how state level policy makers have used a standards-based starting point to upgrade teachers’ knowledge and skills as a means of improving student learning. Since the early 1980s, the state has pursued a purposeful and comprehensive teaching quality agenda. The Connecticut case is a story of how bipartisan state policy makers implemented a coherent policy package over more than 15 years. They used teaching standards, followed later by student standards, to guide investments in school finance equalization, teacher salary increases tied to higher standards for teacher education and licensing, curriculum and assessment reforms, and a teacher support and assessment system that strengthened professional development.
Connecticut’s teacher assessments and preparation requirements ensure that every entering teacher has strong content and pedagogical knowledge to enable him or her to teach a wide range of diverse learners well─ including those who have special education needs and English language learning needs. Standards-based professional development opportunities have dramatically upgraded the knowledge and skills of the veteran teaching population. Student assessments are aimed at higher order thinking and performance skills and are used to evaluate and continually improve practice. While the public reporting system places strong pressure on districts and schools to improve their practice, the student assessments are not used for rewards or punishments for students, teachers, or schools. Rather than pursue a single silver bullet or a punitive approach that creates dysfunctional responses, Connecticut has made ongoing investments in improving teaching and schooling through high standards and high supports.
Dramatic gains in student achievement (accompanied by increases rather than declines in student graduation rates) and a plentiful supply of well-qualified teachers are two major outcomes of this agenda. By 1998, Connecticut’s fourth grade students ranked first in the nation in reading and mathematics on the National Assessment of Educational Progress (NAEP), despite increased student poverty and language diversity in the state’s public schools during that decade (National Center for Education Statistics, 1997b; National Education Goals Panel, 1999). (See Figure 1.) The proportion of Connecticut eighth graders scoring at or above proficient in reading was also first in the nation, and Connecticut was not only the top performing state in writing, but the only one to perform significantly better than the U.S. average. A 1998 study linking the NAEP with the Third International Math and Science Study (TIMSS) found that, in the world, only top-ranked Singapore outscored Connecticut students in science (Baron, 1999). The achievement gap between white students and the growing minority student population is decreasing, and the more than 25% of Connecticut’s students who are Black or Hispanic substantially outperform their counterparts nationally (Baron, 1999).
In explaining Connecticut’s reading achievement gains, a recent National Education Goals Panel report (Baron, 1999) cited the state’s teacher policies as a critical element, pointing to the 1986 Education Enhancement Act as the linchpin of the teacher reforms. In this omnibus bill, Connecticut coupled major increases in teacher salaries with greater equalization in funding across districts, higher standards for teacher education and licensing, and substantial investments in beginning teacher mentoring and professional development. An initial investment of $300 million was used to boost minimum beginning teacher salaries in an equalizing fashion that made it possible for low-wealth districts to compete in the market for qualified teachers. The average teacher’s salary increased from $29,437 in 1986 to $47,823 in 1991 (Fisk, 1999). These grants were provided on an equalizing basis to enable poor districts to better compete in the market for qualified teachers. Districts were given incentives to hire qualified teachers because salary grants were calculated on the basis of fully certified teachers only, and emergency credentials were phased out.
To further ensure an adequate supply of qualified teachers, the state offered incentives including scholarships and forgivable loans to attract high-ability teacher candidates, especially in high-demand fields, and encouraged well-qualified teachers from other states to come to Connecticut through license transportability reforms. An analysis of the outcomes of this set of initiatives found that they eliminated teacher shortages, even in the cities, and created surpluses of teachers within three years of the Act’s passage (Connecticut State Department of Education, 1990). These surpluses have been maintained since, allowing districts─ including urban school districts─ to be highly selective in their hiring and demanding in their expectations for teacher expertise.
At the same time, the state raised teacher education and licensing standards by requiring a major in the discipline to be taught plus extensive knowledge of teaching and learning as part of preparation (including knowledge for all teachers about literacy development and the teaching of special needs students); instituted performance-based examinations in subject matter and knowledge of teaching as a basis for receiving a license; created a state-funded beginning teacher mentoring program which supported trained mentors for beginning teachers in their first year on the job; and created a sophisticated assessment program using state-trained assessors for determining who could continue in teaching after the initial year.
Connecticut also required teachers to earn a master’s degree in education for a continuing license and supported new professional development strategies in universities and school districts. Recently, the state has further extended its performance-based licensing system to incorporate the new INTASC standards1 and to develop portfolio assessments modeled on those of the National Board for Professional Teaching Standards. As part of ongoing teacher education reforms, the state agency has supported the creation of professional development schools linked to local universities and more than 100 school-university partnerships. In addition, Connecticut has developed courses on teacher and student standards that can be applied toward the required master’s degree. The state also funds and operates a set of Institutes for Teaching and Learning.
Connecticut’s portfolio assessments for beginning teacher licensing are modeled on those of the National Board for Professional Teaching Standards; they examine directly whether a teacher is able to teach to Connecticut’s student learning standards in specific content areas. The performance assessments examine teacher plans, videotapes of lessons, student work, and teacher analyses of their practice. They are developed with the assistance of teachers, teacher educators, and administrators: Hundreds of educators are convened to provide feedback on drafts of the standards, and many more are involved in the assessments themselves, as cooperating teachers and school-based mentors who work with beginning teachers on developing their practice, as assessors who are trained to score the portfolios, and as expert teachers and teacher educators who convene regional support seminars to help candidates learn about the standards and the portfolio development process. Preparation is organized around the examination of cases and the development of evidence connected to the standards.
Together, these activities have had far-reaching effects. By one estimate, more than 40% of Connecticut’s teachers have gone through the process as new teachers or have served as assessors, mentors, or cooperating teachers. By the year 2010, 80% of elementary teachers, and nearly as many secondary teachers, will have participated in the new assessment system as candidates, support providers, or assessors. Because the assessments focus on the development of teacher competence, are tightly tied to student standards, and lead to sophisticated analysis of practice, the assessment system serves as a focal point for improving teaching and learning.
In addition to the state’s major investments in teaching quality, the Goals Panel report also pointed to the thoughtful use of student standards and assessments in Connecticut. In 1987, following the teaching reforms, student learning standards were adopted in an early effort to link teacher education standards with expectations for teaching. In 1993–1994, the student standards were updated to emphasize higher order thinking skills and performance abilities, and new assessments were developed; these include constructed response and performance assessments that measure reading and writing authentically and reflect more challenging learning goals than the previous tests.
Also critical is the fact that, in line with professional standards for testing, the law precludes the use of these assessments for promotion or graduation of students. Instead, they are used for ongoing improvements in curriculum and teaching. The Goals Panel report noted the benefits of the state’s low-stakes testing approach, which emphasizes reporting and analysis strategies that support the wide dissemination of the standards and test objectives along with widespread professional development around literacy and the teaching of reading. The State Department of Education also supports the use of test results for educational improvement by giving districts computerized data that allow analyses at the district, school, teacher, and individual pupil level. The Department assists districts in analyzing the data in ways that permit diagnosis of needs and areas for concentrated work (Baron, 1999). The state then provides targeted resources to the neediest districts to help them improve, including funding for professional development for teachers and administrators, preschool and all-day kindergarten for students, and smaller pupil-teacher ratios, among other supports.
The Goals Panel study notes that this approach to assessment has enabled districts to clarify their teaching priorities and has helped galvanize district efforts to make major revisions and improvements in their reading instruction. At the same time, the targeted provision of resources to the state’s neediest districts through categorical grants has enabled these districts to enhance their reading initiatives and to begin to close the gap between their scores and those statewide (Baron, 1999).
Among the 10 Connecticut districts that made the greatest progress in reading between 1990 and 1998, three─ New Britain, Norwalk, and Middletown─ are urban school systems in the group identified as the state’s “neediest” districts based on the percentage of students eligible for free lunch programs and their state test scores.
| District | Grade Level | 1993 CMT Index Score | 1998 CMT Index Score | Gain in Average CMT Score |
|---|---|---|---|---|
| State average | Grade 4 | 56.9 | 65.5 | +8.6 |
| | Grade 6 | 68.0 | 74.2 | +6.2 |
| | Grade 8 | 69.9 | 75.5 | +5.6 |
| Middletown | Grade 4 | 51.8 | 65.7 | +13.9 |
| | Grade 6 | 67.0 | 74.2 | +7.2 |
| | Grade 8 | 64.7 | 75.6 | +10.9 |
| Norwalk | Grade 4 | 46.6 | 58.6 | +12.0 |
| | Grade 6 | 55.3 | 62.7 | +7.4 |
| | Grade 8 | 53.8 | 66.4 | +12.6 |
| New Britain | Grade 4 | 36.3 | 47.4 | +11.1 |
| | Grade 6 | 35.0 | 45.6 | +10.6 |
| | Grade 8 | 38.5 | 52.3 | +13.8 |
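The comparison these figures support can be made explicit with a short calculation. The Python sketch below uses only the CMT index scores reported above; the dictionary layout and the helper names `gain` and `gap_to_state` are choices made for this illustration and are not part of the state's reporting system. It recomputes each gain and checks whether each district's distance from the state average narrowed between 1993 and 1998.

```python
# CMT index scores as (1993, 1998) pairs, keyed by district and grade,
# transcribed from the figures reported above.
cmt = {
    "State average": {4: (56.9, 65.5), 6: (68.0, 74.2), 8: (69.9, 75.5)},
    "Middletown": {4: (51.8, 65.7), 6: (67.0, 74.2), 8: (64.7, 75.6)},
    "Norwalk": {4: (46.6, 58.6), 6: (55.3, 62.7), 8: (53.8, 66.4)},
    "New Britain": {4: (36.3, 47.4), 6: (35.0, 45.6), 8: (38.5, 52.3)},
}


def gain(district, grade):
    """Change in index score from 1993 to 1998, rounded to one decimal place."""
    score_1993, score_1998 = cmt[district][grade]
    return round(score_1998 - score_1993, 1)


def gap_to_state(district, grade, year_index):
    """State average minus district score; year_index 0 is 1993, 1 is 1998."""
    state_score = cmt["State average"][grade][year_index]
    district_score = cmt[district][grade][year_index]
    return round(state_score - district_score, 1)


for district in ("Middletown", "Norwalk", "New Britain"):
    for grade in (4, 6, 8):
        gap_1993 = gap_to_state(district, grade, 0)
        gap_1998 = gap_to_state(district, grade, 1)
        print(
            f"{district}, grade {grade}: gain {gain(district, grade):+.1f} "
            f"(state {gain('State average', grade):+.1f}), "
            f"gap to state {gap_1993:.1f} -> {gap_1998:.1f}"
        )
```

On these figures, each of the three districts gained more than the state average at every grade level reported, so every gap to the state average narrowed over the period.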
Follow-up studies in these districts identified a number of state-level policies and related local strategies as contributing to this success (Baron, 1999). Among them were teacher policies that have enabled districts to hire and retain highly qualified teachers who had been prepared to teach a wide range of learners, and the required beginning teacher program that provided state training for all mentors, thus increasing the knowledge and skills of veteran teachers along with beginners involved with the program. In addition, district respondents described state- and locally supported intensive professional development around the teaching of reading. Consistent with the student standards and the state assessments, professional development funds were orchestrated to improve teachers’ knowledge of how to teach reading through a balanced approach to whole language and skill-based instruction, how to address reading difficulties through specific intervention strategies, and how to diagnose and treat specific learning disabilities. Most of the districts had developed cadres of teacher trainers or coaches who were experts in literacy development and who were available to work with colleagues in the schools, offering demonstration teaching as well as classroom coaching. A number used state grants to sponsor intensive summer literacy workshops focused on the teaching of at-risk readers.
The approaches to reading instruction used in sharply improving districts rely on the enhanced teacher knowledge spurred in Connecticut’s teacher education reforms and represented in the state’s teaching assessments: systematic teaching of reading and spelling skills (including linguistics training that goes beyond basic phonemic awareness); use of authentic reading materials─ children’s literature, periodicals, and trade books─ along with daily writing and discussion of ideas; ongoing assessment of students’ reading proficiency through strategies like running records, miscue analyses, and analysis of reading, writing, and speaking samples; and intervention strategies for students with reading delays, such as Reading Recovery, which was used in 9 of the 10 sharply improving districts and is widely used across the state (Baron, 1999).
District administrators noted the importance of the system’s coherence in allowing them to pursue these sophisticated strategies for teaching and learning. In addition to their work on teacher development, they described how they had realigned district curriculum and instruction to the student learning standards and assessments, and how they had used the rich information about student performance made available by the CSDE as the basis for school problem solving and teachers’ individual growth plans (the latter are part of the teacher evaluation system). They also credited the fact that the state assessments measured reading and writing in authentic ways, the preparation and professional development programs were supportive of the same approaches, and beginning teachers were coming to them better prepared to teach to these standards using successful pedagogical strategies, while veterans also had many opportunities to develop.
The quality of teaching in Connecticut can be traced directly to the implementation of an increasingly well-developed statewide infrastructure that has been designed to encourage high-quality teaching by (a) linking salaries to high standards for preparing, entering, and remaining in teaching, (b) providing intensive support and assessment of beginning teachers, and (c) requiring and supporting continued high-quality professional development for teachers and administrators. These factors have helped establish a foundation of professional expertise that can ensure the success of other organizational policies and practices, such as analysis of student achievement results, linking school improvement plans and teacher evaluations to student achievement, and aligning expectations and assessments for students with high standards for teachers.
New York City District #2
A remarkably similar set of strategies has produced similar results in New York City’s Community School District #2, an extremely diverse, multilingual district of 22,000 students of whom more than 70% are students of color and more than half are from families officially classified as having incomes below the poverty level.2 More than 100 different languages are spoken in the collective homes of District #2 students, a large share of whom are recent immigrants. During the decade-long tenure of superintendent Tony Alvarado, from 1987 to 1997, the district rose from 11th to 2nd in the city in student achievement in reading and mathematics, scoring above New York State norms as well as New York City averages, even while the population of the district grew more linguistically diverse.
Studies of District #2 have attributed these gains to the district’s decision to make professional development the central focus of management and the core strategy for school improvement. The strong belief governing the district’s efforts is that student learning will increase as the knowledge of educators grows (Elmore & Burney, 1997). Rather than treating professional development as a discrete function implemented with a set of disparate nonsystemic activities, District 2 makes professional development around common standards of teaching the most important focus of all district efforts, its most prominent discretionary budgetary commitment, and a key part of every leader’s and every teacher’s job.
After consolidating categorical funds and focusing them on a coherent program of professional learning, District 2 moved most of its central office personnel positions back to school sites to focus on the improvement of practice. In a set of moves intently focused on enhancing professional accountability, Alvarado aggressively recruited instructionally knowledgeable teachers and principals, created pointed expectations and opportunities for professional development around the deepening of instructional practice─ first in literacy and then in mathematics─ and replaced through retirements, “counseling out,” and personnel actions those underskilled principals and teachers who were unable or unwilling to develop their practice. Both principals and teachers were expected to learn about best practices in teaching literacy and mathematics, and school leaders were held accountable for their own and their colleagues’ increasing skill, for the quality of instructional practice in their buildings, for recruiting well-prepared new teachers, and for moving ineffective teachers out of the district.
While he was transforming the composition and skill set of the district staff, Alvarado created 17 Option Schools, small alternative schools that reorganized instruction to focus on greater personalization and more performance-based assessments to guide teaching, while encouraging the redesign of other schools. These efforts leveraged the creation of more small schools along with grouping practices that keep teachers and students together for more than one year, schedules that allow collaborative planning and professional development for teachers within the school day, and more coherent, intellectually challenging curriculum supported by ongoing diagnostic and performance assessments of student learning.
School redesign was joined with professional development in a conscious strategy to improve both teachers’ expertise and schools’ ability to support in-depth teaching and learning. Well-known for his efforts to create restructured schools and schools of choice when he was previously superintendent in District #4, Alvarado found that the creation of new alternatives, while useful for the schools where dynamic educators coalesced, did not go far enough in building knowledge for better practice in all schools and classrooms. As he explained, “When I moved to District 2, I was determined to push beyond the District 4 strategy and to focus more broadly on instructional improvement across the board, not just on the creation of alternative programs” (Elmore & Burney, 1997).
Staff development in District 2 differs substantially from the one-shot workshop that expects teachers to take generic ideas unconnected to their ongoing work and apply them in the classroom. Rather, the prevailing theory is that changes in instruction occur when teachers receive continuous support embedded in a coherent instructional system that is focused on the practical details of what it means to teach effectively. The district’s extensive professional development efforts, which have paid off in rapidly rising student achievement, include several vehicles for learning. Instructional consulting services allow expert teachers and consultants to work within schools with groups of teachers in sustained ways to develop particular strategies, such as literature-based reading instruction. Intervisitation and peer networks are designed to bring teachers and principals into contact with exemplary practices. The district budgets for 300 total days each year to provide the time for teachers and principals to visit and observe one another, to develop study groups, and to pair up for work together. Off-site training includes intensive summer institutes that focus on core teaching strategies and on learning about new standards, curriculum frameworks, and assessments. These are always linked to follow-up through consulting services and peer networks to develop practices further. The Professional Development Laboratory allows visiting teachers to spend 3 weeks in the classrooms of expert resident teachers who are engaged in practices they want to learn. Oversight and evaluation of principals focuses on their plans for instructional improvement in each content area, as does evaluation of teachers. There is close, careful scrutiny of teaching from the central office as well as the school and continual pressure and support to improve its quality. As Elmore and Burney (1997) explain:
Shared expertise takes a number of forms in District 2. District staff regularly visit principals and teachers in schools and classrooms, both as part of a formal evaluation process and as part of an informal process of observation and advice. Within schools, principals and teachers routinely engage in grade-level and cross-grade conferences on curriculum and teaching. Across schools, principals and teachers regularly visit other schools and classrooms. At the district level, staff development consultants regularly work with teachers in their classrooms. Teachers regularly work with teachers in other schools for extended periods of supervised practice. Teams of principals and teachers regularly work on districtwide curriculum and staff development issues. Principals regularly meet in each others’ schools and observe practice in those schools. Principals and teachers regularly visit schools and classrooms within and outside the district. And principals regularly work in pairs on common issues of instructional improvement in their schools. The underlying idea behind all these forms of interaction is that shared expertise is more likely to produce change than individuals working in isolation.
A key feature of these strategies is that they have focused intensely for multiple years on a few strands of content-focused training designed to have cumulative impact over the long term, rather than changing workshop topics every in-service day or picking new themes each year. The district has sponsored 8 years of intensive work on teaching strategies for literacy development and 4 years on mathematics teaching. District 2’s approach began with reading and writing because this focus provided a readily available way for the district to demonstrate improvement in academic performance in an area that was important on city-wide assessment measures and because literacy was important in the context of the district’s linguistic and ethnic diversity. New York City’s development of more performance-oriented assessments in reading and mathematics in