Assessing Reading

Volume I. Abilities, Contexts, and Learners
Part 2. Assessing Abilities
William Grabe, Northern Arizona University, USA
Xiangying Jiang, West Virginia University, USA
First published: 06 November 2013

Abstract

The goal of reading assessments is to provide feedback on the skills, processes, and knowledge resources that represent reading abilities. Reading assessments are used for many purposes. However, any appropriate use of reading assessments begins from an understanding of the reading construct, an awareness of the development of reading abilities, and an effort to reflect the construct in assessment tasks. In this chapter, we will first define the construct of reading. Then we will present a straightforward framework that categorizes many uses and purposes for reading assessment, including standardized reading proficiency assessment, classroom reading assessment, assessment for learning, assessment of curricular effectiveness, and assessment for research purposes. For each category in the assessment framework, we will outline and describe a number of major assessment techniques. Finally, we will explore some innovative techniques for reading assessment and discuss challenges and issues for reading assessment.

Introduction

In this chapter, we discuss the construct of reading comprehension abilities in relation to reading assessment, examine prior and current conceptualizations of reading abilities in assessment contexts, and describe why and how reading abilities are assessed. From a historical perspective, the “construct of reading” is a concept that has lagged far behind the formal assessment of reading abilities (leaving aside for the moment the issue of classroom assessment of reading abilities). In fact, the construct of reading comprehension abilities, as well as all the relevant component subskills, knowledge bases, and cognitive processes (hereafter “component skills”), had not been well thought out and convincingly described in assessment contexts until the 1990s. It is interesting to note, in light of this point, a quote by Clapham (1996) on efforts to develop the IELTS reading modules:

We had asked applied linguists for advice on current theories of language proficiency on which we might base the IELTS test battery. However, the applied linguists’ responses were varied, contradictory and inconclusive, and provided little evidence for a construct for EAP tests on which we could base the test. (p. 76)

Similar limitations can be noted for the TOEFL of the 1980s (Taylor & Angelis, 2008) and the earlier versions of the Cambridge ESOL suite of tests (see Weir & Milanovic, 2003; Hawkey, 2009; Khalifa & Weir, 2009). Parallel limitations with classroom-based assessments in second language contexts were evident until fairly recently with the relatively narrow range of reading assessment options typically used (often limited to multiple choice items, true/false items, matching items, and brief open-ended response items). Fortunately, this situation has changed remarkably in the past 15 years, and very useful construct research (and construct statements for assessment purposes) is now available to help conceptualize reading assessment.

The transition from reliability to validity as the driving force behind standardized reading assessment development in the past 20 years has prompted efforts to reconceptualize reading assessment practices. Most importantly, this reconceptualization reflects a more empirically supported reading construct, one that has also led to a wider interpretation of reading purposes generally (Grabe, 2009) and in reading assessment contexts more specifically, for instance, reading to learn and expeditious reading (Enright et al., 2000; Khalifa & Weir, 2009).

Reading assessment itself involves a range of purposes that reflect multiple assessment contexts: standardized proficiency assessment, classroom-based formative and achievement testing, placement and diagnostic testing, assessment for reading research purposes (Grabe, 2009), and assessment-for-learning purposes (Black & Wiliam, 2006). The first two of these contexts take up the large part of this chapter (see Grabe, 2009, for discussion of all five purposes for reading assessment).

In the process of discussing these purposes for reading assessment, questions related to how reading assessments should be carried out are also addressed. The changing discussions of the reading construct, the redesign of standardized assessments for second language learners, and the need to assess aspects of the reading construct that were previously ignored have led to a wide range of assessment task types, some of which had not been given serious consideration until the late 1990s.

Previous Conceptualizations

Reading comprehension ability has a more intriguing history than is commonly recognized, and it is a history that has profoundly affected how reading comprehension is assessed. Before the 20th century, most people did not read large amounts of material silently for comprehension. For the much smaller percentage of test takers in academic settings, assessment emphases were placed on literature, culture, and interpretation involving more subjectively measured items. The 20th century, in its turn, combined a growing need for many more people capable of reading large amounts of text information for comprehension with many more uses of this information in academic and work contexts. In the USA, for example, while functional literacy was estimated at 90% at the turn of the 20th century, this may have been defined simply as completing one or two years of schooling. In the 1930s, functional literacy in the USA was placed at 88%, being defined as a third grade completion rate (Stedman & Kaestle, 1991). The pressure to educate a much larger percentage of the population in informational literacy skills, and silent reading comprehension skills in particular, was driven, in part, by the need for more literate soldiers in World Wars I and II, more literate industrial workers, and increasingly higher demands placed on student performance in educational settings (Pearson & Goodin, 2010).

Within academic settings, the rise of objective testing practices from a rapidly developing field of educational psychology and psychological measurement spurred on large-scale comprehension assessment. However, for the US context, it was only in 1970 that comprehension assessments provided a reliable national picture of English first language (L1) reading abilities, and their patterns of variation, through the NAEP (National Assessment of Educational Progress) testing program and public reports. If broad-based reading comprehension skills assessment has been a relatively recent development, so also has been the development of reading assessment measures that reflect an empirically derived construct of reading abilities.

During the period from the 1920s to the 1960s, objective assessment practices built on psychometric principles were powerful shaping forces for reading assessment in US contexts. In line with these pressures for more objective measurement, L2 contexts were not completely ignored. The first objectively measured foreign language reading test was developed in 1919 (Spolsky, 1995). In the UK, in contrast, there was a strong counterbalancing emphasis on expert validity. In the first half of the 20th century, this traditional validity emphasis sometimes led to more interesting reading assessment tasks (e.g., summarizing, paraphrasing, text interpretation), but also sometimes led to relatively weak assessment reliability (Weir & Milanovic, 2003).

By the 1960s and 1970s, the pressure to deliver objective test items led to the development of the TOEFL as a multiple choice test and led to changes in assessment practices with the Cambridge ESOL suite as well as the precursor of the IELTS (i.e., ELTS and the earlier EPTB, the English Proficiency Test Battery) (Clapham, 1996; Weir & Milanovic, 2003). At the same time, the constraints of using multiple choice and matching items also limited which aspects of reading abilities could be reliably measured. Starting in the 1970s, the pressures of communicative competence and communicative language teaching led to strong claims for the appropriateness of integrative reading assessments (primarily cloze testing). However, from 1980 onwards, the overwhelming output of cognitive research on reading abilities led to a much broader interpretation of reading abilities, one that was built from several component subskills and knowledge bases. From 1990 onward, research on reading comprehension has been characterized by attention to the roles of various component subskills in reading performance and to reading for different purposes (reading to learn, reading for general comprehension, expeditious reading, etc.). This expansion of reading research has also led to more recent conceptualizations of the reading construct as the driving force behind current standardized reading assessment practices.

Current Conceptualizations

In considering current views on reading assessment, we focus primarily on standardized assessment and classroom-based assessment practices. These are the two most widespread uses of reading assessment, and the two purposes that have the greatest impact on test takers. In both cases, the construct of reading abilities is a central issue. The construct of reading has been described recently in a number of ways, mostly with considerable overlap (see Alderson, 2000; Grabe, 2009; Khalifa & Weir, 2009; Adlof, Perfetti, & Catts, 2011). Based on what can now be classified as thousands of empirical research studies on reading comprehension abilities, the consensus that has emerged is that reading comprehension comprises several component language skills, knowledge resources, and general cognitive abilities. The use of these component abilities in combinations varies by proficiency, overall reading purpose, and specific task.

Research in both L1 and L2 contexts has highlighted those factors that strongly impact reading abilities and account for individual differences in reading comprehension performance:
  1. efficient word recognition processes (phonological, orthographic, morphological, and semantic processing);
  2. a large recognition vocabulary (vocabulary knowledge);
  3. efficient grammatical parsing skills (grammar knowledge under time constraints);
  4. the ability to formulate the main ideas of a text (formulate and combine appropriate semantic propositions);
  5. the ability to engage in a range of strategic processes while reading more challenging texts (including goal setting, academic inferencing, monitoring);
  6. the ability to recognize discourse structuring and genre patterns, and use this knowledge to support comprehension;
  7. the ability to use background knowledge appropriately;
  8. the ability to interpret text meaning critically in line with reading purposes;
  9. the efficient use of working memory abilities;
  10. the efficient use of reading fluency skills;
  11. extensive amounts of exposure to L2 print (massive experience with L2 reading);
  12. the ability to engage in reading, to expend effort, to persist in reading without distraction, and to achieve some level of success with reading (reading motivation).

These factors, in various combinations, explain reading abilities for groups of readers reading for different purposes and at different reading proficiency levels. Given this array of possible factors influencing (and explaining) reading comprehension abilities, the major problems facing current L2 assessment development are (a) how to explain these abilities to wider audiences, (b) how best to measure these component skills within constrained assessment contexts, and (c) how to develop assessment tasks that reflect these component skills and reading comprehension abilities more generally.

Standardized Reading Assessment

Major standardized reading assessment programs consider the construct of reading in multiple ways. It is possible to describe the reading construct in terms of purposes for reading, representative reading tasks, or cognitive processes that support comprehension. To elaborate, a number of purposes for engaging in reading can be identified, a number of representative reading tasks can be identified, and a set of cognitive processes and knowledge bases can be considered as constitutive of reading comprehension abilities. Of the three alternative descriptive possibilities, reading purpose provides the most transparent explanation to a more general public as well as to test takers, test users, and other stakeholders. Most people can grasp intuitively the idea of reading to learn, reading for general comprehension, reading to evaluate, expeditious reading, and so on. Moreover, these purposes incorporate several key reading tasks and major component skills (many of which vary in importance depending on the specific purpose), thus providing a useful overarching framework for the “construct of reading” (see Clapham, 1996; Enright et al., 2000; Grabe, 2009; Khalifa & Weir, 2009). This depiction of reading abilities, developed in the past two decades, has also led to a reconsideration of how to assess reading abilities within well recognized assessment constraints. It has also led to several innovations in test tasks in standardized assessments. This trend is exemplified by new revisions to the Cambridge ESOL suite of exams, the IELTS, and the iBT TOEFL.

The Cambridge ESOL suite of exams (KET, PET, FCE, CAE, CPE) has undergone important changes in its conceptualization of reading assessment (see Weir & Milanovic, 2003; Hawkey, 2009; Khalifa & Weir, 2009). As part of the process, the FCE, CAE, and CPE have introduced reading assessment tests and tasks that require greater recognition of the discourse structure of texts, recognition of main ideas, careful reading abilities, facility in reading multiple text genres, and a larger amount of reading itself. Reading assessment tasks now include complex matching tasks of various types, multiple choice items, short response items, and summary writing (once again).

IELTS (the International English Language Testing System) similarly expanded its coverage of the purposes for reading to include reading for specific information, reading for main ideas, reading to evaluate, and reading to identify a topic or theme. Recent versions of the IELTS include an academic version and a general training version. The IELTS academic version increased the amount of reading required, and it includes short response items of multiple types, matching of various types, several complex readings with diagrams and figures, and innovative fill-in summary tasks.

The iBT TOEFL has similarly revised its reading section based on the framework of reader purpose. Four reading purposes were initially considered in the design of iBT TOEFL reading assessment: reading to find information, reading for basic comprehension, reading to learn, and reading to integrate (Chapelle, Enright, & Jamieson, 2008), although reading to integrate was not pursued after the pilot study. iBT TOEFL uses three general item types to evaluate readers’ academic reading proficiency: basic comprehension items, inferencing items, and reading-to-learn items. Reading to learn has been defined as “developing an organized understanding of how the main ideas, supporting information, and factual details of the text form a coherent whole” (Chapelle et al., 2008, p. 111), for which two new tasks, prose summary and schematic table, were included. In addition, the iBT TOEFL uses longer, more complex texts than the ones used in the traditional TOEFL.

In all three of these standardized test systems, revisions drew upon well articulated and empirically supported constructs of reading abilities as they apply to academic contexts. In all three cases, greater attention has been given to longer reading passages, to discourse organization, and to an expanded concept of reading to learn or reading to evaluate. At the same time, a number of component reading abilities are obviously absent, reflecting the limitations of international standardized reading assessment imposed by cost, time, reliability demands, and fairness across many country settings. (Standardized English L1 reading assessment practices are far more complex.) These limited operationalizations of L2 reading abilities are noted by Alderson (2000), Weir and Milanovic (2003), Grabe (2009), and Khalifa and Weir (2009).

Among the abilities that the new iBT TOEFL did not pursue are word recognition efficiency, reading to scan for information, summarizing, and reading to integrate information from multiple texts. Khalifa and Weir (2009) note that the Cambridge suite did not pursue reading to scan, reading to skim, or reading rate (fluency). All three come under the umbrella term “expeditious reading” and, in their analysis, this gap represents a limitation in the way the reading construct has been operationalized in the Cambridge suite (and in IELTS). IELTS revisions had considered including short response items and summary writing. Recent versions have settled for a more limited but still innovative cloze summary task.

Returning to the list of component skills noted earlier, current standardized reading assessment has yet to measure a full range of component abilities of reading comprehension (and may not be able to do so in the near future). Nonetheless, an assessment of reading abilities should reflect, as far as possible, the abilities a skilled reader engages in when reading for academic purposes (leaving aside adult basic literacy assessments and early child reading assessments). The following is a list of the component abilities of reading comprehension that are not yet well incorporated into L2 standardized reading assessment (from Grabe, 2009, p. 357):
  1. passage reading fluency and reading rate,
  2. automaticity and rapid word recognition,
  3. search processes,
  4. morphological knowledge,
  5. text structure awareness and discourse organization,
  6. strategic processing abilities,
  7. summarization abilities (and paraphrasing),
  8. synthesis skills,
  9. complex evaluation and critical reading.

How select aspects of these abilities find their way into standardized L2 reading assessment practices is an important challenge for the future.

Although researchers working with standardized reading tests have made a serious effort to capture crucial aspects of the component abilities of reading comprehension (e.g., Khalifa & Weir, 2009; Chapelle et al., 2008; Hawkey, 2009), construct validity still represents a major challenge for L2 reading assessment because the number and the types of assessment tasks are strictly constrained in the context of standardized testing. If the construct is under-represented by the test, it is difficult to claim that reading comprehension abilities are being fully measured. This difficulty also suggests that efforts to develop an explanation of the reading construct from L2 reading tests face the challenge of construct under-representation in the very tests being used to develop the construct (a fairly common problem until recently). Perhaps with greater use of computer technology in testing, the control over time for individual items or sections can be better managed, and innovative item types can be incorporated without disrupting assessment procedures. In addition, as suggested by Shiotsu (2010), test taker performance information recorded by computers may not only assist decision making but might also be used for diagnostic purposes. One of the most obvious potential applications of the computer is to more easily incorporate skimming, reading-to-search, reading fluency, and reading rate measures. Such an extension in the future would be welcome.
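
As a rough illustration of the diagnostic possibility raised here, the sketch below (in Python) works from a hypothetical log of per-passage reading times recorded by a computer-delivered test; the record format, the word counts, and the 120 words-per-minute threshold are invented assumptions for the example, not features of any operational testing system.

```python
# Illustrative sketch only: a hypothetical log of computer-recorded reading
# times used to derive a simple reading-rate indicator for diagnostic feedback.
# The data structure, word counts, and threshold are invented for this example.
from dataclasses import dataclass


@dataclass
class PassageRecord:
    passage_id: str
    word_count: int          # length of the passage in words
    reading_seconds: float   # time the test taker spent on the passage


def words_per_minute(record: PassageRecord) -> float:
    """Convert logged reading time into a words-per-minute rate."""
    return record.word_count / (record.reading_seconds / 60)


def flag_slow_passages(records, threshold_wpm=120.0):
    """Return passage IDs read below an (arbitrary) diagnostic threshold."""
    return [r.passage_id for r in records if words_per_minute(r) < threshold_wpm]


# Example usage with made-up data
log = [
    PassageRecord("history_text", 620, 410.0),   # about 91 words per minute
    PassageRecord("biology_text", 580, 250.0),   # about 139 words per minute
]
print(flag_slow_passages(log))  # ['history_text']
```

The same kind of logged timing information could, in principle, also support skimming or reading-to-search measures by attaching task-specific time expectations to individual items.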

Classroom-Based Reading Assessment

Moving on from standardized assessments, the second major use of L2 reading assessments takes place in classroom contexts. In certain respects, classroom-based assessment provides a complement to standardized assessment in that aspects of the reading construct not accounted for by the latter can easily be included in the former. In many classroom-based assessment contexts, teachers observe, note, and chart students’ reading rates, reading fluency, summarizing skills, use of reading information in multistep tasks, critical evaluation skills, and motivation and persistence to read.

Reading assessment in these contexts is primarily used to measure student learning (and presumably to improve student learning). This type of assessment usually involves the measurement of skills and knowledge gained over a period of time based on course content and specific skills practiced. Typically, classroom teachers or teacher groups are responsible for developing the tests and deciding how the scores should be interpreted and what steps to take as a result of the assessment outcomes (Jamieson, 2011). Classroom learning can be assessed at multiple points in any semester, and some commonly used classroom assessments include unit achievement tests, quizzes of various types, and midterm and final exams. In addition to the use of tests, informal and alternative assessment options are also useful for the effective assessment of student learning, using, for example, student observations, self-reporting measures, and portfolios. A key issue for informal reading assessment is the need for multiple assessment formats (and multiple assessment points) to evaluate a wide range of student performances before making any decisions about student abilities or student progress. The many small assessments across many tasks help overcome the subjectivity of informal assessment and strengthen the effectiveness and fairness of informal assessments.

Classroom-based assessment makes use of the array of test task types found in standardized assessments (e.g., cloze, gap-filling formats [rational cloze formats], text segment ordering, text gaps, multiple choice questions, short answer responses, summary writing, matching items, true/false/not stated questions, editing, information transfer, skimming, scanning). Much more important for the validity of classroom assessment, though less commonly recognized, are the day-to-day informal assessments and feedback that teachers regularly provide to students. Grabe (2009) identifies six categories of classroom-based assessment practices and notes 25 specific informal assessment activities that can be, and often are, carried out by teachers. These informal activities include (a) having students read aloud in class and evaluating their reading, (b) keeping a record of student responses to questions in class after a reading, (c) observing how much time students spend on task during free reading or sustained silent reading (SSR), (d) observing students reading with an audiotape or listening to an audiotaped reading, (e) having students list words they want to know after reading and why, (f) having students write simple book reports and recommend books to others, (g) keeping charts of student reading rate growth, (h) having a student read aloud for the teacher/tester and making notes, or using a checklist, or noting miscues on the text, (i) noting students’ uses of texts in a multistep project and discussing these uses, and (j) creating student portfolios of reading activities or progress indicators.

Among these informal assessment activities, it is worth pointing out that oral reading fluency (reading aloud) assessment has attracted much research interest in L1 contexts. Oral reading fluency has been found to serve as a strong predictor of general comprehension (Shinn, Knutson, Good, Tilly, & Collins, 1992; Fuchs, Fuchs, Hosp, & Jenkins, 2001; Valencia et al., 2010). Even with a one-minute oral reading measure, teachers can look into multiple indicators of oral reading fluency (e.g., rate, accuracy, prosody, and comprehension) and obtain a fine-grained understanding of students’ reading ability, particularly if multiple aspects of student reading performances are assessed (Kuhn, Schwanenflugel, & Meisinger, 2010; Valencia et al., 2010). However, comparable research on fluency assessment has not been carried out in L2 reading contexts. The use of reading aloud as an L2 reading assessment tool would benefit from research on the validity of oral reading fluency measures in L2 contexts.
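
To make the scoring of a one-minute oral reading measure concrete, the short sketch below computes two of the indicators mentioned above, rate (expressed as words correct per minute) and accuracy; the function name and the example figures are hypothetical and are offered only as an illustration of the arithmetic, not as part of any published instrument.

```python
# Illustrative scoring of a one-minute oral reading sample: rate as words
# correct per minute (WCPM) and accuracy as the proportion of attempted words
# read correctly. The function name and example figures are hypothetical.

def score_oral_reading(words_attempted: int, errors: int, seconds: float = 60.0):
    """Return (words correct per minute, accuracy) for one timed oral reading."""
    words_correct = words_attempted - errors
    wcpm = words_correct / (seconds / 60.0)
    accuracy = words_correct / words_attempted
    return wcpm, accuracy


# Example: a learner attempts 95 words in one minute and makes 7 errors
wcpm, accuracy = score_oral_reading(words_attempted=95, errors=7)
print(round(wcpm), round(accuracy, 2))  # 88 0.93
```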

Another aspect of classroom-based assessment that is gaining in recognition is the concept of assessment for learning (Black & Wiliam, 2006; Wiliam, 2010). This approach draws on explicit classroom tests, informal assessment practices, and opportunities for feedback from students to teachers that indicate a need for assistance or support. The critical goal of this assessment approach is to provide immediate feedback on tasks and to teach students to engage in more effective learning, rather than simply to evaluate their performance. An important element of assessment for learning is the follow-up feedback and interaction between the teacher and the students. Through this feedback, teachers respond with ongoing remediation and fine-tuning of instruction when they observe a lack of understanding or weak student performances. The key is not to provide answers, but to enhance learning, work through misunderstandings that are apparent from student performance, develop effective learning strategies, and encourage student self-awareness and motivation to improve. Grabe (2009) notes 15 ideas and techniques for assessment for learning. Although these ideas and techniques apply to any learning and assessment context, they are ideally suited to reading tasks and reading comprehension development.

Current L2 Reading Assessment Research

In addition to the volume-length publications on assessment development and validation with three large-scale standardized L2 tests (e.g., Clapham, 1996; Weir & Milanovic, 2003; Chapelle et al., 2008; Hawkey, 2009; Khalifa & Weir, 2009) reviewed above, this section will focus on recent journal publications related to reading assessment. We searched through two of the most important assessment journals, Language Testing and Language Assessment Quarterly, for their publications in the past 10 years and found that the recent research on reading assessment focused mainly on the topics of test tasks, reading texts, and reading strategies.

We note here seven studies relevant to conceptualizations of the L2 reading construct and ways to assess the reading construct. The first four studies focus on aspects of discourse structure awareness, complex text analysis tasks, and the role of the texts themselves. Two subsequent studies focus on the role of reading strategies and reading processes in testing contexts. At issue is whether or not multiple choice questions bias text reading in unintended ways. The final study examines the role of memory in reading assessment as a further possible source of bias. Overall, it is important to note that research articles on L2 reading assessment are relatively uncommon in comparison with research on speaking and writing assessment (and performance scoring issues).

Kobayashi (2002) examined the impact of discourse organization awareness on reading performance. Specifically, she investigated whether text organization (association, description, causation, and problem-solution) and response format (cloze, open-ended questions, and summary writing) have a systematic influence on test results of learners at different proficiency levels (high, middle, and low). She found that text organization did not lead to strong performance differences for test formats that measured less integrative comprehension such as cloze tests or for learners of limited L2 proficiency. In contrast, stronger performance differences due to organizational differences in texts were observed for testing formats that measure more integrative forms of comprehension tasks (open-ended questions and summary writing), especially for learners with higher levels of L2 proficiency. The more proficient students benefited from texts with a clear structure for summary writing and open-ended questions. She suggested that “it is essential to know in advance what type of text organization is involved in passages used for reading comprehension tests, especially in summary writing with learners of higher language proficiency” (p. 210). The study confirms previous findings that different test formats seem to measure different aspects of reading comprehension and that text organization can influence comprehension performance on more complex reading tasks.

Yu (2008) also contributed to issues in discourse processing by exploring the use of summaries for reading assessment with 157 Chinese university students in an undergraduate EFL program. The study looked at the relationships between summarizing an L2 text in the L2 versus in the L1, as well as relationships among both summaries (L1 and L2) and an L2 reading measure, an L2 writing measure, and a translation measure. Findings showed that test takers wrote longer summaries in the L1 (Chinese) but were judged to have written better summaries in their L2 (English). Perhaps more importantly, summary writing in Chinese and English only correlated with L2 reading measures at .30 and .26 (r² of .09 and .07 respectively, for only the stronger of two summary quality measures). These weak correlations suggest that summary writing measures something quite different from the TOEFL reading and writing measures used. Yu found no relationships between summary-writing quality and the TOEFL writing or translation measures. In a questionnaire and follow-up interviews, test takers also felt that summary writing was a better indicator of their comprehension abilities than of their writing abilities. While this is only one study in one context, it raises interesting questions about the role of summarizing in reading assessment, which needs to be examined further.
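
To put the size of these relationships in perspective, the squared correlations given in parentheses are simply the proportions of variance shared between the summary scores and the reading measure:

```latex
r_{\mathrm{L1\,summary}} = .30 \;\Rightarrow\; r^{2} = .09, \qquad
r_{\mathrm{L2\,summary}} = .26 \;\Rightarrow\; r^{2} \approx .07
```

That is, even the stronger summary score shares less than 10% of its variance with the reading measure, which underlies the suggestion that summary writing taps a largely distinct ability.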

Trites and McGroarty (2005) addressed the potential impact of more complex reading tasks that go beyond only measures of basic comprehension. The authors reported the design and use of new measures to assess the more complex reading purposes of reading to learn and reading to integrate (see Enright et al., 2000). Based on the analyses of data from both native and non-native speakers, the authors found that new tasks requiring information synthesis assessed something different from basic comprehension, after a lower level of basic academic English proficiency had been achieved. The authors speculated that “the new measures tap additional skills such as sophisticated discourse processes and critical thinking skills in addition to language proficiency” (p. 199).

Green, Unaldi, and Weir (2010) focused on the role of texts, and especially disciplinary text types, for testing purposes. They examined the authenticity of reading texts used in IELTS by comparing IELTS Academic Reading texts with the texts that first year undergraduates most needed to read and understand once enrolled at their universities. The textual features examined in the study included vocabulary and grammar, cohesion and rhetorical organization, genre and rhetorical task, subject and cultural knowledge, and text abstractness. The authors found that the IELTS texts have many of the features of the kinds of text encountered by first year undergraduates and that there are few fundamental differences between them. The findings support arguments made by Clapham (1996) that nonspecialist texts of the kind employed in IELTS can serve as a reasonable substitute for testing purposes.

Rupp, Ferne, and Choi (2006) explored whether or not test takers read in similar ways when reading texts in a multiple choice testing context and when reading texts in non-testing contexts. Using qualitative analyses of data from introspective interviews, Rupp et al. (2006) found that asking test takers to respond to text passages with multiple choice questions induced response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts. The test takers in their study were found to “often segment a text into chunks that were aligned with individual questions and focused predominantly on the microstructure representation of a text base rather than the macrostructure of a situation model” (p. 469). The authors speculated that “higher-order inferences that may lead to an integrated macrostructure situation model in a non-testing context are often suppressed or are limited to grasping the main idea of a text” (p. 469). The construct of reading comprehension that is assessed and the processes that learners engage in seem to have changed as a result of the testing format and text types used. The authors assert that the construct of reading comprehension turns out to be assessment specific and is fundamentally determined through item design and text selection. (This issue of test variability in reading assessments has also been the focus of L1 reading research, with considerable variability revealed across a number of standardized tests; see Keenan, Betjemann & Olson, 2008.)

Cohen and Upton (2007) described reading and test-taking strategies that test takers use to complete reading tasks in the reading sections of the LanguEdge Courseware (2002) materials developed to introduce the design of the new TOEFL (iBT TOEFL). The study sought to determine if there is variation in the types of strategies used when answering three broad categories of question types: basic comprehension item types, inferencing item types, and reading-to-learn item types. Think-aloud protocols were collected as the participants worked through these various item types. The authors reported two main findings: (a) participants approached the reading section of the test as a test-taking task with a primary goal of getting the answers right, and (b) “the strategies deployed were generally consistent with TOEFL's claims that the successful completion of this test section requires academic reading-like abilities” (p. 237). Unlike those in Rupp et al. (2006), the participants in this study were found to draw on their understanding and interpretation of the passage to answer the questions, except when responding to certain item formats such as basic comprehension vocabulary items. However, their subjects used 17 out of 28 test-taking strategies regularly, but only 3 out of 28 reading strategies regularly. So, while subjects may be reading for understanding in academic ways, they are probably not reading academic texts in ways in which they would read these texts in non-testing contexts. In this way, at least, the results of Cohen and Upton (2007) converge with the findings of Rupp et al. (2006).

Finally, Chang (2006) examined whether and how the requirement of memory biases our understanding of readers’ comprehension. The study compared L2 readers’ performance on an immediate recall protocol (a task requiring memory) and on a translation task (a task without the requirement of memory). The study revealed that the translation task yielded significantly more evidence of comprehension than did the immediate recall task, which indicates that the requirement of memory in the recall task may hinder test takers’ abilities to demonstrate fully their comprehension of the reading passage. The results also showed that the significant difference found in learners’ performance between the immediate recall and the translation task held across topics and proficiency levels. This study provides evidence that immediate free recall tasks might have limited validity as a comprehension measure because of their memory demands. Certainly, more research is needed on the role and relevance of memory processes as part of reading comprehension abilities.

Challenges

A number of important challenges face reading assessment practices. One of the most important challenges for reading assessment stems from the complexity of the construct of reading ability itself. Reading comprehension is a multicomponent construct which involves many skills and subskills (at least the 12 listed above). The question remains how such an array of component abilities can best be captured within the operational constraints of standardized testing, what new assessment tasks might be developed, and what component abilities might best be assessed indirectly (Grabe, 2009). In standardized assessment contexts, practices that might expand the reading assessment construct are constrained by concerns of validity, reliability, time, cost, usability, and consequence, which limit the types of reading assessment tasks that can be used. In classroom-based contexts, effective reading assessments are often constrained by relatively minimal awareness among teachers that a range of reading abilities, reflecting the reading construct, need to be assessed.

A second challenge is the need to reconcile the connection between reading in a testing context and reading in non-testing contexts. Whether or not a text or task has similar linguistic and textual features in a testing context to texts in non-test uses (that is, how authentic the text is) does not address what test takers actually do when encountering these texts in a high stakes testing situation. When students read a text as part of standardized assessment, they know that they are reading for an assessment purpose. So, for example, although the characteristics of the academic reading texts used in IELTS were said to share most of the textual characteristics of first year undergraduate textbook materials (Green et al., 2010), the context for standardized assessment may preclude any strong assumption of a match to authentic reading in the “real world” (see, e.g., Rupp et al., 2006; Cohen & Upton, 2007). One outcome is that it is probably not reasonable to demand that the reading done in reading assessments exactly replicate “real world” reading experiences. However, the use of realistic texts, tasks, and contexts should be expected because it supports positive washback for reading instruction; that is to say, texts being used in testing and language instruction are realistic approximations for what test takers will need to read in subsequent academic settings.

A third challenge is how to assess reading strategies, or “the strategic reader.” Rupp et al. (2006) found that the strategies readers use in assessment contexts were different from the ones they use in real reading contexts and that even the construct of reading comprehension is assessment-specific and determined by the test design and text format. On the other hand, Cohen and Upton (2007) found that, although the participants approached the reading test as a test-taking task, the successful completion of the test requires both local and general understanding of the texts, which reflects academic-like reading abilities. This debate leaves open a key question: If readers use strategies differently in non-testing contexts and in testing contexts, how should we view the validity of reading assessments (assuming strategy use is a part of the reading construct)? Clearly, more research is needed on the use of, and assessment of, reading strategies in testing contexts.

A fourth challenge is the possible need to develop a notion of the reading construct that varies with growing proficiency in reading. In many L2 reading assessment situations, this issue is minimized (except for the Cambridge ESOL suite of language assessments). Because English L2 assessment contexts are so often focused on EAP contexts, there is relatively little discussion of how reading assessments should reflect a low-proficiency interpretation of the L2 reading construct (whether for children, or beginning L2 learners, or for basic adult literacy populations). It is clear that different proficiency levels require distinct types of reading assessments, especially when considering research in L1 reading contexts (Paris, 2005; Adlof et al., 2011). In L2 contexts, Kobayashi (2002) found that text organization and response format have an impact on the performance of readers at different proficiency levels. The implication of this finding is that different texts, tasks, and task types are appropriate at different proficiency levels. In light of this finding, how should reading assessment tasks and task types change with growing L2 proficiency? Can systematic statements be made in this regard? Should proficiency variability be reflected at the level of the L2 reading construct and, if so, how?

Future Directions

In some respects, the challenges to L2 reading assessment and future directions for reading assessment are two sides of the same coin. In closing this chapter, we suggest five future directions as a set of issues that L2 reading assessment research and practice should give more attention to. These directions do not necessarily reflect current conflicts in research findings or immediate challenges to the validity of reading assessment, but they do need to be considered carefully and acted upon in the future.

First, different L2 reading tests likely measure students differently. This is not news to reading assessment researchers, but this needs to be explored more explicitly and systematically in L2 reading contexts. Standardized assessment programs may not want to know how their reading tests compare with other reading tests, so this is work that might not be carried out by testing corporations. At the same time, such work can be expensive and quite demanding on test takers. Nonetheless, with applied linguists regularly using one or another standardized test for research purposes, it is important to know how reading measures vary. One research study in L1 contexts (Keenan et al., 2008) has demonstrated that widely used L1 reading measures give different sets of results for the same group of test takers. Work of this type would be very useful for researchers studying many aspects of language learning.

Second, the reading construct is most likely under-represented by all well-known standardized reading assessment systems. A longer-term goal of reading assessment research should be to try to expand reading measures to more accurately reflect the L2 reading construct. Perhaps this work can be most usefully carried out as part of recent efforts to develop diagnostic assessment measures for L2 reading because much more detailed information could be collected in this way. Such work would, in turn, improve research on the L2 reading construct itself. At issue is the extent to which we can (and should) measure reading passage fluency, main idea summarizing skills, information synthesis from multiple text sources, strategic reading abilities, morphological knowledge, and possibly other abilities.

Third, L2 readers are not a homogeneous group and they bring different background knowledge when reading L2 texts. They vary in many ways in areas such as cultural experiences, topic interest, print environment, knowledge of genre and text structures, and disciplinary knowledge. In order to control for unnecessary confounding factors related to these differences in prior knowledge, more attention should be paid to issues of individual variation, especially in classroom-based assessments, so no test takers are advantaged or disadvantaged due to these differences.

Fourth, computers and new media are likely to alter how reading tests and reading tasks evolve. Although we believe that students in reading for academic purposes contexts are not going to magically bypass the need to read print materials and books for at least the near future, we need to recognize that the ability to read online texts is becoming an important part of the general construct of reading ability. As a result, more attention needs to be paid to issues of reading assessment tied to the reading of online texts, especially when research has indicated a low correlation between students’ effectiveness as print readers and their effectiveness as online readers (Coiro & Dobler, 2007). At the same time, reading assessment research will need to examine the uses of computer-based assessments and assessments involving new media. A major issue is how to carry out research that is fair, rigorous, and relatively free of enthusiastic endorsements or the selling of the “new” simply because it is novel.

Finally, teachers need to be trained more effectively to understand appropriate assessment practices. A large number of teachers still have negative attitudes to the value of assessment measures for student evaluation, student placement, and student learning. In many cases, L2 training programs do not require an assessment course, or the course is taught in a way that seems to turn off future teachers. As a consequence, teachers allow themselves to be powerless to influence assessment practices and outcomes. In such settings, teachers, in effect, cheat themselves by being excluded from the assessment process, and they are not good advocates for their students. Perhaps most importantly, teachers lose a powerful tool to support student learning and to motivate students more effectively. The problem of teachers being poorly trained in assessment practices is a growing area of attention in L1 contexts; it should also be a more urgent topic of discussion in L2 teacher-training contexts.

SEE ALSO: Assessing Literacy; Assessing Integrated Skills; Large-Scale Assessment; Defining Constructs and Assessment Design; Adapting or Developing Source Material for Listening and Reading Tests; Fairness and Justice in Language Assessment; Classroom-Based Assessment Issues for Language Teacher Education; Ongoing Challenges in Language Assessment
