A Framework for Evaluation and Use of Automated Scoring
David M. Williamson, Xiaoming Xi, and F. Jay Breyer
Educational Testing Service, Rosedale Road, Princeton, NJ 08541; [email protected]

Abstract
A framework for the evaluation and use of automated scoring of constructed-response tasks is provided that entails both the evaluation of automated scoring and guidelines for its implementation and maintenance amid constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting the use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring on tests used for high-stakes purposes. These guidelines are intended to generalize both to new automated scoring systems and to existing systems as they change over time.
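To make the human-machine agreement evidence concrete, the following is a minimal Python sketch, not drawn from the article itself, of statistics commonly computed when comparing automated and human scores: quadratically weighted kappa, Pearson correlation, and the standardized mean score difference. The score data and the flagging thresholds in the comments are illustrative assumptions.

import numpy as np

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratically weighted kappa between two integer score vectors."""
    human = np.asarray(human)
    machine = np.asarray(machine)
    n_cats = max_score - min_score + 1
    # Observed contingency table of human vs. machine scores, as proportions.
    observed = np.zeros((n_cats, n_cats))
    for h, m in zip(human, machine):
        observed[h - min_score, m - min_score] += 1
    observed /= observed.sum()
    # Expected table under independence of the two score sources.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights, normalized to [0, 1].
    idx = np.arange(n_cats)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_cats - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

def standardized_mean_difference(human, machine):
    """(Machine mean - human mean) scaled by a pooled standard deviation."""
    human = np.asarray(human, dtype=float)
    machine = np.asarray(machine, dtype=float)
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2.0)
    return (machine.mean() - human.mean()) / pooled_sd

# Hypothetical essay scores on a 1-6 scale (assumed example data).
human = [3, 4, 2, 5, 3, 4, 1, 6, 3, 4]
machine = [3, 4, 3, 5, 3, 3, 2, 6, 4, 4]
qwk = quadratic_weighted_kappa(human, machine, 1, 6)
smd = standardized_mean_difference(human, machine)
r = np.corrcoef(human, machine)[0, 1]
print(f"QWK = {qwk:.3f}, r = {r:.3f}, SMD = {smd:.3f}")
# Illustrative screening rules (assumed, not quoted from the abstract):
# e.g., flag tasks where QWK < 0.70 or |SMD| > 0.15 for closer review.

In practice such statistics would be computed per task and per subgroup, and compared against human-human agreement on the same responses, so that any degradation introduced by the automated scores can be quantified rather than inferred.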