A Framework for Evaluation and Use of Automated Scoring
David M. Williamson, Xiaoming Xi, and F. Jay Breyer
Educational Testing Service, Rosedale Road, Princeton, NJ 08541; [email protected]

Abstract
A framework for the evaluation and use of automated scoring of constructed-response tasks is provided that entails both the evaluation of automated scoring and guidelines for its implementation and maintenance amid constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting the use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring on tests used for high-stakes purposes. These guidelines are intended to generalize both to new automated scoring systems and to existing systems as they change over time.
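To make the human-machine agreement evidence concrete, the following is a minimal Python sketch, not drawn from the article itself, of statistics commonly computed when comparing automated and human scores: quadratically weighted kappa, Pearson correlation, and the standardized mean score difference. The score data and the flagging thresholds in the comments are illustrative assumptions.

import numpy as np

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratically weighted kappa between two integer score vectors."""
    human = np.asarray(human)
    machine = np.asarray(machine)
    n_cats = max_score - min_score + 1
    # Observed contingency table of human vs. machine scores, as proportions.
    observed = np.zeros((n_cats, n_cats))
    for h, m in zip(human, machine):
        observed[h - min_score, m - min_score] += 1
    observed /= observed.sum()
    # Expected table under independence of the two score sources.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights, normalized to [0, 1].
    idx = np.arange(n_cats)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_cats - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

def standardized_mean_difference(human, machine):
    """(Machine mean - human mean) scaled by a pooled standard deviation."""
    human = np.asarray(human, dtype=float)
    machine = np.asarray(machine, dtype=float)
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2.0)
    return (machine.mean() - human.mean()) / pooled_sd

# Hypothetical essay scores on a 1-6 scale (assumed example data).
human = [3, 4, 2, 5, 3, 4, 1, 6, 3, 4]
machine = [3, 4, 3, 5, 3, 3, 2, 6, 4, 4]
qwk = quadratic_weighted_kappa(human, machine, 1, 6)
smd = standardized_mean_difference(human, machine)
r = np.corrcoef(human, machine)[0, 1]
print(f"QWK = {qwk:.3f}, r = {r:.3f}, SMD = {smd:.3f}")
# Illustrative screening rules (assumed, not quoted from the abstract):
# e.g., flag tasks where QWK < 0.70 or |SMD| > 0.15 for closer review.

In practice such statistics would be computed per task and per subgroup, and compared against human-human agreement on the same responses, so that any degradation introduced by the automated scores can be quantified rather than inferred.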