Grade Like a Human: Rethinking Automated Assessment with Large Language Models
Authors: Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan
Year: 2024
Source:
https://arxiv.org/abs/2405.19694
TLDR:
The paper presents "Grade-Like-a-Human", a multi-agent grading system designed to improve the performance of large language models (LLMs) on grading tasks. Drawing on human grading practice, it divides the grading process into three stages: rubric generation, grading, and post-grading review. The paper also introduces a new dataset, OS, collected from a university operating-systems course, and evaluates the system on both OS and the widely used Mohler dataset, discussing the challenges and limitations of LLM-based grading and how a systematic pipeline addresses them.
Abstract
While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.
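To make the three components concrete, here is a minimal sketch of how such a pipeline could be wired together. It assumes a generic `call_llm` helper, and all function names, prompts, and the 0-10 score scale are illustrative placeholders rather than the paper's actual implementation.

```python
# Minimal sketch of a three-stage grading pipeline as the abstract describes it.
# All names (call_llm, generate_rubric, grade_answer, review_grade) are
# illustrative placeholders, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class GradeResult:
    score: float
    feedback: str


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to any LLM backend."""
    raise NotImplementedError


def generate_rubric(question: str, sample_answers: list[str]) -> str:
    # Stage 1: rubric generation conditioned on the question AND a sample of
    # student answers, so the rubric reflects how students actually respond.
    prompt = (
        f"Question:\n{question}\n\nSample student answers:\n"
        + "\n---\n".join(sample_answers)
        + "\n\nWrite a point-by-point grading rubric for this question."
    )
    return call_llm(prompt)


def grade_answer(question: str, rubric: str, answer: str) -> GradeResult:
    # Stage 2: score each answer against the rubric and produce feedback.
    reply = call_llm(
        f"Rubric:\n{rubric}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Return a score (0-10) on the first line, then feedback."
    )
    first, _, rest = reply.partition("\n")
    return GradeResult(score=float(first), feedback=rest.strip())


def review_grade(rubric: str, answer: str, result: GradeResult) -> GradeResult:
    # Stage 3: post-grading review; a second pass re-checks the score for
    # consistency and fairness before it is finalized.
    verdict = call_llm(
        f"Rubric:\n{rubric}\n\nAnswer:\n{answer}\n\n"
        f"Proposed score: {result.score}. Is this score justified? "
        "Reply 'OK' or a corrected score."
    )
    if verdict.strip() != "OK":
        result.score = float(verdict.strip())
    return result
```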
Method
The paper proposes "Grade-Like-a-Human", a multi-agent grading system that divides the grading process into three stages modeled on human grading practice: rubric generation, grading, and post-grading review. The system incorporates students' answers into rubric design, optimizes rubrics through sampling-based iterative generation (a sketch follows below), and assigns a specialized agent to each stage to improve the accuracy, consistency, and fairness of the resulting grades. The paper also introduces the OS dataset for evaluating the system and discusses its limitations, including domain specificity, time efficiency, and token cost. The proposed approach yields significant performance improvements on automated grading tasks.
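The sketch below illustrates one plausible reading of "sampling-based iterative generation" for rubric optimization, reusing the placeholder helpers from the pipeline sketch above. The loop structure (sample a batch of answers, ask the model to revise the rubric, repeat) is an assumption about the technique, not the paper's code.

```python
# Hedged sketch of sampling-based iterative rubric refinement; the loop
# structure is an assumption, not the paper's implementation.
import random


def refine_rubric(question: str, answers: list[str],
                  rounds: int = 3, sample_size: int = 5) -> str:
    k = min(sample_size, len(answers))
    rubric = generate_rubric(question, random.sample(answers, k))
    for _ in range(rounds):
        # Sample a fresh batch of answers and ask the LLM whether the current
        # rubric covers them; revise the rubric where it falls short.
        batch = random.sample(answers, k)
        rubric = call_llm(
            f"Current rubric:\n{rubric}\n\nNew student answers:\n"
            + "\n---\n".join(batch)
            + "\n\nRevise the rubric so it covers the points raised in these "
            "answers. Return the full revised rubric."
        )
    return rubric
```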
Main Finding
The main finding is that the "Grade-Like-a-Human" multi-agent system, which divides grading into rubric generation, grading, and post-grading review, substantially improves the accuracy, consistency, and fairness of LLM-based grading, as demonstrated by comprehensive experiments on the OS and Mohler datasets. The paper also identifies remaining limitations, such as domain specificity, time efficiency, and token cost, providing directions for future research on LLM-based automated grading systems.
Conclusion
The paper concludes that the proposed multi-agent system, "Grade-Like-a-Human", systematically addresses the entire grading procedure, from rubric development through accurate scoring to post-grading review, and thereby enhances the performance of LLMs on automated grading tasks. Its effectiveness is demonstrated through extensive experiments on the new OS dataset and the widely used Mohler dataset, yielding insights for building LLM-based automated grading systems.
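For context on how such experiments are typically scored: automated-grading outputs on datasets like Mohler are conventionally compared against human grades using Pearson correlation and RMSE. Whether these are the exact metrics reported in the paper is an assumption; the sketch below only shows the standard computation.

```python
# Standard metrics for comparing system scores with human grades.
# Whether the paper uses exactly these metrics is an assumption.
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rmse(xs: list[float], ys: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Usage: pearson(system_scores, human_scores), rmse(system_scores, human_scores)
```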
Keywords
Large language models, Automated grading, Rubric generation, Grading review, Multi-agent systems, OS dataset, Mohler dataset, Natural language processing, Educational technology, Academic assessment.