Research on Automatic Scoring of Subjective Questions under Large Language Model Fine-tuning

Authors: Chang Zhenghui, Zhu Danhao [2], Gong Pengfei [1]

Affiliations: [1] Modern Educational Technology Center, Jiangsu Police Institute, Nanjing, Jiangsu 210031; [2] Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing, Jiangsu 210031

Source: Examinations Research (《考试研究》), 2025, No. 2, pp. 96-108 (13 pages)

Funding: Jiangsu Police Institute teaching reform project "Research on New Forms of University Teaching Resource Platforms in the Digital Era" (2023B14)

Abstract: With the advancement of large language model technology, Decoder-Only pre-trained models, known for their strong language understanding and enhanced text generation abilities, have opened new approaches to the automatic scoring of subjective questions. Applying large language models to this task is a significant step in educational innovation in the new era. After data cleaning and preprocessing, the scoring task is divided into four subtasks: scoring criteria analysis, student response scoring, score summarization, and total score calculation. We manually annotated 1,000 high-quality fine-tuning entries for automatic subjective question scoring and 100 test entries. Qwen-7B-Chat was chosen as the base model; under limited computing power, it can be fine-tuned using the LoRA method combined with DeepSpeed distributed training. The model was fine-tuned on the 1,000 training entries and its performance was tested on the separate set of 100 test entries. The experimental results show that with a Decoder-Only large language model, high accuracy can be achieved even under limited computing power (two NVIDIA 3090Ti GPUs) and with a small amount of fine-tuning data: the model's average score difference is only 0.061, and the Pearson correlation coefficient reaches 0.952, far exceeding the un-fine-tuned base model Qwen-7B-Chat and GPT. This study demonstrates that, with further technological advances and optimization, Decoder-Only pre-trained models may play a larger role in many educational scenarios, not only improving scoring efficiency and accuracy, but also providing more intelligent solutions for educational assessment and teaching feedback.
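The abstract describes a concrete pipeline: LoRA fine-tuning of Qwen-7B-Chat with DeepSpeed distributed training, followed by evaluation via average score difference and Pearson correlation. The sketch below is a rough illustration of that pipeline on the Hugging Face transformers/peft stack with scipy; the LoRA hyperparameters, training arguments, and the file name ds_config.json are illustrative assumptions, not values reported in the paper, and "average score difference" is interpreted here as mean absolute difference.

    # Minimal sketch, assuming the HF transformers + peft stack.
    # Hyperparameters and file names are illustrative assumptions,
    # not values reported in the paper.
    import numpy as np
    from scipy.stats import pearsonr
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from peft import LoraConfig, TaskType, get_peft_model

    MODEL_NAME = "Qwen/Qwen-7B-Chat"

    # The paper decomposes scoring into four generation subtasks:
    # scoring criteria analysis, student response scoring,
    # score summarization, and total score calculation.

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)

    # LoRA freezes the 7B base weights and trains small low-rank adapters,
    # which is what makes fine-tuning feasible on two consumer GPUs.
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # assumed adapter rank
        lora_alpha=32,              # assumed scaling factor
        lora_dropout=0.05,
        target_modules=["c_attn"],  # Qwen's fused QKV projection layer
    )
    model = get_peft_model(model, lora_config)

    # DeepSpeed is enabled by pointing the trainer at a ZeRO JSON config;
    # "ds_config.json" is a placeholder name. These arguments would be
    # passed to a transformers Trainer along with the 1,000 annotated
    # examples and the tokenizer.
    training_args = TrainingArguments(
        output_dir="qwen7b-scoring-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=1e-4,
        deepspeed="ds_config.json",
    )

    def evaluate(pred_scores, gold_scores):
        """Mean absolute score difference and Pearson r against human scores."""
        pred, gold = np.asarray(pred_scores), np.asarray(gold_scores)
        mean_diff = float(np.mean(np.abs(pred - gold)))
        r, _ = pearsonr(pred, gold)
        return mean_diff, float(r)

On the paper's 100-item test set, this evaluation would reduce to a single call such as evaluate(model_scores, human_scores), with a small mean difference (0.061 reported) and a high Pearson correlation (0.952 reported) as the success criteria.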

Keywords: automatic scoring of subjective questions; large language models; Decoder-Only; Qwen-7B-Chat model

CLC Number: G424.74 [Culture and Science / Curriculum and Teaching Theory]

 
