Authors: DONG Yanyun (董艳云) [1]; QI Xinyang (祁昕阳); MA Xiaomei (马晓梅) [1]
Affiliation: [1] Xi'an Jiaotong University
Source: Language Testing and Assessment (《语言测试与评价》), 2024, No. 2, pp. 13-30 (18 pages)
Abstract: This study explores the capability of GPT-4 to assess small-sample L2 writing. Taking IELTS Writing Task 2 as an example, it applies a prompt engineering strategy built on six distinct prompt types. By examining data distribution, inter-rater correlation, and inter-rater agreement, the study analyzes step by step how GPT-4's scoring performs on the experimental set under different prompt windows. Two findings emerge. First, the "minimal + criteria + examples" prompt yields the best results, a finding confirmed again on the validation set. Under this optimal prompt, GPT-4's scores show strong agreement and a strong correlation with the examiner's scores. Second, the examiner's comments diverge in information from the scoring criteria and the calibration examples; they are therefore unsuitable as prompt material and may interfere with GPT-4's assessment if included. The study aims to provide empirical support for applying GPT-4 to writing evaluation in educational settings and a foundation for further exploring its implementation in classroom contexts.