Lexical Complexity Prediction Research for Multiple Domains (面向多领域的词汇复杂度评估研究)


Authors: Li Gang [1], Huang Jianfei, Mao Jin [1] (Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China; School of Information Management, Wuhan University, Wuhan 430072, China)

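Affiliations: [1] Center for Studies of Information Resources, Wuhan University, Wuhan 430072; [2] School of Information Management, Wuhan University, Wuhan 430072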

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2024, No. 7, pp. 44-55 (12 pages)

Funding: One of the research outcomes of a Major Project of the National Social Science Fund of China (Grant No. 22&ZD326).

Abstract: [Objective] To explore ways of integrating different corpora so as to improve overall performance in lexical complexity assessment. [Methods] This study proposes a multi-domain lexical complexity assessment model. A feature generalization module adapts the model to different domains, and the model learns lexical complexity prediction in a downstream fine-tuning task. A feature fusion module explores the value of combining hand-crafted features with deep features extracted by neural networks. [Results] On the LCP-2021 dataset, compared with the best publicly reported results, the model improved the Pearson correlation coefficient, MAE, and MSE by 0.0148, 0.0017, and 0.0004, respectively, while the Spearman correlation coefficient and the R2 coefficient decreased by 0.0038 and 0.0255, respectively. Integrating hand-crafted features produced no noticeable change. After a second transfer to the CWI-2018 dataset, the model improved MAE in the three domains by 0.0086, 0.0209, and 0.0174, respectively, compared with the public baseline results. [Limitations] Hand-crafted and deep features were integrated by vector concatenation, which did not fully fuse the different feature types. The choice of algorithm in designing the feature generalization module has certain limitations. Constructing a comprehensive dataset remains to be explored. [Conclusions] Integrating different corpora helps improve the model's overall assessment performance in new domains.
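The abstract reports five evaluation metrics (Pearson, Spearman, MAE, MSE, R2) but this record does not include the authors' evaluation code. The following is a minimal sketch of how these metrics are commonly computed from gold and predicted complexity scores, assuming their standard textbook definitions. The function names and the sample scores are hypothetical and are not taken from the paper or from the LCP-2021 or CWI-2018 datasets.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical complexity scores in [0, 1], for illustration only.
gold = [0.10, 0.25, 0.40, 0.55, 0.70]
pred = [0.15, 0.20, 0.45, 0.50, 0.75]

for name, fn in [("Pearson", pearson), ("Spearman", spearman),
                 ("MAE", mae), ("MSE", mse), ("R2", r2)]:
    print(f"{name}: {fn(gold, pred):.4f}")
```

Because MAE and MSE are error measures, an "improvement" in them means a lower value, whereas an improvement in Pearson, Spearman, or R2 means a higher value. The sketch also shows why the metrics can move in different directions, as in the reported results: Spearman depends only on the ordering of the predictions, while Pearson, MAE, MSE, and R2 depend on their actual values.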

Keywords: Multi-domain; Lexical complexity; Domain generalization; Feature fusion

Classification No.: G350 [Cultural Sciences: Information Science]

 
