Incorporating Word Prior Knowledge for MOOCs Course Concept Extraction

Authors: NIE Fan; LIU Dexi [1,2]; ZHANG Zijing; LIU Xiping [1,2]; LIAO Guoqiong [1,2]; WAN Changxuan [1,2]

Affiliations: [1] School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang, Jiangxi 330013, China; [2] Jiangxi Key Laboratory of Data and Knowledge Engineering, Jiangxi University of Finance and Economics, Nanchang, Jiangxi 330013, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2025, No. 1, pp. 101-111, 120 (12 pages)

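Funding: National Natural Science Foundation of China (62272206, 62272205, 62462034); Jiangxi Province Training Program for Academic and Technical Leaders of Major Disciplines, Leading Talent Project (20213BCJL22041); Natural Science Foundation of Jiangxi Province (20212ACB202002, 20242BAB25119); Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ2200501).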

Abstract: Course concepts in the video captions of Chinese Massive Open Online Courses (MOOCs) span a wide range of parts of speech and are strongly domain-specific. To address these characteristics, this paper proposes WPK-MCC, a course concept extraction model that incorporates Word Prior Knowledge (WPK): part of speech, part-of-speech rules, and a lexicon. The model first obtains character representations that carry contextual and part-of-speech information through BERT and character embeddings. It then uses the lexicon to match strings in the window around the current character and builds four word clusters for that character (the character at the beginning, in the middle, or at the end of a word, or forming a word on its own), with part-of-speech rules controlling the contribution weight of each word. In addition, because course concepts tend to recur within a MOOC, WPK-MCC draws on the context of the video caption that contains the current sentence to improve extraction. Experiments on the MoocData dataset show that WPK-MCC reaches an F1 score of 89.42% on course concept entity extraction, outperforming advanced models such as SoftLexicon. Ablation experiments show that the word prior knowledge (part of speech, rules, and lexicon) and the global context information contribute substantially: removing them lowers the F1 score of WPK-MCC by 1.13%.
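
The four word clusters described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_word_clusters`, the `max_len` window size, and the toy lexicon are assumptions made for illustration, and the part-of-speech weighting is left out because the abstract does not specify the rules.

```python
def build_word_clusters(sentence, lexicon, max_len=4):
    """For each character, collect the lexicon words that contain it,
    grouped by the character's position in the word:
    B (beginning), M (middle), E (end), S (single-character word).
    `max_len` is a hypothetical cap on the matching window."""
    clusters = [{"B": set(), "M": set(), "E": set(), "S": set()}
                for _ in sentence]
    n = len(sentence)
    for start in range(n):
        # Try every substring of length 1..max_len starting at `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                clusters[start]["S"].add(word)
            else:
                clusters[start]["B"].add(word)
                clusters[end - 1]["E"].add(word)
                for mid in range(start + 1, end - 1):
                    clusters[mid]["M"].add(word)
    return clusters


# Toy example (hypothetical lexicon): "traversal of a binary tree".
sentence = "二叉树的遍历"
lexicon = {"二叉树", "叉树", "树", "遍历"}
clusters = build_word_clusters(sentence, lexicon)
for char, groups in zip(sentence, clusters):
    print(char, groups)
```

In a full model, each cluster would be turned into a weighted sum of word embeddings and concatenated with the character representation; the sketch stops at the set construction step.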

Keywords: course concept extraction; word prior knowledge; word clusters; global information

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
