Authors: ZHANG Jingwen (张竞文), CUI Shiyao (崔诗尧), ZHANG Xinghua (张兴华), SU Taoyu (苏涛宇), LIU Tingwen (柳厅文) [1,2]
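Affiliations: [1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; [2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, China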
Source: Journal of Integration Technology (《集成技术》), 2024, No. 5, pp. 53-63 (11 pages)
Funding: National Key Research and Development Program of China (2021YFB3100600).
Abstract: In multi-party conversations, identifying the reply-to relation between messages is an important task in the dialogue domain. Existing work has not addressed two issues related to data distribution. First, short messages appear more frequently, yet short texts carry less semantic information, which limits the model's ability to learn. Second, positive samples (message pairs with a reply-to relation) are usually far fewer than negative samples, which skews the data during training and degrades the model's performance on positive samples. To address these two issues, the authors propose an improved model based on a pre-trained language model: dynamic inquiry window modeling mitigates the short-text issue, and position-driven positive sample weight optimization mitigates the positive-sample issue. In comparison with prior work, experiments show that the improved model raises recall by an average of 15.7% over the baseline model based on a pre-trained language model. The authors also construct a new dataset collected from the Telegram platform, which can provide data support for subsequent research.
Keywords: multi-party conversation; reply-to relation identification; inquiry window; data distribution; pre-trained language model
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]