Detecting Chinese Sarcasm with Hyperbolic Representation (original Chinese title: 融入夸张表征的中文反讽识别方法)

Authors: Li Shuyu; Zhu Guangli [1,3]; Li Jiawei; Duan Wenjie; Zhou Ruotong; Zhang Shunxiang

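Affiliations: [1] School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China; [2] School of Computer, Huainan Normal University, Huainan 232038, China; [3] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China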

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2025, No. 2, pp. 1-11 (11 pages)

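Funding: This work is one of the research outcomes of the National Natural Science Foundation of China, General Program (Grant No. 62076006); the Open Project of the State Key Laboratory of Cognitive Intelligence (Grant No. COGOS-2023HE02); and the University Synergy Innovation Program of Anhui Province (Grant No. GXXT-2021-008).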

Abstract: [Objective] To address feature sparsity in Chinese sarcastic short texts, this paper proposes a sarcasm detection method that integrates hyperbolic (that is, exaggeration-related) representations, mining the hyperbole cues in short texts to improve the accuracy of Chinese sarcasm detection. [Methods] First, pointwise mutual information and semantic similarity computation are used to obtain sets of co-occurring word pairs, interjections, and degree adverbs related to the sarcasm domain, and these sets are merged to build a hyperbolic representation lexicon. Then, regular expressions are matched against the sarcastic texts to extract sequences of special punctuation marks, which are one-hot encoded to produce special punctuation features. The RoBERTa-wwm-ext model extracts semantic features from the text, and the WoBERT model converts the words and word pairs in the hyperbolic representation lexicon into dynamic word vectors, yielding the hyperbolic representation. Finally, an improved multi-head attention mechanism attends jointly to the text semantic features, the hyperbolic representation, and the special punctuation features, and a Softmax function produces the recognition result. [Results] In experiments on the merged publicly available Ciron and ChineseSarcasm-Corpus datasets, the proposed method achieves an accuracy of 81.49% and an F1 score of 81.24%. [Limitations] The constructed hyperbolic representation lexicon depends on corpus quality and has limited generalization ability. [Conclusions] By mining the hyperbolic representations present in Chinese sarcastic short texts and combining them with text semantic information, the proposed method effectively enriches the semantic representation of the text and improves the accuracy of Chinese sarcasm detection.
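The two lexical preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several assumptions that the abstract does not confirm: that pointwise mutual information is the measure used to score the co-occurring word pairs, that co-occurrence is counted at the document level over pre-tokenized texts, and that the special punctuation inventory is the four classes listed in `PUNCT_CLASSES`. The function names, the `min_count` threshold, and the regular expression are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def pmi_word_pairs(docs, min_count=2):
    """Score word pairs by PMI = log2(p(a, b) / (p(a) * p(b))).

    `docs` is a list of token lists. Probabilities are document
    frequencies divided by the number of documents (an assumption;
    the paper may count co-occurrence within a sliding window).
    """
    n_docs = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))  # count each word once per document
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), n_ab in pair_df.items():
        if n_ab < min_count:  # skip rare pairs, whose PMI is unreliable
            continue
        p_ab = n_ab / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical punctuation inventory; the abstract does not list the real one.
PUNCT_CLASSES = ["!", "?", "…", "~"]
# Match runs of one punctuation class, in full-width or half-width form.
PUNCT_RE = re.compile(r"[!！]+|[?？]+|…+|\.{3,}|[~～]+")
# Map full-width and ASCII variants onto the class labels above.
_CANONICAL = {"！": "!", "？": "?", ".": "…", "～": "~"}


def punctuation_one_hot(text):
    """Return one one-hot vector per matched punctuation run, in text order."""
    vectors = []
    for match in PUNCT_RE.finditer(text):
        first = match.group()[0]
        label = _CANONICAL.get(first, first)
        vec = [0] * len(PUNCT_CLASSES)
        vec[PUNCT_CLASSES.index(label)] = 1
        vectors.append(vec)
    return vectors
```

For a text such as "真是太棒了！！！又加班？？", `punctuation_one_hot` would return one vector for the run of exclamation marks and one for the run of question marks. High-PMI pairs from `pmi_word_pairs` would be candidates for the lexicon; how the paper filters them and how the resulting features enter the attention layer is not specified in the abstract.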

Keywords: Chinese sarcasm; domain lexicon; hyperbolic representation; RoBERTa-wwm-ext; multi-head attention mechanism

Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]
