Authors: YANG Weiya (杨威亚); YU Zhengtao (余正涛) [1,2]; GAO Shengxiang (高盛祥) [1,2]; SONG Ran (宋燃)
Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; [2] Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China
Source: Journal of Computer Applications (《计算机应用》), 2021, No. 10, pp. 2879-2884 (6 pages)
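Funding: National Natural Science Foundation of China (61972196, 61762056, 61472168); Yunnan Provincial Major Science and Technology Special Project (202002AD080001); Yunnan Provincial High-Tech Industry Special Project (201606).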
Abstract: In Chinese-Vietnamese cross-language news topic discovery, Chinese-Vietnamese parallel corpora are scarce, which makes it difficult to train high-quality bilingual word embeddings, and news texts are generally long, so methods based on bilingual word embeddings struggle to represent them well. To address these problems, a Chinese-Vietnamese news topic discovery method based on a Cross-Language Neural Topic Model (CL-NTM) is proposed. The method represents news texts by their topic information and converts bilingual semantic alignment into a bilingual topic alignment task. First, neural topic models based on the variational autoencoder are trained separately for Chinese and Vietnamese to obtain monolingual abstract topic representations. Then, a small-scale parallel corpus is used to map the bilingual topics into the same semantic space. Finally, the K-means method is used to cluster the bilingual topic representations and thereby discover the topics of news event clusters. Experimental results show that, compared with the Improved Chinese-English Latent Dirichlet Allocation model (ICE-LDA), the proposed method improves Macro-F1 and topic coherence by 4 and 7 percentage points respectively, indicating that it effectively improves both the clustering performance and the interpretability of news topics.
Keywords: cross-language; topic alignment; neural topic model; K-means clustering; topic discovery
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
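The last two steps described in the abstract (mapping bilingual topic representations into one space using a small parallel corpus, then clustering with K-means) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the paper's implementation: the topic vectors are synthetic stand-ins for the output of the VAE-based topic models, the alignment is a plain least-squares linear map (the abstract does not say which mapping is used), and the names `fit_linear_map`, `kmeans`, `prototypes`, and `R` are hypothetical.

```python
import numpy as np

def fit_linear_map(src, tgt):
    """Least-squares W with src @ W ~= tgt; rows are parallel document pairs."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for each point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def farthest_first_init(points, k):
    """Deterministic init: start at points[0], then repeatedly take the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.linalg.norm(points[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(points[dists.min(axis=1).argmax()])
    return np.array(centers)

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_first_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(points, centers), centers

# Synthetic stand-ins for monolingual topic representations (assumption).
rng = np.random.default_rng(0)
n_topics, n_per_topic = 3, 20
n_docs = n_topics * n_per_topic
true_topics = np.repeat(np.arange(n_topics), n_per_topic)
prototypes = np.array([[0.90, 0.05, 0.05],
                       [0.05, 0.90, 0.05],
                       [0.05, 0.05, 0.90]])
zh = prototypes[true_topics] + rng.normal(0.0, 0.02, (n_docs, 3))

# The "Vietnamese" space is modelled as an unknown linear distortion of the
# "Chinese" one plus noise (assumption made only to have something to align).
R = np.array([[0.2, 0.7, 0.1],
              [0.6, 0.1, 0.3],
              [0.1, 0.2, 0.7]])
vi = zh @ R + rng.normal(0.0, 0.01, (n_docs, 3))

# Only a small subset of document pairs is treated as the parallel corpus.
parallel = np.arange(0, n_docs, 4)
W = fit_linear_map(vi[parallel], zh[parallel])
vi_aligned = vi @ W

# Cluster both languages jointly in the shared space.
labels, _ = kmeans(np.vstack([zh, vi_aligned]), k=n_topics)
labels_zh, labels_vi = labels[:n_docs], labels[n_docs:]
print("cross-lingual cluster agreement:", np.mean(labels_zh == labels_vi))
```

In this toy setting a document and its counterpart in the other language should land in the same cluster once the map is fitted. On real topic-model output the choice of mapping and the number of clusters would need to be validated, for example with the Macro-F1 and topic-coherence measures the abstract reports.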