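Authors: ZUO Shi-liang (左世亮); LIU Wen-liang (刘稳良) (Shanghai Institute of Technology, Shanghai 201418, China)
Affiliation: [1] Shanghai Institute of Technology (上海应用技术大学), Shanghai 201418, China
Source: Computer Simulation (《计算机仿真》), 2021, No. 8, pp. 344-347, 416 (5 pages)
Funding: Shanghai Municipal Education Commission, 2020 Shanghai University Experimental Technical Staff Development Program (document no. 沪教委人[2020]30号).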
Abstract: To reduce the interference that repeated sentence segments in parallel corpora cause for translation work in a multi-source information setting, and to improve de-duplication efficiency, a de-duplication algorithm for similar sentence segments in parallel corpora is designed based on term frequency-inverse document frequency (TF-IDF). Sentence-level alignment is first established for the parallel corpus: a probability model is designed, the maximum-probability path is selected as the alignment output, and a length-based sentence alignment method is used to determine the translation relationship between language units in the source corpus and the target-language text. Based on the surface lexical features and information entropy of sentence segments, a small number of candidate examples are selected from the multi-source corpora and matched by generalization to obtain the degree of similarity between segments. Word weights are then derived from the topic relevance of words: the word length of technical terms is taken as the premise for distinguishing topic relevance, and word length is fitted with a normal distribution to obtain a keyword weight formula. The weights are used to distinguish the meaning of sentence segments, which completes the de-duplication of similar segments. Experimental results show that the proposed method achieves good de-duplication efficiency, high precision, and wide applicability, and that it opens new opportunities for the business development of language service enterprises.
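Keywords: multi-source information; parallel corpus; similarity; sentence-segment de-duplication; sentence alignment
Classification code: TP351 [Automation and Computer Technology: Computer Systems Architecture]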
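The abstract names TF-IDF weighting and a similarity measure between sentence segments, but it does not give the paper's formulas. As a rough illustration only, the sketch below shows generic TF-IDF cosine-similarity de-duplication on one side of a corpus. It is not the authors' algorithm: the whitespace tokenizer, the smoothed IDF formula, the `threshold` value, and all function names are assumptions made for this example, and the paper's word-length-based keyword weighting and sentence alignment steps are not reproduced.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return sentence.lower().split()


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty sentences
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(sentences, threshold=0.8):
    """Keep a sentence only if it is below `threshold` similarity to every kept one."""
    vectors = tfidf_vectors(sentences)
    kept = []  # indices of sentences retained so far
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in kept]
```

For instance, calling `deduplicate` on "the cat sat on the mat", "The cat sat on the mat", and "parallel corpora need alignment" keeps the first and third sentences, because the first two tokenize identically. The pairwise comparison is quadratic in the number of kept sentences, so a real corpus would likely need blocking or an index first.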