CFGT: A Lexicon-based Chinese Address Element Parsing Model


Authors: HUANG Wei [1,2]; SHEN Yaodi [1,2]; CHEN Songling [1,2]; FU Xiangling [1,2] (黄威 沈耀迪 陈松龄 傅湘玲)

Affiliations: [1] School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China; [2] Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China

Source: Computer Science (《计算机科学》), 2024, No. 9, pp. 233-241 (9 pages)

Funding: National Natural Science Foundation of China (72274022).

Abstract: Address element parsing is a key step in geocoding and directly affects geocoding accuracy. Because Chinese address expressions are diverse and complex, two similar-looking address texts may refer to entirely different geographic locations. Traditional address element parsing based on dictionary matching handles ambiguous words poorly, which lowers recognition accuracy. This paper proposes a lexicon-based Chinese address element parsing model, the Collaborative Flat-Graph Transformer (CFGT), which uses lexical information such as self-matched words and nearest contextual words to enhance the character-sequence representation of address text and thereby reduce the ambiguity of address expressions. Specifically, the model first constructs two collaborative graphs, Flat-Lattice and Flat-Shift, to capture knowledge of self-matched words and nearest contextual words for each address character, and designs a fusion layer that lets the two graphs collaborate. Second, an improved relative position encoding further strengthens the enhancement that word information brings to the character sequence. Finally, a Transformer and a conditional random field (CRF) are used to parse the address elements. Experiments on several public datasets, including Weibo and Resume, and on the private Address dataset show that CFGT outperforms previous Chinese address element parsing models and existing Chinese named entity recognition models.

Keywords: Chinese address recognition; lexicon enhancement; external information; named entity recognition

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]
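
Illustrative sketch (not the authors' code): the abstract mentions a Flat-Lattice graph that attaches self-matched lexicon words to address characters. The Python below shows one common way such a flat lattice is built in earlier lexicon-enhanced Chinese NER work: every character and every lexicon match becomes a token with a head and tail index, and pairs of tokens are related by four head/tail offsets. The toy lexicon, the example string, and the names `build_flat_lattice`, `self_matched_words`, and `span_offsets` are all assumptions made for illustration. The Flat-Shift graph, the fusion layer, and the improved position encoding are not reproduced because the abstract does not specify them.

```python
from typing import List, Set, Tuple

# (text, head index, tail index); both indices are inclusive character positions.
Token = Tuple[str, int, int]


def build_flat_lattice(text: str, lexicon: Set[str], max_len: int = 4) -> List[Token]:
    """Return character tokens followed by every lexicon word found in `text`."""
    tokens: List[Token] = [(ch, i, i) for i, ch in enumerate(text)]
    for start in range(len(text)):
        for length in range(2, max_len + 1):
            end = start + length
            if end > len(text):
                break
            word = text[start:end]
            if word in lexicon:
                tokens.append((word, start, end - 1))
    return tokens


def self_matched_words(tokens: List[Token], char_index: int) -> List[str]:
    """Words (multi-character tokens) whose span covers the given character."""
    return [w for (w, head, tail) in tokens if head < tail and head <= char_index <= tail]


def span_offsets(a: Token, b: Token) -> Tuple[int, int, int, int]:
    """Head/tail offsets between two spans: (hh, ht, th, tt)."""
    _, head_a, tail_a = a
    _, head_b, tail_b = b
    return (head_a - head_b, head_a - tail_b, tail_a - head_b, tail_a - tail_b)


# Hypothetical toy lexicon; a real system would load a large address dictionary.
toy_lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
lattice = build_flat_lattice("南京市长江大桥", toy_lexicon)

# The character "市" (index 2) is covered by two competing words,
# which is exactly the kind of ambiguity the abstract describes.
print(self_matched_words(lattice, 2))  # ['南京市', '市长']
print(span_offsets(("南京市", 0, 2), ("长江大桥", 3, 6)))  # (-3, -6, -1, -4)
```

In a model of this family, the offsets would typically be embedded and fed to self-attention as relative position information, so that each character can attend to the words that contain it or sit next to it.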

 
