Design and Implementation of a POI Data Online System Based on NLP


Authors: Zhang Xianrong; Zheng Guijun (School of Software Engineering, University of Science and Technology of China, Hefei 230051, Anhui, China)

Affiliation: [1] School of Software Engineering, University of Science and Technology of China, Hefei 230051, Anhui, China

Source: Computer Applications and Software (《计算机应用与软件》), 2020, No. 12, pp. 17-25 (9 pages)

Abstract: Comprehensive, rich point-of-interest (POI) data directly affects the location-based services offered by map app vendors. To address the long cycle and slow speed of traditional POI data collection and release, this paper proposes an efficient way to collect POI data and bring it online. The work of bringing data online is divided into data collection, data formatting, and data deduplication and storage. The data collection module uses a load-balanced distributed web crawler. The data formatting module handles the inconsistent formats of the raw data produced by the collection module. The deduplication module computes the similarity between the names of new and existing records and combines it with the distance computed from their longitude and latitude to decide whether a record is a duplicate. The deduplication model is built by combining Word2Vec with Siamese-LSTM and reaches an accuracy of 93.5%.
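The deduplication rule described in the abstract (name similarity combined with geographic distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's learned Word2Vec and Siamese-LSTM similarity model is replaced here by a plain string-similarity ratio from Python's standard library, and the function names, record fields, and both thresholds are hypothetical values chosen only for illustration.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def name_similarity(a, b):
    """Stand-in similarity in [0, 1]; the paper uses a learned model instead."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(new_poi, old_poi, sim_threshold=0.8, max_distance_m=200.0):
    """Treat two records as the same POI only if names match AND they are close.

    Both thresholds are assumptions for this sketch, not values from the paper.
    """
    sim = name_similarity(new_poi["name"], old_poi["name"])
    dist = haversine_m(new_poi["lat"], new_poi["lon"],
                       old_poi["lat"], old_poi["lon"])
    return sim >= sim_threshold and dist <= max_distance_m


if __name__ == "__main__":
    existing = {"name": "Starbucks Coffee", "lat": 31.8206, "lon": 117.2272}
    incoming = {"name": "Starbucks Coffee", "lat": 31.8207, "lon": 117.2273}
    print(is_duplicate(incoming, existing))
```

Requiring both conditions keeps chain stores with identical names at different locations from being merged, and keeps differently named neighbours in the same building apart.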

Keywords: data collection; data deduplication; POI data; Word2Vec; Siamese-LSTM; short-text similarity

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]

 
