A Method of Itinerary Chain Extraction from Online Travel Notes


Authors: RUAN Ling; GE Junlian[2]; ZHANG Ling[2]; WANG Lishu; WANG Xiaoxuan

Affiliations: [1] School of Geography and Tourism, Anhui Normal University, Wuhu 340200, China; [2] School of Geography, Nanjing Normal University, Nanjing 210023, China; [3] Anhui Province Key Laboratory of Physical Geographic Environment, Chuzhou University, Chuzhou 239000, China

Source: Journal of Geo-information Science, 2024, No. 2, pp. 477-487 (11 pages)

Funding: National Natural Science Foundation of China (42301258, 42171403).

Abstract: Online travel notes are self-reported records of trips that tourists publish on the Internet, describing the course of the trip and the traveler's experiences. Extracting itinerary chains from travel-note text and analyzing their structure provides an important reference for tourists planning itineraries and for route design. Traditional itinerary extraction mostly relies on manually identifying the itinerary nodes in the text and then linking and merging them, which is labor-intensive, and some methods proposed in current studies require extensive data annotation. Accurate automatic extraction of itinerary chains improves the efficiency of data processing and analysis, and remains an open problem. Using natural language processing, this paper proposes a syntactic rule-based itinerary chain extraction method that comprises the identification of itinerary nodes, the recognition of node order relations, and the generation of the itinerary chain, so that the itinerary described in a travel note can be reconstructed.

First, the paragraph structure and expression characteristics of itineraries in online travel notes are analyzed. The syntactic expression rules for travel nodes and node order are summarized from word segmentation and dependency parsing of the relevant statements, and a trigger-word list for itinerary nodes is constructed. Second, the travel nodes matched by the syntactic rules are divided into deterministic travel nodes, uncertain travel nodes, and non-travel nodes. Third, node order is recognized from specific itinerary description statements through regular expression and syntactic rule matching. Finally, uncertain travel nodes are resolved by analyzing their context, and sequential and cross-arranged travel nodes are merged and connected in series; the node order in the connected series is then verified and adjusted against the previously recognized node orders to generate the itinerary chain.

To verify the method, 17 226 online travel notes about Nanjing were collected from the Mafengwo platform, and the longest common subsequence algorithm was used for the experimental evaluation. The itinerary chains extracted by the method reach a similarity of 86.14% with the manually identified ground-truth chains, higher than the 83.1% of the BERT-BiLSTM-CasRel deep learning model from the entity relation extraction field. Unlike existing deep learning relation extraction methods, which need large amounts of annotated data, the proposed method is simpler to compute, relatively accurate, and requires only a gazetteer of regional tourist sites to extract itinerary information automatically from online travel notes.
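The abstract states that the longest common subsequence (LCS) algorithm was used to compare extracted itinerary chains with manually identified ones, but it does not give the exact similarity formula. The following is a minimal sketch, assuming similarity is the LCS length divided by the length of the longer chain; that normalization, the function names `lcs_length` and `chain_similarity`, and the sample chains are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Return the length of the longest common subsequence of a and b."""
    # prev[j] is the LCS length between the prefix of a processed so far
    # and the first j elements of b.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def chain_similarity(extracted: Sequence[str], reference: Sequence[str]) -> float:
    """Similarity of two itinerary chains in [0, 1].

    Assumption: normalize the LCS length by the longer chain, so that both
    missing and spurious nodes lower the score.
    """
    longest = max(len(extracted), len(reference))
    if longest == 0:
        return 1.0  # two empty chains are treated as identical
    return lcs_length(extracted, reference) / longest


# Hypothetical chains: two middle nodes are swapped in the extracted result.
extracted = ["Confucius Temple", "Sun Yat-sen Mausoleum",
             "Xuanwu Lake", "Presidential Palace"]
reference = ["Confucius Temple", "Xuanwu Lake",
             "Sun Yat-sen Mausoleum", "Presidential Palace"]

print(chain_similarity(extracted, reference))  # 0.75
```

Because LCS preserves order, a chain that contains all the right nodes in the wrong sequence still scores below 1, which is why it suits itinerary comparison better than a plain set overlap.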

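Keywords: online travel notes; web text; tourism itinerary chain; itinerary reconstruction; itinerary extraction; node identification; rule matching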

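Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]; F592.7 [Automation and Computer Technology: Computer Science and Technology]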

 
