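Author: Dai Junhao (戴均豪)
Affiliation: [1] China Railway First Survey and Design Institute Group Co., Ltd., Xi'an 710043
Source: Technology Innovation and Application (《科技创新与应用》), 2022, No. 35, pp. 89-92 (4 pages)
Funding: China Railway Construction Corporation Major Special Project (2021-A02).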
Abstract: As railway engineering geological work has progressed, a large volume of related text material has accumulated. Because such text is unstructured and not directly interpretable, it is difficult to use efficiently in informatization efforts. To convert these texts into a form that a computer can read directly, this paper collects literature, reports, specifications, manuals, and other text types from the railway engineering geology domain, uses the Jieba Chinese word segmentation library to build a railway engineering geology corpus of 4,192,189 words, and uses the Word2vec model to embed the segmented unstructured text into a word vector space, turning it into numerical values that carry semantic information. Tests based on dimensionality-reduction visualization, clustering, and semantic similarity calculation show that the corpus and the word vectors trained on it effectively capture semantic information, providing a data foundation for semantic analysis, entity recognition, and knowledge graph construction in railway engineering geology.
Classification No.: TP391.1 [Automation and Computer Technology, Computer Application Technology]
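
The semantic similarity check mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the paper's pipeline: plain co-occurrence counts stand in for trained Word2vec embeddings, the toy English sentences are hypothetical and already tokenized (in the paper, Jieba performs the segmentation on Chinese text), and the names `cooccurrence_vectors` and `cosine` are illustrative only. It shows the underlying idea that words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, pre-segmented toy sentences (assumption: a segmenter such as
# Jieba would normally produce these token lists from raw text).
sentences = [
    ["granite", "is", "hard", "rock"],
    ["basalt", "is", "hard", "rock"],
    ["clay", "is", "soft", "soil"],
    ["silt", "is", "soft", "soil"],
]


def cooccurrence_vectors(sentences, window=2):
    """Build sparse context-count vectors: word -> Counter of neighbouring words."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(sentences)
sim_rock = cosine(vectors["granite"], vectors["basalt"])   # two rock names
sim_mixed = cosine(vectors["granite"], vectors["clay"])    # rock vs. soil
print(f"granite~basalt: {sim_rock:.2f}, granite~clay: {sim_mixed:.2f}")
```

In this toy setup the two rock names share all of their context words and so score higher than the rock and soil pair. A trained embedding model would replace the count vectors with dense learned vectors, but the cosine comparison step is the same.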