Author: Song Yuling (宋玉玲)
Source: Computer Science and Application (《计算机科学与应用》), 2023, No. 4, pp. 745-753 (9 pages)
Abstract: Corpora are central to natural language processing tasks. The Google Books corpus is by far the largest diachronic corpus and is widely used to evaluate, along temporal and spatial dimensions, the phenomena and patterns that disciplines, languages, and even cultures exhibit in the course of social development. Many scholars have nonetheless questioned it because of recognition problems and metadata problems introduced during its construction. The common remedies today are to extract all possible data from the corpus and to preprocess the raw data, both of which are time-consuming and laborious. This paper proposes recasting the corpus noise problem as a time series anomaly detection problem and performs the detection using traditional time series models and Markov dynamic coding. The experimental results show that the Markov approach not only preserves temporal correlation and frequency structure but also provides a natural inverse operation, mapping the graph back to a time series. This overcomes the shortcomings of traditional time series models and effectively solves the local quality alignment problem of the corpus.