Google Books Corpus Quality Method Based on Markov Dynamic Coding


Author: Song Yuling (宋玉玲)

Affiliation: [1] College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang

Source: Computer Science and Application (《计算机科学与应用》), 2023, No. 4, pp. 745-753 (9 pages)

Abstract: Corpora are central to natural language processing tasks. The Google Books corpus is the largest diachronic corpus to date and is widely used to study, across time and space, phenomena and regularities in how disciplines, languages, and even cultures develop within society. However, many scholars have questioned it because of problems introduced during its construction, such as recognition problems and metadata problems. Common remedies today either extract all possible data from the corpus or preprocess the original data, and both are time-consuming and laborious. This paper recasts the corpus noise problem as a time series anomaly detection problem and applies both traditional time series models and Markov dynamic coding to it. The experimental results indicate that the Markov encoding preserves temporal correlation and frequency structure and also provides a natural inverse operation, namely mapping the graph back to a time series. This overcomes the shortcomings of traditional time series models and effectively addresses the local quality alignment problem of the corpus.
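The abstract's idea of treating corpus noise as time series anomaly detection can be illustrated with a small sketch. The code below is not the paper's implementation, and the abstract does not specify the encoding. It assumes a simple first-order Markov encoding: values (for example, yearly word frequencies) are binned into quantile states, transition probabilities between consecutive states are estimated, and time steps reached through a rare transition are flagged. The function names, the number of states, and the threshold are all hypothetical choices made for illustration.

```python
from bisect import bisect_right


def quantile_edges(series, n_states):
    """Interior bin edges that split the sorted values into roughly equal-sized groups."""
    ordered = sorted(series)
    n = len(ordered)
    return [ordered[(k * n) // n_states] for k in range(1, n_states)]


def encode(series, edges):
    """Map each value to a state index in the range 0..len(edges)."""
    return [bisect_right(edges, x) for x in series]


def transition_matrix(states, n_states):
    """Row-normalised first-order transition probabilities between states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def flag_anomalies(series, n_states=4, threshold=0.05):
    """Indices t whose transition states[t-1] -> states[t] is rarer than threshold.

    Assumption: a rare transition is treated as an anomaly signal. This is an
    illustrative rule, not one taken from the paper.
    """
    edges = quantile_edges(series, n_states)
    states = encode(series, edges)
    probs = transition_matrix(states, n_states)
    return [
        t for t in range(1, len(states))
        if probs[states[t - 1]][states[t]] < threshold
    ]


# Synthetic example: a repeating ramp with one injected spike at index 105.
demo = [i % 20 for i in range(200)]
demo[105] = 19
print(flag_anomalies(demo))  # prints [105, 106]
```

Both the jump into the spike and the jump back out of it are flagged, because each is a transition seen only once in the series. The inverse mapping mentioned in the abstract, from the graph back to a time series, is not sketched here.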

Keywords: Google Books corpus; Markov model; time series anomaly detection

Classification code: H31 [Language and Writing: English]

 
