A Character Segmentation Algorithm for Ancient Chinese Literature Based on the "Water Flow" Model (基于流水模式的古籍文献汉字切分算法)

Cited by: 6

Author: NI Jie (倪劼) [1]

Affiliation: [1] Nanjing Library

Source: Library Tribune (《图书馆论坛》), 2021, No. 9, pp. 141-149 (9 pages)

Abstract: Character segmentation is one of the basic tasks in digitizing ancient Chinese books, and segmenting overlapping and touching characters has long been both a research focus and a difficulty. Research on segmentation techniques can improve the accuracy and applicability of character segmentation and is important for advancing the digitization of ancient books. Based on the characteristics of Chinese characters in ancient texts and borrowing the idea of flowing water, this article proposes a new segmentation method. First, the images of the ancient texts are preprocessed. Next, the projection method and morphological image processing are used to segment the page into columns. Finally, characters are segmented one by one within each column. When characters overlap or touch, a threshold is first used to mark out the region to be segmented; within that region, the trajectory that falling water would follow serves as the basis for the cut. The method is named the water flow algorithm. It was tested on six ancient books with a sample of 14,503 characters in total, giving a final segmentation precision of 99.00%, a recall of 95.62%, and an F value of 97.27%. The experiments show that the water flow algorithm can effectively segment separated, overlapping, and touching characters in different types of ancient books.
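The abstract names two building blocks, projection-based column segmentation and a water-like cutting path for touching characters, without giving their details. The sketch below is only an illustration of those two ideas and is not the author's implementation. It assumes a binarized page (1 for ink, 0 for background), and the function names `column_spans` and `water_path` are hypothetical. The path tracer is a simplified variant of the classic drop-fall heuristic, swapped in here because the record does not describe the article's actual flow rules.

```python
import numpy as np


def column_spans(binary, min_ink=1):
    """Find text columns from the vertical projection profile.

    `binary` is a 2-D array with 1 for ink and 0 for background.
    Returns half-open (start, end) ranges of image columns whose
    ink count is at least `min_ink` (an assumed threshold).
    """
    profile = binary.sum(axis=0)  # ink pixels in each image column
    has_ink = profile >= min_ink
    spans, start = [], None
    for x, flag in enumerate(has_ink):
        if flag and start is None:
            start = x                  # a run of inked columns begins
        elif not flag and start is not None:
            spans.append((start, x))   # the run ended just before x
            start = None
    if start is not None:              # run reaches the right edge
        spans.append((start, len(has_ink)))
    return spans


def water_path(region, start_col):
    """Trace a simplified drop-fall path from the top row to the bottom row.

    Returns one column index per row. At each step the drop moves down one
    row and prefers a background pixel: straight down first, then
    down-left, then down-right. If all three are ink, it keeps its column
    and cuts straight through the stroke.
    """
    rows, cols = region.shape
    x = start_col
    path = [x]
    for y in range(1, rows):
        for nx in (x, x - 1, x + 1):
            if 0 <= nx < cols and region[y, nx] == 0:
                x = nx
                break
        # No background candidate found: x is unchanged, so the cut
        # passes through ink on this row.
        path.append(x)
    return path
```

Because the text in these books runs in vertical columns, a path that separates two stacked characters has to cross the column sideways. Under the assumptions of this sketch, that would mean calling `water_path` on the transposed region (`region.T`) so that the returned indices are row positions along the column width.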

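Keywords: ancient book digitization; Chinese character segmentation; water flow algorithm

Classification number: G255.1 [Cultural Science: Library Science]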

 
