Design of information classifying and coding for tobacco industry based on hierarchical mutual information clustering


Authors: WANG Yibo; PAN Wei; ZHANG Haitao; JIANG Tao

Affiliations: [1] Tobacco Economy Information Center, State Tobacco Monopoly Administration, Beijing 100045, China [2] Information Center, China Tobacco Hubei Industrial Co., Ltd., Wuhan 430040, China [3] Technology Center, China Tobacco Yunnan Industrial Co., Ltd., Kunming 650231, China

Source: Tobacco Science & Technology, 2024, No. 9, pp. 106-112 (7 pages)

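Funding: Key R&D Project of China National Tobacco Corporation, "Research on Integrated Innovation of New-Generation Information Technology and Cyberspace Governance" (110202102049).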

Abstract: To meet the information classification and coding needs of the National Tobacco Production, Operation and Management Integrated Platform, information systems were decomposed into three types of digital objects: "process, entity, and service". Taking the actual business of the tobacco industry into account, a hierarchical mutual information clustering (HMIC) algorithm was proposed. Text data were processed with natural language processing, the mutual information between digital objects was calculated at each classification level, and a hierarchical clustering algorithm was used to cluster the digital objects, yielding an information classification for the tobacco industry on which the information coding was then based. HMIC was compared with commonly used clustering algorithms. The results showed that: 1) HMIC gave the best information classification, with overall information entropy about 8.2% lower than that of a clustering algorithm using Euclidean distance and about 2.5% lower than that of a clustering algorithm using only the mutual information matrix. 2) Studying classification and coding from the perspective of information content better distinguishes the differences between categories and improves the usability of the classification and coding. The technique can support the construction of information system projects throughout their life cycle.
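The pipeline summarized in the abstract (text features, mutual information between digital objects, hierarchical clustering, then coding) can be illustrated with a minimal sketch. This is not the authors' HMIC implementation: the abstract does not specify how mutual information is combined across classification levels, so the sketch uses a single level. The token sets in `OBJECTS`, the normalized-MI distance, the average-linkage rule, and the four-digit code layout are all assumptions made for illustration.

```python
import math
from itertools import combinations


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(a, b, vocab):
    """MI in bits between the term-presence indicators of token sets a and b.

    Each vocabulary term is one sample: X = (term in a), Y = (term in b).
    Returns (MI, H(X), H(Y)).
    """
    n = len(vocab)
    joint = {}
    for term in vocab:
        key = (term in a, term in b)
        joint[key] = joint.get(key, 0) + 1
    px = {x: sum(c for (i, _), c in joint.items() if i == x) / n for x in (True, False)}
    py = {y: sum(c for (_, j), c in joint.items() if j == y) / n for y in (True, False)}
    mi = sum((c / n) * math.log2((c / n) / (px[x] * py[y])) for (x, y), c in joint.items())
    return mi, entropy(px.values()), entropy(py.values())


def nmi_distance(a, b, vocab):
    """Assumed distance: 1 - MI / sqrt(H(X) * H(Y))."""
    mi, hx, hy = mutual_information(a, b, vocab)
    denom = math.sqrt(hx * hy)
    return 1.0 - mi / denom if denom > 0 else 1.0


def agglomerate(objects, n_clusters):
    """Average-linkage agglomerative clustering on the MI-based distance."""
    vocab = set().union(*objects.values())
    names = sorted(objects)
    dist = {frozenset(p): nmi_distance(objects[p[0]], objects[p[1]], vocab)
            for p in combinations(names, 2)}

    def linkage(c1, c2):
        return sum(dist[frozenset((u, v))] for u in c1 for v in c2) / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (i < j, so deleting j is safe).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def assign_codes(clusters):
    """Hypothetical code layout: 2-digit class index + 2-digit item index."""
    return {name: f"{ci:02d}{ii:02d}"
            for ci, members in enumerate(clusters, 1)
            for ii, name in enumerate(members, 1)}


# Hypothetical digital objects with tokens taken from imagined descriptions.
OBJECTS = {
    "leaf_purchase": {"leaf", "grade", "farmer", "purchase"},
    "leaf_storage": {"leaf", "grade", "farmer", "storage"},
    "make_rolling": {"cigarette", "machine", "batch", "rolling"},
    "make_packing": {"cigarette", "machine", "batch", "packing"},
    "order_entry": {"order", "retailer", "invoice", "price"},
    "order_delivery": {"order", "retailer", "invoice", "delivery"},
}

clusters = agglomerate(OBJECTS, 3)
codes = assign_codes(clusters)
print(clusters)
print(codes)
```

On this toy data the three clusters recover the leaf, manufacturing, and order groups. Mutual information rewards any statistical dependence, including mutual exclusion, so the sketch only behaves sensibly when the vocabulary is large relative to each object's token set; a real system would need a more careful feature design.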

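Keywords: tobacco industry; information classification; information coding; hierarchical mutual information clustering; digital object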

Classification No.: TS46 [Agricultural Sciences: Tobacco Industry]

 
