Classification method for unbalanced and small sample data in judicial documents  Cited by: 6

Authors: LIANG Yue; LIU Xiaofeng; LI Quanshu; BAI Yanfeng; MA Yinglong [1] (School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China)

Affiliation: [1] School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China

Source: Journal of Computer Applications, 2022, Issue S02, pp. 118-122 (5 pages)

Funding: National Key Research and Development Program of China (2018YFC0831404).

Abstract: In current accusation and appeal work at procuratorates, dispatching cases to the appropriate prosecutorial departments is one of the core tasks. Because the case types are complex, the data are multi-source, heterogeneous and unbalanced, and triage relies heavily on manual judgment, dispatching is laborious and inefficient. To address these problems, a text classification method for class-imbalanced, small-sample accusation data is proposed to assist case dispatching. First, to meet practical prosecution needs, an automated, intelligent processing suite for petition letters is presented: images are extracted from scanned petition letters stored in Portable Document Format (PDF), enhanced, and recognized by Optical Character Recognition (OCR); text features such as summaries and keywords are then extracted with the TextRank algorithm to build a text vector representation based on Bidirectional Encoder Representations from Transformers (BERT). Second, an accusation text classification method based on Virtual Adversarial Training (VAT) and Focal Loss is proposed for class-imbalanced, small-sample data. VAT is introduced during model training to cope with the small number of petition letters and the presence of adversarial samples. Focal Loss is introduced during training to address data imbalance, after stratified sampling has been used to improve dataset quality. The optimized model is compared with a BERT representation model on a real-world accusation dataset. The experimental results show that the classification model based on VAT and Focal Loss reaches an F1 score of 0.85, about 15 percentage points higher than that of the baseline BERT model, indicating good classification performance.
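
The abstract names Focal Loss as the remedy for class imbalance but does not give the formulation or hyperparameters used. As a rough illustration only, the sketch below uses the commonly cited multi-class form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The function names, the default gamma = 2.0, and the optional per-class weights alpha are assumptions made for this example, not settings taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one sample.

    logits: raw class scores for the sample
    target: index of the true class
    gamma:  focusing parameter (assumed default; gamma=0 gives cross-entropy)
    alpha:  optional list of per-class weights (hypothetical values)
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(logits, target):
    """Plain cross-entropy, i.e. focal loss with gamma = 0 and no class weights."""
    return focal_loss(logits, target, gamma=0.0)

# Two toy samples whose true class is 0: one classified confidently, one not.
easy = [4.0, 0.0, 0.0]
hard = [0.0, 1.0, 0.0]
for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f} ratio={fl / ce:.4f}")
```

The factor (1 - p_t)^gamma shrinks the loss of confidently classified samples far more than that of misclassified ones, so training gradients concentrate on the harder examples, which in an unbalanced dataset tend to come from the minority classes.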

Keywords: judicial artificial intelligence; text classification; virtual adversarial training; class imbalance; feature extraction

CLC number: TP389.1 [Automation and Computer Technology: Computer System Architecture]

 
