Accelerate the Construction of Chinese Training Data Corpus of AI Large Models  Cited by: 8

Original title: 加快建设人工智能大模型中文训练数据语料库


Author: Zhang Linghan (张凌寒)

Affiliation: [1] Institute of Data Law, China University of Political Science and Law (中国政法大学数据法治研究院)

Source: Frontiers (《学术前沿》), 2024, No. 13, pp. 57-71 (15 pages)

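Funding: Interim research result of the National Social Science Fund of China key project "Research on the Legal Positioning and Tiered Governance of Generative Artificial Intelligence" (Project No. 23AFX009)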

Abstract: The development of the AI large model industry rests on three elements: algorithms, computing power, and data. Of these, the quality of the training data corpus directly determines the capability of AI large models. The total volume of Chinese-language corpus data falls far short of that of English-language corpus data. Further obstacles include the high legal risk of data collection, insufficient opening and use of public data, an uncoordinated copyright regime for offline structured data, and the inability to determine data ownership for commercially procured and cooperatively obtained data. Together these have become institutional bottlenecks constraining the development of artificial intelligence. To develop China's AI large model industry, judicial precedents can clarify the conditions for recognizing the legality of online data sources; copyright rules can be coordinated to define the institutional boundaries of reasonable use of offline data; open-access mechanisms can be built so that public data can contribute to corpus construction; and rules for cross-domain data circulation and trading can be jointly advanced to establish supply incentives. Removing institutional barriers on these multiple fronts would meet the needs of industrial development.

Keywords: AI large models; training data; corpus construction; copyright system; public data

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
