Machine Learning Enables the Prediction of Amide Bond Synthesis Based on Small Datasets


Authors: Xinghai Li; Zhisen Wu; Lijing Zhang; Shengyang Tao [1]

Affiliation: [1] School of Chemistry, State Key Laboratory of Fine Chemicals, Frontier Science Center for Smart Materials, Dalian Key Laboratory of Intelligent Chemistry, Dalian University of Technology, Dalian 116024, Liaoning Province, China

Source: Acta Physico-Chimica Sinica, 2025, Issue 2, pp. 81-89 (9 pages)

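Funding: National Natural Science Foundation of China (22072011, 22372025, 22211530456); Fundamental Research Funds for the Central Universities (DUT22LAB607, DUT22QN226); 1912 Project of the Chinese Aeronautical Establishment.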

Abstract: Machine learning (ML) shows important application prospects in molecular synthesis. However, accurate ML prediction depends on large amounts of experimental data, and obtaining thousands of experimental data points by traditional experimental methods remains a major challenge. Obtaining an acceptable predictive model from a small dataset is therefore an important problem that the field urgently needs to solve. This study constructed a dataset of 1152 reactions, used a large number of chemically meaningful feature descriptors, and obtained effective predictions through multidimensional data analysis, demonstrating that ML algorithms trained on small datasets can reliably predict the conversion rate of amide bond synthesis. The prediction accuracy of six ML algorithms was compared, and random forest showed outstanding predictive performance (R² > 0.95). In addition, when predicting the conversion rate of unknown aromatic amine molecules, adding a small amount of reaction data for the unknown molecule to the training set significantly improved prediction accuracy even though the dataset was small, revealing a way to obtain good predictions from small datasets. This study provides a reference for ML-assisted chemical synthesis research on small datasets. In the near future, machine learning will strongly promote the intelligent development of organic synthetic chemistry.

Machine learning (ML) is progressively revealing notable advantages in chemical synthesis. However, the limited output of experimental data from traditional methods poses a bottleneck, impeding the widespread adoption of machine learning. Data from the literature often lead to overly optimistic predictions, and obtaining thousands of experimental data points through experiments remains a substantial challenge. Using a small dataset of experimental data, we illustrated that machine learning algorithms can reliably predict the conversion rate of amide bond synthesis. We gathered hundreds of experimental data points for 9 aromatic amines and 12 organic acids using various coupling reagents and solvents in a 96-well plate high-throughput experimental setup. Subsequently, we derived 76 molecular feature descriptors from quantum chemical calculations and used them as inputs for training the machine learning model. Despite the inherent limitation of low data volume, the random forest algorithm demonstrated outstanding predictive performance (R² > 0.95). Through comprehensive analysis of the reaction process using importance analysis, Shapley additive explanations (SHAP), and accumulated local effects (ALE) methods, we examined the important factors influencing the reaction conversion rate. In predicting the conversion rate of unknown aromatic amine molecules, we discovered that incorporating a small amount of reaction data related to the unknown molecule into the training set effectively enhances the model's predictive performance, even with a small dataset. By comparing models trained on different molecular descriptors, such as density functional theory (DFT) descriptors and one-hot encoding, we validated the efficacy of adjusting the training set to improve prediction results. This study used a multitude of chemically meaningful feature descriptors and achieved more effective prediction results through multidimensional data analysis, offering valuable insights for machine learning-assisted chemical synthesis research on small datasets. In the near future...

Keywords: amide bond synthesis; machine learning; feature descriptors; random forest algorithm; small datasets

Classification number: O643 [Science: Physical Chemistry]
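
The abstract reports that moving a few reactions of an "unknown" aromatic amine into the training set improves predictions for that amine, and it scores models by R². The following is only a minimal sketch of that data split and of the R² metric, not the authors' pipeline: the record layout, the function names `split_with_seed_reactions` and `r2_score`, the number of seed reactions, and the random conversion values are all assumptions made for illustration. Only the counts of 9 amines and 12 acids come from the abstract. No model is trained here, and the DFT descriptors, coupling reagents, and solvents are left out.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def split_with_seed_reactions(records, unknown_amine, n_seed, rng):
    """Hold out one amine as the 'unknown' molecule, then move n_seed of its
    reactions into the training set; the rest form the test set."""
    known = [r for r in records if r["amine"] != unknown_amine]
    unknown = [r for r in records if r["amine"] == unknown_amine]
    rng.shuffle(unknown)  # choose the seed reactions at random
    train = known + unknown[:n_seed]
    test = unknown[n_seed:]
    return train, test


# Synthetic placeholder data (hypothetical): one record per amine/acid pair,
# with a random number standing in for the measured conversion rate.
rng = random.Random(0)
records = [
    {"amine": f"amine_{a}", "acid": f"acid_{c}", "conversion": rng.random()}
    for a in range(9)
    for c in range(12)
]

train, test = split_with_seed_reactions(records, "amine_0", n_seed=3, rng=rng)
print(len(train), len(test))  # 99 9
```

Setting `n_seed=0` gives the strict hold-out case in which the model never sees the unknown amine. Comparing `r2_score` on the test set for `n_seed=0` and for a small positive `n_seed` would be one way to measure the effect the abstract describes.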

 
