Authors: Qin Li, Nihang Fu, Sadman Sadeed Omee, Jianjun Hu
Affiliations: [1] College of Big Data and Statistics, Guizhou University of Finance and Economics, Guiyang, China; [2] Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
Source: npj Computational Materials (listed in Chinese as 计算材料学(英文)), 2024, Issue 1, pp. 638-648 (11 pages)
Funding: Supported in part by the National Science Foundation under grant number 2311202.
Abstract: Materials datasets usually contain many redundant (highly similar) materials due to the tinkering approach historically used in material design. This redundancy skews the performance evaluation of machine learning (ML) models when random splitting is used, leading to overestimated predictive performance and poor performance on out-of-distribution samples. The issue is well known in bioinformatics for protein function prediction, where tools like CD-HIT reduce redundancy by ensuring that no two retained samples have a sequence similarity greater than a given threshold. In this paper, we survey the overestimated ML performance in materials science for material property prediction and propose MD-HIT, a redundancy reduction algorithm for material datasets. Applying MD-HIT to composition- and structure-based formation energy and band gap prediction problems, we demonstrate that, with redundancy control, ML models tend to score lower on test sets than models evaluated on highly redundant data, but the scores better reflect the models' true prediction capability.
Keywords: prediction, property, performance
Classification: TG1 [Metallurgy and Metal Technology: Metal Science]
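The redundancy control described in the abstract can be illustrated with a greedy, CD-HIT-style filter. This is only a minimal sketch under stated assumptions, not the MD-HIT implementation: the abstract does not specify the distance metric or the selection procedure, so the function name `greedy_redundancy_filter`, the use of Euclidean distance on plain feature vectors, and the threshold value are all hypothetical choices made for illustration.

```python
import math


def greedy_redundancy_filter(features, threshold):
    """Return indices of samples kept after greedy redundancy reduction.

    Hypothetical illustration: a sample is kept only if its Euclidean
    distance to every previously kept sample is at least `threshold`.
    The real MD-HIT distance metrics are not given in the abstract.
    """
    kept = []
    for i, x in enumerate(features):
        # Keep x only if it is not too close to any already-kept sample.
        if all(math.dist(x, features[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Toy feature vectors: samples 0 and 1 are near-duplicates, as are 2 and 3.
samples = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.02), (3.0, 0.0)]
kept = greedy_redundancy_filter(samples, threshold=0.1)
print(kept)
```

On this toy input the near-duplicate samples are dropped and one representative of each cluster remains. A train/test split drawn from the filtered set would then avoid placing almost identical samples on both sides of the split, which is the source of the overestimation the abstract describes.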