Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data  

在线阅读下载全文

作  者:WO Jia ZHANG Chongliang XU Binduo XUE Ying REN Yiping 

机构地区:[1]College of Fisheries,Ocean University of China,Qingdao 266003,China [2]Laboratory for Marine Science and Food Production Processes,National Laboratory for Marine Science and Technology,Qingdao 266237,China

出  处:《Journal of Ocean University of China》2020年第3期659-668,共10页中国海洋大学学报(英文版)

基  金:provided by the Marine S&T Fund of Shandong Province for Pilot National Laboratory for Marine Science and Technology (Qingdao) (No.2018SDKJ0501-2)。

摘  要:Clustering is a group of unsupervised statistical techniques commonly used in many disciplines. Considering their applications to fish abundance data, many technical details need to be considered to ensure reasonable interpretation. However, the reliability and stability of the clustering methods have rarely been studied in the contexts of fisheries. This study presents an intensive evaluation of three common clustering methods, including hierarchical clustering(HC), K-means(KM), and expectation-maximization(EM) methods, based on fish community surveys in the coastal waters of Shandong, China. We evaluated the performances of these three methods considering different numbers of clusters, data size, and data transformation approaches, focusing on the consistency validation using the index of average proportion of non-overlap(APN). The results indicate that the three methods tend to be inconsistent in the optimal number of clusters. EM showed relatively better performances to avoid unbalanced classification, whereas HC and KM provided more stable clustering results. Data transformation including scaling, square-root, and log-transformation had substantial influences on the clustering results, especially for KM. Moreover, transformation also influenced clustering stability, wherein scaling tended to provide a stable solution at the same number of clusters. The APN values indicated improved stability with increasing data size, and the effect leveled off over 70 samples in general and most quickly in EM. We conclude that the best clustering method can be chosen depending on the aim of the study and the number of clusters. In general, KM is relatively robust in our tests. We also provide recommendations for future application of clustering analyses. This study is helpful to ensure the credibility of the application and interpretation of clustering methods.

关 键 词:hierarchical cluster K-means cluster expectation-maximization cluster optimal number of clusters stability data transformation 

分 类 号:S932[农业科学—渔业资源]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象