Improved Collaborative Filtering Algorithm Based on Spark

Cited by: 1


Authors: ZOU Hong-xu, PAN Guan-hua[1], LI Yin[1] (Jiangsu Automation Research Institute of CSIC, Lianyungang 222006, China)

Affiliation: [1] Jiangsu Automation Research Institute, Lianyungang, Jiangsu 222006

Source: Computer Technology and Development, 2020, No. 5, pp. 38-42 (5 pages)

Funding: National Natural Science Foundation of China (61773384).

Abstract: As the volume of Internet data keeps growing, a single machine can neither finish computing a recommendation algorithm over large-scale data within an acceptable time nor store the data itself. Taking advantage of in-memory computing on the Spark platform, a distributed item-based collaborative filtering algorithm is designed and implemented with the RDD (resilient distributed dataset) operators that Spark provides. To address the inaccurate similarity estimates caused by sparse data, a similarity formula weighted by the number of users that two items have in common is proposed, which improves the accuracy of the final recommendations. The computation also involves equi-joins between data tables, which are time-consuming; a user-defined Hash_join function replaces Spark's built-in join operator and improves computational efficiency. The algorithm is tested on the public MovieLens dataset and compared both with the unimproved algorithm and with a single-machine implementation. The results show that the improved algorithm performs better in both accuracy and efficiency.
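The two ideas named in the abstract, a similarity weighted by the number of common users and an equi-join done through a hash table, can be illustrated with a minimal single-machine sketch. The abstract gives neither the exact weighting formula nor the internals of Hash_join, so everything below is an assumption for illustration: the function names `weighted_item_similarity` and `hash_join`, the cosine base similarity, the shrinkage factor min(n, gamma) / gamma, and the default `gamma=5` are hypothetical and are not taken from the paper. Plain Python is used instead of Spark RDDs so that the example stays self-contained.

```python
from collections import defaultdict
from math import sqrt


def weighted_item_similarity(ratings, gamma=5):
    """Item-item cosine similarity, damped when few users rated both items.

    ratings: iterable of (user, item, rating) triples.
    gamma:   hypothetical threshold; pairs with fewer than gamma common
             users have their similarity scaled down proportionally.
    Returns a dict {(item_i, item_j): similarity} with item_i < item_j.
    """
    by_item = defaultdict(dict)  # item -> {user: rating}
    for user, item, rating in ratings:
        by_item[item][user] = rating

    items = sorted(by_item)
    sims = {}
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            i, j = items[a], items[b]
            common = by_item[i].keys() & by_item[j].keys()
            if not common:
                continue
            dot = sum(by_item[i][u] * by_item[j][u] for u in common)
            norm_i = sqrt(sum(by_item[i][u] ** 2 for u in common))
            norm_j = sqrt(sum(by_item[j][u] ** 2 for u in common))
            if norm_i == 0 or norm_j == 0:
                continue
            cosine = dot / (norm_i * norm_j)
            # Assumed weighting: trust the similarity less when the number
            # of common users is small (the paper's formula may differ).
            weight = min(len(common), gamma) / gamma
            sims[(i, j)] = cosine * weight
    return sims


def hash_join(left, right):
    """Equi-join two lists of (key, value) pairs using a hash table.

    Builds the table on `right` (assumed to be the smaller side) and probes
    it once per record of `left`. Returns a list of (key, (lv, rv)) tuples.
    """
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in table.get(key, [])]


ratings = [("u1", "A", 5), ("u1", "B", 5), ("u2", "A", 3),
           ("u2", "B", 3), ("u1", "C", 4)]
sims = weighted_item_similarity(ratings)
joined = hash_join([(1, "a"), (2, "b"), (3, "c")],
                   [(1, "x"), (1, "y"), (3, "z")])
```

In this toy data, items A and B are rated identically by two users, while A and C share only one user. Both pairs have the same raw cosine, but the assumed weighting ranks (A, B) above (A, C). On Spark, a comparable hash join is commonly built by broadcasting the smaller table to the workers and probing it inside a map over the larger RDD, which avoids a shuffle; whether the paper's Hash_join works this way is not stated in the abstract.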

Keywords: collaborative filtering; Spark; sparse data; similarity calculation; equi-join

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
