Authors: 翟东海 (ZHAI Donghai)[1]; 侯佳林 (HOU Jialin)[1]; 刘月 (LIU Yue)[1]
Affiliation: [1] School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
Source: Journal of Southwest Jiaotong University, 2019, No. 3, pp. 647-654 (8 pages)
Funding: National Natural Science Foundation of China (61540060); National Soft Science Research Program, Ministry of Science and Technology (2013GXS4D150); Key Project of Science and Technology Research, Ministry of Education (212167)
Abstract: When the training and test sets are large, the semi-supervised recursive autoencoder (Semi-Supervised RAE) model for text sentiment analysis suffers from slow network training and slow output of test results. To address these problems, corresponding parallel algorithms are proposed. For a large training set, a divide-and-conquer strategy is adopted: the data set is first partitioned into blocks, and each block is fed to a Map node that computes the block's error. All block errors are collected in a buffer, from which the Reduce node reads them to compute the optimization objective function. The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is then called to update the parameter set, and the updated parameters are reloaded into the model. These training steps are repeated, progressively optimizing the objective function until it converges, which yields the optimal parameter set. For a large test set, the model is initialized with the parameter set obtained above; Map nodes encode each sentence into its vector representation and temporarily store it in the buffer, and the classifier in the Reduce node then uses each sentence's vector representation to compute its sentiment label. Experiments show that on the standard MR (movie review) corpus the proposed algorithm achieves 77.0% accuracy, nearly the same as the 77.3% of the original algorithm, while on large training sets the training time decreases substantially as the number of compute nodes increases.
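The training loop described in the abstract (Map nodes compute per-block errors, a buffer collects them, a Reduce node aggregates the objective, and the parameters are updated and reloaded until convergence) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: all function names are hypothetical, a one-parameter least-squares model stands in for the RAE, and a plain gradient step stands in for the L-BFGS update the paper uses.

```python
def partition(data, n_blocks):
    """Divide-and-conquer step: split the training set into blocks."""
    k, m = divmod(len(data), n_blocks)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_blocks)]

def map_block_stats(block, w):
    """Map node: squared error and gradient of this block for y ~ w * x."""
    err = sum((w * x - y) ** 2 for x, y in block)
    grad = sum(2 * (w * x - y) * x for x, y in block)
    return err, grad

def reduce_objective(block_stats):
    """Reduce node: aggregate block errors/gradients read from the buffer."""
    total_err = sum(e for e, _ in block_stats)
    total_grad = sum(g for _, g in block_stats)
    return total_err, total_grad

def train(data, n_blocks=4, lr=0.002, tol=1e-9, max_iter=1000):
    w = 0.0  # initial parameter (the real model has a full parameter set)
    for _ in range(max_iter):
        # "buffer": the list of per-block results produced by the Map nodes
        stats = [map_block_stats(b, w) for b in partition(data, n_blocks)]
        _obj, grad = reduce_objective(stats)
        if abs(grad) < tol:   # objective has converged
            break
        # The paper calls L-BFGS here; a plain gradient step is used for brevity.
        w -= lr * grad        # updated parameter is "reloaded" next iteration
    return w

# Toy data with true slope 3.0; the loop should recover it.
data = [(x, 3.0 * x) for x in range(1, 9)]
w = train(data)
```

The per-block Map results are independent of each other, which is what makes the error computation parallelizable across nodes; only the Reduce aggregation and the parameter update are serial.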
Classification: TP391 [Automation and Computer Technology - Computer Application Technology]