Roman Urdu News Headline Classification Empowered with Machine Learning  被引量:2

在线阅读下载全文

作  者:Rizwan Ali Naqvi Muhammad Adnan Khan Nauman Malik Shazia Saqib Tahir Alyas Dildar Hussain 

机构地区:[1]Department of Unmanned Vehicle Engineering,Sejong University,Seoul,05006,Korea [2]Department of Computer Science,Lahore Garrison University,Lahore,54000,Pakistan [3]School of Computational Sciences,Korea Institute for Advanced Study,Seoul,02455,Korea

出  处:《Computers, Materials & Continua》2020年第11期1221-1236,共16页计算机、材料和连续体(英文)

基  金:This work is supported by the KIAS(Research Number:CG076601)and in part by Sejong University Faculty Research Fund.

摘  要:Roman Urdu has been used for text messaging over the Internet for years especially in Indo-Pak Subcontinent.Persons from the subcontinent may speak the same Urdu language but they might be using different scripts for writing.The communication using the Roman characters,which are used in the script of Urdu language on social media,is now considered the most typical standard of communication in an Indian landmass that makes it an expensive information supply.English Text classification is a solved problem but there have been only a few efforts to examine the rich information supply of Roman Urdu in the past.This is due to the numerous complexities involved in the processing of Roman Urdu data.The complexities associated with Roman Urdu include the non-availability of the tagged corpus,lack of a set of rules,and lack of standardized spellings.A large amount of Roman Urdu news data is available on mainstream news websites and social media websites like Facebook,Twitter but meaningful information can only be extracted if data is in a structured format.We have developed a Roman Urdu news headline classifier,which will help to classify news into relevant categories on which further analysis and modeling can be done.The author of this research aims to develop the Roman Urdu news classifier,which will classify the news into five categories(health,business,technology,sports,international).First,we will develop the news dataset using scraping tools and then after preprocessing,we will compare the results of different machine learning algorithms like Logistic Regression(LR),Multinomial Naïve Bayes(MNB),Long short term memory(LSTM),and Convolutional Neural Network(CNN).After this,we will use a phonetic algorithm to control lexical variation and test news from different websites.The preliminary results suggest that a more accurate classification can be accomplished by monitoring noise inside data and by classifying the news.After applying above mentioned different machine learning algorithms,results have shown that Multinomia

关 键 词:Roman urdu news headline classification long short term memory recurrent neural network logistic regression multinomial naïve Bayes random forest k neighbor gradient boosting classifier 

分 类 号:H31[语言文字—英语]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象