Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles.

Sameea Naeem, 01-249201-014

dc.contributor.author	Sameea Naeem, 01-249201-014
dc.date.accessioned	2022-08-04T05:25:46Z
dc.date.available	2022-08-04T05:25:46Z
dc.date.issued	2022
dc.identifier.uri	http://hdl.handle.net/123456789/13008
dc.description	Supervised by Dr. Arif ur Rahman	en_US
dc.description.abstract	Finding similarities between two inter-language news articles is a challenging problem in Natural Language Processing (NLP). Major human activities become news and are uploaded to different news platforms in many different languages. Because it is difficult to find similar news articles in a language other than a person's native language, there is a need for an automatic system that can estimate the similarity between two inter-language news articles. Automatic detection of similarity between two news articles is a difficult task; however, the use of machine learning techniques along with English-Urdu transliterated words can make it easier. For this purpose, this research proposes an ML model combined with English-Urdu word transliteration that indicates whether an English news article is similar to an Urdu news article. Existing approaches to finding similarities have a major drawback when the archives contain articles in low-resourced languages like Urdu along with English news articles. This research uses a lexicon to link Urdu and English news articles. A literature review shows that very few researchers have worked on Urdu and English news articles, so the first step is to build an Urdu-English lexicon or dictionary. The second step is Urdu text tokenization, because Urdu text is difficult to split into word segments. The main focus of this research is the Urdu-English transliteration system. Because Urdu language processing applications such as machine translation and text-to-speech are unable to handle English text at the same time, this research proposes a technique to find similarities between English and Urdu news articles based on transliteration.	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Sciences BUIC	en_US
dc.relation.ispartofseries	MS (DS);T-10572
dc.subject	Natural Language Processing	en_US
dc.subject	Human Activities	en_US
dc.title	Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles.	en_US
dc.type	MS Thesis	en_US
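
Illustrative note: the abstract describes linking Urdu and English articles through a lexicon of transliterated words after tokenizing the Urdu text. The sketch below is one minimal way such a pipeline could look, not the thesis's actual method. The lexicon entries, the function names, the whitespace tokenizer, and the overlap score are all assumptions made for illustration; the thesis additionally uses a machine learning model that is not reproduced here.

```python
import re

# Hypothetical toy lexicon: Urdu-script tokens mapped to the English words
# they transliterate. A real lexicon would be far larger.
LEXICON = {
    "پاکستان": "pakistan",
    "کرکٹ": "cricket",
    "ٹیم": "team",
    "میچ": "match",
}

# Punctuation stripped from Urdu tokens (Urdu full stop, comma, question mark,
# plus a few ASCII marks).
URDU_PUNCT = "۔،؟!.,"


def tokenize_english(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tokenize_urdu(text):
    """Naive whitespace tokenizer (an assumption; real Urdu segmentation is harder)."""
    tokens = (tok.strip(URDU_PUNCT) for tok in text.split())
    return [tok for tok in tokens if tok]


def transliterated_terms(urdu_text, lexicon=LEXICON):
    """Return the English forms of the Urdu tokens found in the lexicon."""
    return {lexicon[tok] for tok in tokenize_urdu(urdu_text) if tok in lexicon}


def similarity(english_text, urdu_text, lexicon=LEXICON):
    """Fraction of transliterated Urdu terms that also occur in the English text."""
    urdu_terms = transliterated_terms(urdu_text, lexicon)
    if not urdu_terms:
        return 0.0
    english_terms = tokenize_english(english_text)
    return len(urdu_terms & english_terms) / len(urdu_terms)
```

For example, calling similarity("Pakistan cricket team won the match", "پاکستان کرکٹ ٹیم نے میچ جیت لیا۔") compares the four lexicon terms found in the Urdu sentence against the English tokens. A threshold on this score could serve as a simple similar/not-similar decision, though choosing that threshold is outside what the abstract specifies.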