Abstract:
Finding similarities between news articles written in different languages is a challenging problem in Natural Language Processing (NLP). Major human activities become news and are uploaded to different news platforms in many different languages. Because it is difficult for a person to find similar news articles in a language other than their native one, there is a need for an automatic system that can estimate the similarity between two inter-language news articles. Automatic detection of similarity between two news articles is a difficult task; however, the use of machine learning techniques along with English-Urdu transliterated words can make it easier. For this purpose, this research proposes a machine learning (ML) model combined with English-Urdu word transliteration that indicates whether an English news article is similar to an Urdu news article. Existing approaches to finding similarities have a major drawback when the archives contain articles in low-resource languages such as Urdu alongside English news articles. This research uses a lexicon to link Urdu and English news articles. A literature review shows that very few researchers have worked on Urdu and English news articles, so the first step is to build an Urdu-English lexicon (dictionary). The second step is Urdu text tokenization, because Urdu text is difficult to segment into words. The main focus of this research is the Urdu-English transliteration system. Because Urdu language processing applications such as machine translation and text-to-speech are unable to handle English text at the same time, this research proposes a technique for finding similarities between English and Urdu news articles based on transliteration.
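
As a rough illustration of the pipeline outlined above (tokenize the Urdu text, map tokens to English through a lexicon, then compare with the English article), a minimal sketch might look like the following. Everything in it is an assumption made for illustration: the four-entry `TOY_LEXICON`, the whitespace-and-punctuation tokenizer (real Urdu segmentation is harder than this), and the Jaccard overlap score, which stands in for the ML model and is not the method proposed in this research.

```python
import re

# Hypothetical toy lexicon: Urdu-script token -> English transliteration.
# A real lexicon would be far larger; these entries are illustrative only.
TOY_LEXICON = {
    "پاکستان": "pakistan",
    "کرکٹ": "cricket",
    "ٹیم": "team",
    "میچ": "match",
}


def tokenize_urdu(text):
    """Naive tokenizer: split on whitespace and common Urdu/Latin punctuation."""
    return [tok for tok in re.split(r"[\s۔،؟!.,:;]+", text) if tok]


def tokenize_english(text):
    """Lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def map_to_english(urdu_tokens, lexicon):
    """Replace Urdu tokens with lexicon entries; drop tokens not in the lexicon."""
    return [lexicon[tok] for tok in urdu_tokens if tok in lexicon]


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap of two token lists, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def article_similarity(english_text, urdu_text, lexicon=TOY_LEXICON):
    """Score an English/Urdu article pair in the range [0, 1]."""
    english_tokens = tokenize_english(english_text)
    mapped_tokens = map_to_english(tokenize_urdu(urdu_text), lexicon)
    return jaccard(english_tokens, mapped_tokens)
```

In a fuller system, a score such as the one returned by `article_similarity` could serve as one input feature to a classifier rather than as the final decision.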