Urdu Topic Modeling Using Transformer-Based Attention Networks



dc.contributor.author Sabia Khan, 01-241211-008
dc.date.accessioned 2023-09-08T11:01:36Z
dc.date.available 2023-09-08T11:01:36Z
dc.date.issued 2023
dc.identifier.uri http://hdl.handle.net/123456789/16158
dc.description Supervised by Dr. Raja Muhammad Suleman en_US
dc.description.abstract Urdu, the national language of Pakistan and one of the most widely spoken languages of the Indian subcontinent, is considered a low-resourced language owing to the scarcity of available digital resources. Natural Language Processing (NLP) offers many tasks that can help automate the understanding and generation of text in such languages. Topic Modeling is one such task, aimed at discovering topics (themes of discussion) within unstructured text. For Urdu, most researchers have focused on Latent Dirichlet Allocation (LDA), a statistical topic modeling technique. Such techniques are useful for low-resourced languages because they require relatively little data to train. However, Transformer-based models have become the recent state of the art for many NLP tasks, including Topic Modeling, and have seen wide adoption thanks to the availability of a large number of pre-trained multilingual models. To the best of our knowledge, no research has exploited these Transformer-based models to perform topic modeling for the Urdu language. Through this research we analyze and compare two Topic Modeling techniques, LDA and a Transformer-based approach (multilingual BERT), on the basis of their performance, coherence scores, and topic generation. Our results show that the transformer-based model returns a higher coherence score than the LDA model, meaning that the topics it generates are more interpretable by humans. en_US
dc.language.iso en en_US
dc.publisher Software Engineering, Bahria University Engineering School Islamabad en_US
dc.relation.ispartofseries MS(SE);T-2381
dc.subject Software Engineering en_US
dc.subject LDA steps en_US
dc.subject Similarity matrix BERT vs LDA en_US
dc.title Urdu Topic Modeling Using Transformer-Based Attention Networks en_US
dc.type MS Thesis en_US
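
The abstract above compares LDA topics and multilingual-BERT topics by their coherence scores. As a rough illustration of how such a comparison is typically run, here is a minimal sketch using gensim's CoherenceModel. The toy corpus, topic word lists, and hyperparameters are illustrative assumptions, not the thesis's actual data or settings; the transformer-side topic words are assumed to come from clustering multilingual BERT embeddings (e.g., via a BERTopic-style pipeline), which is one common way to realize the approach the abstract names.

# A minimal sketch of the LDA-vs-transformer coherence comparison described
# in the abstract. The toy corpus, topic word lists, and hyperparameters are
# illustrative assumptions, not the thesis's actual data or settings.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Hypothetical pre-tokenized Urdu documents (one token list per document).
docs = [
    ["حکومت", "معیشت", "بجٹ", "ٹیکس"],
    ["کرکٹ", "میچ", "ٹیم", "کھلاڑی"],
    ["معیشت", "ٹیکس", "حکومت", "قرض"],
    ["ٹیم", "کھلاڑی", "کرکٹ", "اسکور"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Statistical baseline: train LDA and score its topics with c_v coherence.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10,
               random_state=0)
lda_cv = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                        coherence="c_v").get_coherence()

# Transformer side: we assume `bert_topics` holds the top words of each
# topic derived from multilingual BERT embeddings (e.g., via BERTopic),
# scored with the same measure for a like-for-like comparison.
bert_topics = [
    ["معیشت", "حکومت", "ٹیکس"],
    ["کرکٹ", "ٹیم", "کھلاڑی"],
]
bert_cv = CoherenceModel(topics=bert_topics, texts=docs,
                         dictionary=dictionary,
                         coherence="c_v").get_coherence()

print(f"LDA c_v coherence:  {lda_cv:.3f}")
print(f"BERT c_v coherence: {bert_cv:.3f}")

Scoring both models with the same coherence measure (c_v here) over the same corpus is what makes the comparison meaningful: a higher score indicates topic word sets that co-occur more consistently, which is the interpretability claim the abstract makes.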

