Urdu Topic Modeling Using Transformer-Based Attention Networks



dc.contributor.author Sabia Khan, 01-241211-008
dc.date.accessioned 2023-09-08T11:01:36Z
dc.date.available 2023-09-08T11:01:36Z
dc.date.issued 2023
dc.identifier.uri http://hdl.handle.net/123456789/16158
dc.description Supervised by Dr. Raja Muhammad Suleman en_US
dc.description.abstract Urdu, the national language of Pakistan and one of the most widely spoken languages of the Indian subcontinent, is considered a low-resourced language owing to the scarcity of available digital resources. Natural Language Processing (NLP) offers many tasks that can help automate the understanding and generation of text in such languages. Topic Modeling is one such task, aimed at discovering topics (themes of discussion) within unstructured text. For Urdu, most researchers have focused on Latent Dirichlet Allocation (LDA), a statistical topic modeling technique. Such techniques are useful for low-resourced languages because they require relatively little data to train. However, Transformer-based models have become the recent state of the art for many NLP tasks, including Topic Modeling, and have seen wide adoption thanks to the availability of a large number of pre-trained multilingual models. To the best of our knowledge, no research has exploited these Transformer-based models to perform topic modeling for the Urdu language. Through this research we analyze and compare two Topic Modeling techniques, LDA and a Transformer-based approach (multilingual BERT), on the basis of their performance, coherence scores, and topic generation. Our results show that the transformer-based model returns a higher coherence score than the LDA model, meaning that the topics it generates are more interpretable by humans. en_US
dc.language.iso en en_US
dc.publisher Software Engineering, Bahria University Engineering School Islamabad en_US
dc.relation.ispartofseries MS(SE);T-2381
dc.subject Software Engineering en_US
dc.subject LDA steps en_US
dc.subject Similarity matrix BERT vs LDA en_US
dc.title Urdu Topic Modeling Using Transformer-Based Attention Networks en_US
dc.type MS Thesis en_US
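
The abstract above compares LDA topics and multilingual-BERT topics by their coherence scores. As a rough illustration of how such a comparison is typically run, here is a minimal sketch using gensim's CoherenceModel. The toy corpus, topic word lists, and hyperparameters are illustrative assumptions, not the thesis's actual data or settings; the transformer-side topic words are assumed to come from clustering multilingual BERT embeddings (e.g., via a BERTopic-style pipeline), which is one common way to realize the approach the abstract names.

# A minimal sketch of the LDA-vs-transformer coherence comparison described
# in the abstract. The toy corpus, topic word lists, and hyperparameters are
# illustrative assumptions, not the thesis's actual data or settings.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Hypothetical pre-tokenized Urdu documents (one token list per document).
docs = [
    ["حکومت", "معیشت", "بجٹ", "ٹیکس"],
    ["کرکٹ", "میچ", "ٹیم", "کھلاڑی"],
    ["معیشت", "ٹیکس", "حکومت", "قرض"],
    ["ٹیم", "کھلاڑی", "کرکٹ", "اسکور"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Statistical baseline: train LDA and score its topics with c_v coherence.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10,
               random_state=0)
lda_cv = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                        coherence="c_v").get_coherence()

# Transformer side: we assume `bert_topics` holds the top words of each
# topic derived from multilingual BERT embeddings (e.g., via BERTopic),
# scored with the same measure for a like-for-like comparison.
bert_topics = [
    ["معیشت", "حکومت", "ٹیکس"],
    ["کرکٹ", "ٹیم", "کھلاڑی"],
]
bert_cv = CoherenceModel(topics=bert_topics, texts=docs,
                         dictionary=dictionary,
                         coherence="c_v").get_coherence()

print(f"LDA c_v coherence:  {lda_cv:.3f}")
print(f"BERT c_v coherence: {bert_cv:.3f}")

Scoring both models with the same coherence measure (c_v here) over the same corpus is what makes the comparison meaningful: a higher score indicates topic word sets that co-occur more consistently, which is the interpretability claim the abstract makes.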

