dc.contributor.author | Sabia Khan, 01-241211-008 | |
dc.date.accessioned | 2023-09-08T11:01:36Z | |
dc.date.available | 2023-09-08T11:01:36Z | |
dc.date.issued | 2023 | |
dc.identifier.uri | http://hdl.handle.net/123456789/16158 | |
dc.description | Supervised by Dr. Raja Muhammad Suleman | en_US |
dc.description.abstract | Urdu, the national language of Pakistan and one of the most widely spoken languages of the Indian subcontinent, is considered a low-resourced language owing to the scarcity of available digital resources. Natural Language Processing (NLP) offers a range of tasks that help automate the understanding and generation of text in such languages. Topic Modeling is one such task, aimed at discovering topics (themes of discussion) within unstructured text. For topic modeling in Urdu, most researchers have focused on Latent Dirichlet Allocation (LDA), a statistical topic modeling technique. Such techniques are useful for low-resourced languages because they require relatively little data to train. However, Transformer-based models have become the state of the art for many NLP tasks, including Topic Modeling, and have seen wide adoption thanks to the availability of many pre-trained multilingual models. To the best of our knowledge, no prior research has exploited Transformer-based models for topic modeling in Urdu. In this research we analyze and compare two Topic Modeling techniques, LDA and a Transformer-based approach (multilingual BERT), on the basis of their performance, coherence scores, and topic generation. Our results show that the Transformer-based model achieves a higher coherence score than the LDA model, meaning that the topics it generates are more interpretable by humans. | en_US |
dc.language.iso | en | en_US |
dc.publisher | Software Engineering, Bahria University Engineering School Islamabad | en_US |
dc.relation.ispartofseries | MS(SE);T-2381 | |
dc.subject | Software Engineering | en_US |
dc.subject | LDA steps | en_US |
dc.subject | Similarity matrix BERT vs LDA | en_US |
dc.title | Urdu Topic Modeling Using Transformer-Based Attention Networks | en_US |
dc.type | MS Thesis | en_US |
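The abstract above compares LDA and BERT-based topics by their coherence scores. As a minimal sketch of what such a score measures, the following computes a UMass-style coherence (average log co-occurrence probability over word pairs in a topic) on a tiny hypothetical corpus; the corpus, topic word lists, and function name are illustrative assumptions, not the thesis's actual evaluation pipeline, which targets Urdu text.

```python
# Sketch of a UMass-style topic-coherence score: topics whose words
# co-occur frequently in documents score higher, i.e. are judged more
# human-interpretable. Toy English corpus for illustration only.
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """Average log conditional co-occurrence probability over word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score, pairs = 0.0, 0
    for w_i, w_j in combinations(topic_words, 2):
        df_j = doc_freq(w_j)
        if df_j == 0:
            continue  # word never appears; pair carries no evidence
        score += math.log((doc_freq(w_i, w_j) + 1) / df_j)
        pairs += 1
    return score / pairs if pairs else 0.0

corpus = [
    ["cricket", "match", "team", "score"],
    ["cricket", "team", "win", "score"],
    ["economy", "budget", "tax", "policy"],
    ["budget", "policy", "tax", "growth"],
]
coherent = umass_coherence(["cricket", "team", "score"], corpus)
mixed = umass_coherence(["cricket", "budget", "growth"], corpus)
print(coherent > mixed)  # a well-formed topic scores higher
```

In practice this comparison would be run with library implementations (e.g. a coherence evaluator over LDA topics versus topics derived from multilingual BERT embeddings), but the principle is the same: higher average co-occurrence among a topic's top words yields a higher coherence score.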