Text Summarization for Roman Urdu

dc.contributor.author Laraib Kaleem, 01-249212-005
dc.date.accessioned 2023-12-18T11:05:24Z
dc.date.available 2023-12-18T11:05:24Z
dc.date.issued 2023
dc.identifier.uri http://hdl.handle.net/123456789/16832
dc.description Supervised by Dr. Arif-Ur-Rahman en_US
dc.description.abstract Recent research has shown that many languages are written in Roman script across generations. Motivated by this challenge, this work addresses Abstractive Text Summarization (ATS) for Roman Urdu (RU), with text gathered from news articles. Because ground-truth Roman Urdu summaries are scarce, two strategies were adopted. The first was a manual approach: the dataset was transliterated into Roman Urdu using transliteration tools, Google Bard was used to generate baseline summaries, and the outcomes were then evaluated. The second approach fine-tunes the pretrained transformer-based models T5-small and BERT-base-uncased as State-of-the-Art (SOTA) summarization models. For performance evaluation, three methods were explored, including measuring similarity against the baseline results and applying the Term Frequency-Inverse Document Frequency (TF-IDF) feature-extraction technique. The Natural Language Processing (NLP) preprocessing phases consist of tokenization, punctuation removal, and conversion of loanwords into the format required by the models. Since accuracy alone is not the best way to evaluate a generative model, intrinsic and extrinsic evaluations are also carried out to analyse the predicted output, together with the models' training and testing losses. Keywords: Baseline, Roman Urdu (RU), Natural Language Processing (NLP), Abstractive Text Summarization (ATS), State-of-the-Art (SOTA). en_US
dc.language.iso en en_US
dc.publisher Computer Sciences en_US
dc.relation.ispartofseries MS (DS);T-02067
dc.subject Text en_US
dc.subject Summarization en_US
dc.subject Roman Urdu en_US
dc.title Text Summarization for Roman Urdu en_US
dc.type Thesis en_US
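
A minimal sketch of the pipeline described in the abstract above, assuming the Hugging Face transformers and scikit-learn libraries: a T5-small checkpoint generates an abstractive summary of a Roman Urdu article, and TF-IDF cosine similarity scores it against a baseline (e.g. Google Bard) summary. The model name, placeholder texts, and helper functions are illustrative assumptions, not the thesis implementation.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

MODEL_NAME = "t5-small"  # assumption: the thesis would load its fine-tuned checkpoint here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def summarize(article: str, max_len: int = 64) -> str:
    # Abstractive summary via beam search; T5 expects a "summarize: " task prefix.
    inputs = tokenizer("summarize: " + article, return_tensors="pt",
                       truncation=True, max_length=512)
    ids = model.generate(**inputs, max_length=max_len, num_beams=4,
                         early_stopping=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def tfidf_similarity(generated: str, baseline: str) -> float:
    # Cosine similarity between the TF-IDF vectors of the two summaries.
    vectors = TfidfVectorizer().fit_transform([generated, baseline])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

article = "..."           # Roman Urdu news article (placeholder)
baseline_summary = "..."  # baseline summary, e.g. from Google Bard (placeholder)

candidate = summarize(article)
print("TF-IDF similarity to baseline:", tfidf_similarity(candidate, baseline_summary))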

