Text Summarization for Roman Urdu

Laraib Kaleem, 01-249212-005

dc.contributor.author	Laraib Kaleem, 01-249212-005
dc.date.accessioned	2023-12-18T11:05:24Z
dc.date.available	2023-12-18T11:05:24Z
dc.date.issued	2023
dc.identifier.uri	http://hdl.handle.net/123456789/16832
dc.description	Supervised by Dr. Arif-Ur-Rahman	en_US
dc.description.abstract	Recent research has shown that many languages in multilingual communities have been written in Roman script for generations. Motivated by this challenge, we work on Abstractive Text Summarization (ATS) for Roman Urdu (RU). The RU data is gathered from news articles. Ground-truth summaries for Roman Urdu are limited, so we follow two approaches. In the first, the dataset is manually transliterated into Roman Urdu with the help of tools, Google Bard is used to generate baseline summaries, and the outcomes are then evaluated. The second approach fine-tunes the pretrained transformer-based models T5-small and Bert-base-uncased as State-of-the-Art (SOTA) summarization models. For performance evaluation, we explored three methods, including measuring similarity against the baseline results and using Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction. The Natural Language Processing (NLP) preprocessing phases are tokenization, then punctuation handling, and finally conversion of loanwords into the format required by the models. Because accuracy alone is not the best way to evaluate a predictive model, we also carry out intrinsic and extrinsic evaluations to assess the predicted output, and we report the models' training and testing losses. Keywords: Baseline, Roman Urdu (RU), Natural Language Processing (NLP), Abstractive Text Summarization (ATS), State-of-the-Art (SOTA).	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Sciences	en_US
dc.relation.ispartofseries	MS (DS);T-1106
dc.subject	Text	en_US
dc.subject	Summarization	en_US
dc.subject	Roman Urdu	en_US
dc.title	Text Summarization for Roman Urdu	en_US
dc.type	Thesis	en_US
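
The abstract mentions comparing generated summaries against baseline summaries using similarity and TF-IDF features. The record does not say how this was implemented, so the following is only a minimal sketch of one common way to do it: TF-IDF weighting followed by cosine similarity, using the Python standard library. The function names, the smoothed IDF formula, and the Roman Urdu sample sentences are illustrative assumptions, not taken from the thesis.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document.

    Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1, so that terms
    appearing in every document still receive a non-zero weight.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def summary_similarity(generated, baseline):
    """TF-IDF cosine similarity between a generated and a baseline summary."""
    vec_generated, vec_baseline = tfidf_vectors([generated, baseline])
    return cosine(vec_generated, vec_baseline)


if __name__ == "__main__":
    # Hypothetical Roman Urdu summaries, written only for this illustration.
    baseline = "Barish ne sheher mein tabahi macha di."
    generated = "Sheher mein barish se tabahi hui."
    print(round(summary_similarity(generated, baseline), 3))
```

A score near 1 means the two summaries share most of their weighted vocabulary, and a score of 0 means they share no terms. Because it only compares surface tokens, a measure like this would not credit an abstractive summary that paraphrases the baseline with different words.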