Post-Processing and Classification of Urdu News Ticker Text from Raw OCR Output

Ubaid Ur Rahman, 01-243192-013

dc.contributor.author	Ubaid Ur Rahman, 01-243192-013
dc.date.accessioned	2022-01-17T10:37:59Z
dc.date.available	2022-01-17T10:37:59Z
dc.date.issued	2021
dc.identifier.uri	http://hdl.handle.net/123456789/11651
dc.description	Supervised by Dr. Imran Siddiqi	en_US
dc.description.abstract	Recognition of text in non-cursive scripts has received significant research attention in recent years. Thanks to advancements in different areas of deep (machine) learning, robust end-to-end recognition systems have been developed. While optical character recognition (OCR) systems have matured significantly over the years, recognition of cursive text remains challenging, especially in the context of video caption text. Most research on this subject targets the development of recognition engines; fewer efforts have been made to post-process the noisy output of the OCR to improve recognition rates. This study targets post-processing of the raw and noisy output of a video OCR in the context of our local news channels. More specifically, we take the output of an Urdu caption text recognizer and propose post-processing techniques to reduce the word error rate. Words in the output of the OCR are segmented using a supervised learning method, and the segmented words are validated against a dictionary. Incorrect instances are identified and corrected using dictionary-based correction, a language model, and a word-to-vector model. Once the text lines are corrected, as a secondary objective, we also categorize the news into one of the pre-defined categories using a long short-term memory (LSTM) based model. The experimental study shows that introducing a post-processing step reduces the OCR errors from 18% to less than 1%. Likewise, a classification rate of 82% is reported by the LSTM-based model. It is expected that the findings of this study will be useful in developing various applications on top of the OCR for a low-resource language like Urdu.	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Sciences BUIC	en_US
dc.relation.ispartofseries	MS (CS);T-9722
dc.subject	Post-Processing	en_US
dc.subject	Classification	en_US
dc.subject	Urdu News Ticker	en_US
dc.title	Post-Processing and Classification of Urdu News Ticker Text from Raw OCR Output	en_US
dc.type	MS Thesis	en_US
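
The abstract mentions validating segmented words against a dictionary, correcting the invalid ones, and measuring the word error rate. The following is only a minimal sketch of that idea, not the thesis's implementation: it assumes nearest-neighbour lookup by Levenshtein distance as the correction rule, uses a toy Latin-script lexicon in place of an Urdu dictionary, and leaves out the supervised word segmentation, the language model, and the word-to-vector model. All function names (`edit_distance`, `correct_word`, `correct_line`, `word_error_rate`) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct_word(word, lexicon):
    """Keep a word found in the lexicon; otherwise return the closest entry.

    Assumption: ties are broken alphabetically, which a real system would
    replace with frequency or language-model scores.
    """
    if word in lexicon:
        return word
    return min(sorted(lexicon), key=lambda cand: edit_distance(word, cand))


def correct_line(line, lexicon):
    """Correct every whitespace-separated token of an OCR text line."""
    return " ".join(correct_word(w, lexicon) for w in line.split())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Toy stand-in for an Urdu dictionary (assumption for illustration only).
lexicon = {"urdu", "news", "ticker", "text"}
reference = "urdu news ticker"
raw_ocr = "urdv news tickr"
corrected = correct_line(raw_ocr, lexicon)
```

In this toy case the word error rate can be compared before and after correction by calling `word_error_rate(reference, raw_ocr)` and `word_error_rate(reference, corrected)`. A dictionary-only rule like this cannot fix errors that produce another valid word, which is one reason a language model or embedding-based step is useful alongside it.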