DSpace Repository

Post-Processing and Classification of Urdu News Ticker Text from Raw OCR Output

dc.contributor.author Ubaid Ur Rahman, 01-243192-013
dc.date.accessioned 2022-01-17T10:37:59Z
dc.date.available 2022-01-17T10:37:59Z
dc.date.issued 2021
dc.identifier.uri http://hdl.handle.net/123456789/11651
dc.description Supervised by Dr. Imran Siddiqi en_US
dc.description.abstract Recognition of text in non-cursive scripts has received significant research attention in recent years. Thanks to advances in different areas of deep (machine) learning, robust end-to-end recognition systems have been developed. While optical character recognition (OCR) systems have matured significantly over the years, recognition of cursive text remains challenging, especially in the context of video caption text. Most research on this subject targets the development of recognition engines; fewer efforts have been made to post-process the noisy output of the OCR to improve recognition rates. This study targets post-processing of the raw, noisy output of a video OCR in the context of our local news channels. More specifically, we take the output of an Urdu caption text recognizer and propose post-processing techniques to reduce the word error rate. Words in the OCR output are segmented using a supervised learning method, and the segmented words are validated against a dictionary. Incorrect instances are identified and corrected using dictionary-based correction, a language model, and a word-to-vector model. Once the text lines are corrected, as a secondary objective, we also categorize the news into one of the pre-defined categories using a long short-term memory (LSTM) based model. The experimental study of the system reveals that introducing a post-processing step reduces the OCR errors from 18% to less than 1%. Likewise, a classification rate of 82% is reported by the LSTM-based model. The findings of this study are expected to be useful in developing various applications on top of the OCR for a low-resource language like Urdu. en_US
dc.language.iso en en_US
dc.publisher Computer Sciences BUIC en_US
dc.relation.ispartofseries MS (CS);T-9722
dc.subject Post-Processing en_US
dc.subject Classification en_US
dc.subject Urdu News Ticker en_US
dc.title Post-Processing and Classification of Urdu News Ticker Text from Raw OCR Output en_US
dc.type MS Thesis en_US
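
The dictionary validation and dictionary-based correction steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes the words are already segmented by whitespace (the thesis uses a supervised segmenter), it substitutes Python's standard `difflib` string similarity for whatever correction measure the author used, and it omits the language-model and word-to-vector steps. The lexicon and the names `LEXICON`, `correct_word`, and `correct_line` are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon; a real system would load a large Urdu word list.
LEXICON = ["پاکستان", "حکومت", "وزیر", "اعظم", "خبر"]


def correct_word(word, lexicon, cutoff=0.8):
    """Validate a word against the lexicon; if absent, try the closest entry."""
    if word in lexicon:
        return word  # valid word, keep as-is
    # Assumption: similarity ratio from difflib stands in for the
    # dictionary-based correction described in the abstract.
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    # Leave the word unchanged when no entry is similar enough.
    return matches[0] if matches else word


def correct_line(line, lexicon=LEXICON):
    """Correct every whitespace-separated word in one OCR text line."""
    return " ".join(correct_word(w, lexicon) for w in line.split())
```

A word with no sufficiently similar lexicon entry is passed through untouched, which is where a language model or embedding-based step would take over in a fuller pipeline.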

