Abstract:
Recognition of text in non-cursive scripts has received significant research attention in recent years, and advances in deep learning have enabled robust end-to-end recognition systems. Although optical character recognition (OCR) systems have matured considerably, recognition of cursive text remains challenging, especially for video caption text. Most research on this subject targets the development of recognition engines; fewer efforts have been made to post-process the noisy output of the OCR to improve recognition rates. This study targets post-processing of the raw, noisy output of a video OCR in the context of our local News channels. More specifically, we take the output of an Urdu caption text recognizer and propose post-processing techniques to reduce the word error rate. Words in the OCR output are segmented using a supervised learning method, and the segmented words are validated against a dictionary. Incorrect instances are identified and corrected using dictionary-based correction, a language model, and a word-to-vector model. Once the text lines are corrected, as a secondary objective, we categorize the News into one of a set of pre-defined categories using a long short-term memory (LSTM) based model. The experimental study shows that introducing the post-processing step reduces OCR errors from 18% to less than 1%. Likewise, the LSTM-based model achieves a classification rate of 82%. We expect the findings of this study to be useful for developing applications on top of OCR for a low-resource language like Urdu.
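As an illustration of the dictionary validation and dictionary-based correction steps mentioned above, the following is a minimal sketch only. It assumes that correction picks the nearest dictionary entry by Levenshtein edit distance; the abstract does not state which distance or ranking is used. The function names, the `max_distance` threshold, and the Latin-script toy vocabulary are all hypothetical placeholders rather than the authors' implementation or data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct_word(word: str, dictionary: set, max_distance: int = 2) -> str:
    """Return the word if it is valid, else the closest dictionary entry.

    Assumption: a candidate is accepted only within `max_distance` edits;
    otherwise the word is left unchanged for later stages to handle.
    """
    if not dictionary or word in dictionary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(dictionary, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_distance else word


def correct_line(words: list, dictionary: set, max_distance: int = 2) -> list:
    """Validate and correct each segmented word of one OCR text line."""
    return [correct_word(w, dictionary, max_distance) for w in words]


# Toy vocabulary standing in for an Urdu dictionary (hypothetical data).
toy_dictionary = {"news", "channel", "urdu"}
corrected = correct_line(["newz", "chanel", "urdu"], toy_dictionary)
```

In a pipeline like the one described, words that remain out of vocabulary after this step would be passed on to the language model and word-to-vector stages; those stages are not sketched here.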