Abstract:
Textual content appearing in videos provides a rich index for semantic retrieval of videos (from archives), generation of alerts (from live streams), as well as high-level applications like opinion mining and content summarization. Key components of a text-based video retrieval system include detection (localization) of text regions and recognition of text through Video Optical Character Recognition (V-OCR). While mature detection and recognition systems are available for text in non-cursive scripts, research on cursive scripts (such as Urdu) is fairly limited and marked by many challenges, including complex and overlapping ligatures, context-dependent shape variations, and the presence of a large number of dots and diacritics. This research aims at the detection and recognition of artificial (caption) Urdu text appearing in video frames, primarily targeting local news channels. Leveraging recent advancements in deep neural networks (DNNs), we propose robust techniques to detect and recognize Urdu caption text in frames with bilingual (English and Urdu) textual content, the most common scenario on the majority of our news channels. Detection of textual content relies on adapting deep convolutional neural network (CNN) based object detectors for text localization. To cater to multiple scripts, text detection and script identification are combined in a single end-to-end trainable system. For recognition, we employ an implicit-segmentation-based analytical technique that relies on a combination of a CNN and a recurrent neural network (RNN) with a connectionist temporal classification (CTC) layer. Images of text lines extracted from video frames, along with ground-truth transcriptions, are fed to the CNN for feature extraction. The extracted feature sequences are then employed by the recurrent part of the network to predict the likely sequence of characters. Finally, the CTC layer converts raw predictions into meaningful Urdu text.
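The final step described above, where CTC turns raw per-timestep predictions into text, can be illustrated with a minimal sketch of standard greedy CTC decoding: collapse repeated labels, then drop blanks. This is an illustrative example only, not the paper's implementation; the label indices, the character table, and the choice of blank index 0 are hypothetical assumptions.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a sequence of per-timestep label indices (e.g. the
    argmax of the RNN's output at each frame) into a label sequence,
    per the standard CTC decoding rule: merge consecutive repeats,
    then remove the blank symbol."""
    decoded = []
    prev = blank
    for label in frame_labels:
        # Keep a label only if it is not blank and differs from the
        # previous timestep's label (repeats encode the same character).
        if label != blank and label != prev:
            decoded.append(label)
        prev = label
    return decoded


# Hypothetical toy alphabet mapping indices to Urdu characters.
ALPHABET = {1: "ا", 2: "ب", 3: "پ"}

# Raw per-frame predictions: blanks (0) and repeats are collapsed away.
raw = [0, 1, 1, 0, 2, 2, 3, 0]
text = "".join(ALPHABET[i] for i in ctc_greedy_decode(raw))
```

Note that a blank between two identical labels preserves a genuinely doubled character (e.g. `[1, 0, 1]` decodes to two characters), which is precisely why CTC introduces the blank symbol.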