Abstract:
Optical Character Recognition (OCR) has been an attractive research area for the last three
decades, and mature OCR systems reporting recognition rates close to 100% are available for
many scripts/languages of the world today. Despite these developments, research on recognition
of text in many languages, Urdu among them, is still in its early days. The existing
literature on Urdu OCR is sparse and either addresses only isolated characters or considers restricted vocabularies
in fixed font sizes. This research presents a segmentation-free and size-invariant technique for
recognition of Urdu words in Nastaliq font using ligatures as units of recognition. Connected
component labeling is applied to binarized images of Urdu text to extract ligatures, which are then
separated into primary ligatures and diacritics. Ligatures extracted from a set of documents are
represented by profile and projection features and grouped into clusters using Dynamic Time
Warping (DTW) as the (dis)similarity measure. A total of 250 clusters of frequent Urdu ligatures
are considered in our study. These clusters are used to train a separate right-to-left
Hidden Markov Model (HMM) for each ligature. Ligatures (main body as well as diacritics) of
the query word are recognized by their respective HMMs. Using position information, diacritics
are associated with their corresponding ligatures, which are then validated against a dictionary.
The Unicode string of the complete word is finally written to a text file. Evaluated on
100 query words, the proposed system achieved promising results at both the ligature and word levels.
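
To illustrate the dissimilarity measure used in the clustering step, the following is a minimal sketch of Dynamic Time Warping applied to a projection feature of a binarized ligature image. The function names, the use of a single vertical projection profile, and the absolute-difference local cost are assumptions made for illustration. They are not taken from the system described above, which may use different features and cost definitions.

```python
def projection_profile(img):
    """Vertical projection: count of ink pixels in each column.

    `img` is assumed to be a binarized image given as a list of rows,
    with 1 for ink and 0 for background (a hypothetical representation).
    """
    return [sum(col) for col in zip(*img)]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two 1-D feature sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # assumed local cost
            D[i][j] = cost + min(D[i - 1][j],      # advance in a only
                                 D[i][j - 1],      # advance in b only
                                 D[i - 1][j - 1])  # advance in both
    return D[n][m]


# Two toy "ligatures": the second is the first with every column duplicated.
small = [[1, 0, 1],
         [1, 1, 1]]
wide = [[1, 1, 0, 0, 1, 1],
        [1, 1, 1, 1, 1, 1]]

d = dtw_distance(projection_profile(small), projection_profile(wide))
```

Because DTW allows one element of a sequence to align with several elements of the other, the horizontally stretched profile in this toy case aligns with the original at zero cost. This is one reason an elastic measure of this kind suits comparing ligatures of different widths.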