Abstract:
Developing an OCR system for the Urdu language has been a challenging task for
researchers over the last few years. The highly complex nature of the Urdu writing system
is one of the prime reasons. Unlike English, images of Urdu text are difficult to interpret
or manipulate properly. Retrieving text, separating diacritics, and similar operations
become almost impossible without adequate domain knowledge of the field. In view of these
limitations, the proposed work presents a segmentation-free approach using
ligature-based recognition for various font sizes and writing styles of Urdu. A binary
image of the Urdu text
is separated into individual lines. Connected component labeling is then applied to the
segmented lines to extract the ligatures along with their diacritics. After extraction,
the diacritics are associated with their respective ligatures, and these combined
ligatures are treated as the basic recognition unit. A total of 2017 clusters are used in
our research; half of them serve as training data and the remainder as test data. The
Discrete Fourier Transform
(DFT) is used to extract feature vectors for the data set, and a K-Nearest Neighbor
classifier finds the closest match to a query ligature. The proposed system handles five
types of diacritics: dots in different numbers and positions, hamza (ء), toay (ط), and the
diacritics attached to haey (ہا) and gaaf (گ). The system was evaluated on the 70595 most
commonly used ligatures of Urdu script and recognized Urdu ligatures with an accuracy of
98.6%.
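
The feature-extraction and matching steps described above can be illustrated with a
minimal Python sketch. It assumes each ligature has already been reduced to a
one-dimensional signal (here a hypothetical projection profile). The abstract does not
state which signal the DFT is applied to, how many coefficients are kept, or the value of
K, so the names dft_features, knn_predict, n_coeffs, and k, as well as the toy data, are
illustrative assumptions and not the authors' implementation.

```python
import cmath
import math
from collections import Counter


def dft_features(signal, n_coeffs=4):
    """Magnitudes of the first n_coeffs DFT coefficients, scaled by signal length.

    n_coeffs is an assumed parameter; the source does not give the feature size.
    """
    n = len(signal)
    feats = []
    for k in range(n_coeffs):
        # Direct evaluation of the DFT definition for frequency index k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        feats.append(abs(coeff) / n)
    return feats


def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors nearest to query (Euclidean).

    train is a list of (feature_vector, label) pairs.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy profiles standing in for two ligature clusters (made-up values).
profiles = {
    "ligature_a": [0, 1, 3, 5, 3, 1, 0, 0],
    "ligature_b": [4, 4, 0, 0, 4, 4, 0, 0],
}
train = [(dft_features(p), label) for label, p in profiles.items()]

# A query profile that differs slightly from ligature_a.
query = [0, 1, 3, 4, 3, 1, 0, 0]
print(knn_predict(train, dft_features(query)))
```

Using coefficient magnitudes discards phase, which makes the features insensitive to
circular shifts of the signal. Whether the proposed system makes the same choice is not
stated in the abstract.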