Offline Optical Character Recognition for Urdu Script (T-0681) (MFN 4238)

Ayesha Rafiq, 01-244121-002

URI: http://hdl.handle.net/123456789/2859

Date: 2014

Abstract:

Developing an OCR system for the Urdu language has been a challenging task for researchers over the last few years, and the highly complex behavior of the Urdu writing system is one of the prime reasons. Unlike English, Urdu text images are difficult to interpret and manipulate properly. Retrieving text, sorting out diacritics, and other such functions become almost impossible without satisfactory domain knowledge of the field. In view of these limitations, the proposed work presents a segmentation-free approach using ligature-based recognition for various font sizes and different writing styles of Urdu. A binary image of Urdu text is first separated into individual lines. Connected component labeling is then applied to the segmented lines to extract ligatures along with their diacritics. After extraction, the diacritics are associated with their respective ligatures, and these associated ligatures are treated as the basic recognition unit. A total of 2017 clusters are used in our research; half of them serve as training data and the remainder as test data. Feature vectors for the data set are extracted using the Discrete Fourier Transform (DFT). K-Nearest Neighbor is used to find the closest node to the query ligature. The proposed system handles five types of diacritics, i.e., different numbers and positions of dots, hamza (ء), toay (ط), and diacritics connected with haey (ہا) and gaaf (گ). The proposed system was evaluated on the 70595 most commonly used ligatures of Urdu script and was found to recognize Urdu ligatures with an accuracy rate of 98.6%.
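The recognition steps named in the abstract (connected component labeling, DFT feature extraction, nearest-neighbor lookup) can be illustrated with a minimal sketch. This is not the thesis implementation. The abstract does not say which signal the DFT is applied to, so the column projection profile used here is an assumption, as are the 8-connectivity, the Euclidean distance, the number of coefficients, and every function name. Line segmentation and the association of diacritics with their ligatures are omitted.

```python
import cmath
import math
from collections import Counter, deque


def label_components(img):
    """Return the 8-connected foreground components of a binary image.

    img is a list of rows of 0/1 values; each component is a list of
    (row, col) pixels. 8-connectivity is an assumption of this sketch.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def column_profile(pixels):
    """Count foreground pixels per column inside the bounding box.

    Hypothetical choice of 1-D signal; the source does not specify one.
    """
    xs = [x for _, x in pixels]
    left = min(xs)
    profile = [0] * (max(xs) - left + 1)
    for x in xs:
        profile[x - left] += 1
    return profile


def dft_features(signal, n_coeffs=4):
    """Magnitudes of the first n_coeffs DFT coefficients, divided by N."""
    n = len(signal)
    feats = []
    for k in range(n_coeffs):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(signal))
        feats.append(abs(coeff) / n)
    return feats


def knn_predict(query, training, k=1):
    """Majority label among the k training vectors closest to query.

    training is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In use, each component returned by `label_components` would be reduced to a feature vector with `dft_features(column_profile(component))` and passed to `knn_predict` together with labeled training vectors. DFT magnitudes alone do not distinguish a profile from its mirror image, so a practical system would likely need richer features than this sketch provides.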