Optical character recognition for printed Urdu nastaliq font (T-0010) (Old 8755)

dc.contributor.author Israr Uddin, 01-281121-005
dc.date.accessioned 2020-08-15T03:03:04Z
dc.date.available 2020-08-15T03:03:04Z
dc.date.issued 2019
dc.identifier.uri http://hdl.handle.net/123456789/9950
dc.description Supervised by Dr. Imran Siddiqi en_US
dc.description.abstract Optical Character Recognition (OCR) is one of the most investigated pattern classification problems and has received remarkable research attention for more than half a century. From the simplest systems recognizing isolated digits to end-to-end recognition systems, applications of OCR range from postal mail sorting to reading text in scene images to facilitate autonomous navigation or assist the visually impaired. Despite tremendous research endeavors and the availability of commercial recognition engines for many scripts, recognition of cursive scripts remains an open and challenging research problem, mainly due to the complexity of the script, segmentation issues, and the large number of classes to recognize. Among these scripts, Urdu is the subject of our study. More specifically, this study investigates the recognition of printed Urdu text in the Nastaliq style, the most widely employed script for Urdu text, which is more complex than the Naskh style of Arabic. This work presents a holistic (segmentation-free) technique that exploits ligatures (partial words) as units of recognition. Urdu has more than 26,000 unique ligatures; many of them, however, share the same main body (primary ligature) and differ only in the number and position of dots and diacritics (secondary ligatures). We exploit this idea to recognize the primary and secondary ligatures separately and later re-associate the two to recognize the complete ligature. Recognition is carried out using two techniques. The first is based on hand-crafted statistical features and hidden Markov models (HMMs): features extracted using sliding windows are used to train a separate model for each ligature class, the feature sequence of a query ligature is fed to all the models, and the ligature is assigned to the class whose model reports the maximum probability.
The second technique employs Convolutional Neural Networks (CNNs) to automatically extract useful feature representations from the classes and recognize the ligatures. We investigated the performance of a number of pre-trained networks using transfer learning techniques and trained our own set of networks from scratch as well. en_US
dc.language.iso en en_US
dc.publisher Bahria University Islamabad Campus en_US
dc.relation.ispartofseries PHD (CE);T-0010
dc.subject Computer Engineering en_US
dc.title Optical character recognition for printed Urdu nastaliq font (T-0010) (Old 8755) en_US
dc.type PhD Thesis en_US
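
The HMM-based recognition step described in the abstract (sliding-window features, one model per ligature class, decision by maximum probability) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy binary image, the single "ink / no ink" symbol per window standing in for the hand-crafted statistical features, the hand-set model parameters (a real system would train them), and the hypothetical names sliding_window_symbols, log_likelihood, classify, and MODELS.

```python
import math

def sliding_window_symbols(image, width=1):
    """Slide a window of `width` columns left to right over a binary image
    (a list of rows of 0/1) and emit one discrete symbol per window:
    1 if the window contains ink, 0 otherwise. This is a toy stand-in
    for real statistical features."""
    n_cols = len(image[0])
    symbols = []
    for start in range(0, n_cols - width + 1, width):
        ink = sum(row[c] for row in image for c in range(start, start + width))
        symbols.append(1 if ink > 0 else 0)
    return symbols

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM
    with initial probabilities `start`, transition matrix `trans`, and
    per-state emission probabilities `emit`."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob

def classify(obs, models):
    """Score the sequence with every class model and return the label
    whose model reports the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))

# Hypothetical two-state models with hand-set parameters (start, trans, emit).
MODELS = {
    "ligature_a": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]]),
    "ligature_b": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
}

if __name__ == "__main__":
    query = [[1, 1, 0, 1],
             [0, 1, 0, 1]]
    print(classify(sliding_window_symbols(query), MODELS))
```

Working in log space with per-step rescaling keeps long feature sequences from underflowing, which matters when the same query is scored against thousands of class models.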