Optical character recognition system for Urdu words in Nastaliq Font (T-0688) (MFN 4017)


dc.contributor.author Safia Shabbir, 01-244112-023
dc.date.accessioned 2017-07-27T06:19:11Z
dc.date.available 2017-07-27T06:19:11Z
dc.date.issued 2014
dc.identifier.uri http://hdl.handle.net/123456789/3085
dc.description Supervised by Dr. Imran Ahmed Siddiqi en_US
dc.description.abstract Optical Character Recognition (OCR) has been an attractive research area for the last three decades, and mature OCR systems reporting recognition rates close to 100% are available today for many of the world's scripts and languages. Despite these developments, research on text recognition in many languages, Urdu among them, is still in its early days. The existing literature on Urdu OCR is sparse and either addresses isolated characters only or considers limited vocabularies in fixed font sizes. This research presents a segmentation-free and size-invariant technique for recognizing Urdu words in the Nastaliq font using ligatures as the units of recognition. Connected component labeling is applied to binarized images of Urdu text to extract ligatures, which are separated into primary ligatures and diacritics. Ligatures extracted from a set of documents are represented by profile and projection features and grouped into clusters using Dynamic Time Warping (DTW) as the (dis)similarity measure. A total of 250 clusters of frequent Urdu ligatures are considered in this study. These clusters serve as training data for a separate right-to-left Hidden Markov Model (HMM) for each ligature. The ligatures of a query word (main bodies as well as diacritics) are recognized by their respective HMMs. Using position information, diacritics are associated with their corresponding ligatures, which are then validated against a dictionary. The Unicode of the complete word is finally written to a text file. Evaluated on 100 query words, the proposed system achieved promising results at both ligature-level and word-level recognition. en_US
dc.language.iso en en_US
dc.publisher Software Engineering, Bahria University Engineering School Islamabad en_US
dc.relation.ispartofseries MS SE;T-0688
dc.subject Software Engineering en_US
dc.title Optical character recognition system for Urdu words in Nastaliq Font (T-0688) (MFN 4017) en_US
dc.type MS Thesis en_US
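
The abstract describes grouping ligatures by comparing projection features with Dynamic Time Warping. The following is a minimal sketch of that idea, not the thesis implementation: the function names `column_projection` and `dtw_distance`, the choice of absolute difference as the local cost, and the toy binary images are all assumptions made for illustration.

```python
from math import inf


def column_projection(image):
    """Count foreground pixels (1s) in each column of a binary image.

    `image` is a list of equal-length rows of 0/1 values (hypothetical input
    format; the thesis may represent ligature images differently).
    """
    return [sum(column) for column in zip(*image)]


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences.

    Local cost is the absolute difference (an assumption). Runs in
    O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Two toy "ligature" images: the second is the first with one column repeated.
narrow = [[1, 0, 1],
          [1, 1, 0]]
wide = [[1, 0, 0, 1],
        [1, 1, 1, 0]]

profile_narrow = column_projection(narrow)
profile_wide = column_projection(wide)
distance = dtw_distance(profile_narrow, profile_wide)
```

Because DTW may align one element with several elements of the other sequence, profiles of different lengths can still be compared directly, which is one reason it suits shapes that vary in width. A clustering step would then group ligatures whose pairwise distances fall below some threshold; the threshold and the clustering procedure are not given in the abstract.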

