| dc.contributor.author | Safia Shabbir, 01-244112-023 | |
| dc.date.accessioned | 2017-07-27T06:19:11Z | |
| dc.date.available | 2017-07-27T06:19:11Z | |
| dc.date.issued | 2014 | |
| dc.identifier.uri | http://hdl.handle.net/123456789/3085 | |
| dc.description | Supervised by Dr. Imran Ahmed Siddiqi | en_US |
| dc.description.abstract | Optical Character Recognition (OCR) has been an attractive research area for the last three decades, and mature OCR systems reporting recognition rates close to 100% are available today for many of the world's scripts and languages. Despite these developments, research on text recognition in many languages, Urdu among them, is still in its early days. The existing literature on Urdu OCR either addresses only isolated characters or considers small vocabularies in fixed font sizes. This research presents a segmentation-free and size-invariant technique for recognizing Urdu words in the Nastaliq font, using ligatures as the units of recognition. Connected component labeling is applied to binarized images of Urdu text to extract ligatures, which are separated into primary ligatures and diacritics. Ligatures extracted from a set of documents are represented by profile and projection features and grouped into clusters using Dynamic Time Warping (DTW) as the (dis)similarity measure. A total of 250 clusters of frequent Urdu ligatures are considered in our study. These clusters serve as training data for a separate right-to-left Hidden Markov Model (HMM) for each ligature. The ligatures of the query word (main bodies as well as diacritics) are recognized by their respective HMMs. Using position information, diacritics are associated with their corresponding ligatures, which are then validated against a dictionary. The Unicode representation of the complete word is finally written to a text file. Evaluated on 100 query words, the proposed system achieved promising results at both the ligature and word levels. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Software Engineering, Bahria University Engineering School Islamabad | en_US |
| dc.relation.ispartofseries | MS SE;T-0688 | |
| dc.subject | Software Engineering | en_US |
| dc.title | Optical character recognition system for Urdu words in Nastaliq font (T-0688) (MFN 4017) | en_US |
| dc.type | MS Thesis | en_US |
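
The abstract names Dynamic Time Warping over projection features as the (dis)similarity measure used to cluster ligatures. The sketch below is not taken from the thesis; it is a minimal illustration of that idea, assuming a binarized ligature image given as a list of rows of 0/1 values, a vertical projection (count of ink pixels per column) as the only feature, and absolute difference as the local cost. The names `vertical_projection` and `dtw_distance` are hypothetical.

```python
def vertical_projection(image):
    """Count ink pixels (value 1) in each column of a binarized image.

    `image` is assumed to be a non-empty list of equal-length rows of 0/1.
    """
    return [sum(column) for column in zip(*image)]


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences.

    Local cost is the absolute difference; no window constraint is applied.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Two toy "ligatures": the second is the first stretched horizontally.
narrow = [[0, 1, 1],
          [0, 1, 0]]
wide = [[0, 0, 1, 1, 1, 1],
        [0, 0, 1, 1, 0, 0]]

distance = dtw_distance(vertical_projection(narrow), vertical_projection(wide))
```

Because DTW lets one column of the narrow profile align with several columns of the wide one, a purely horizontal stretch adds no cost here, which is one reason the measure suits sequences of differing length. A real size change also scales heights, so the profiles would need normalizing first; how the thesis handles that is not stated in this record.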