Optical character recognition system for Urdu words in Nastaliq Font (T-0688) (MFN 4017)

Safia Shabbir, 01-244112-023

dc.contributor.author	Safia Shabbir, 01-244112-023
dc.date.accessioned	2017-07-27T06:19:11Z
dc.date.available	2017-07-27T06:19:11Z
dc.date.issued	2014
dc.identifier.uri	http://hdl.handle.net/123456789/3085
dc.description	Supervised by Dr. Imran Ahmed Siddiqi	en_US
dc.description.abstract	Optical Character Recognition (OCR) has been an attractive research area for the last three decades, and mature OCR systems reporting recognition rates close to 100% are available today for many scripts and languages of the world. Despite these developments, research on text recognition in many languages is still in its early days, Urdu being one of them. The existing literature on Urdu OCR is either restricted to isolated characters or considers small vocabularies in fixed font sizes. This research presents a segmentation-free and size-invariant technique for recognition of Urdu words in Nastaliq font using ligatures as units of recognition. Connected component labeling is applied to binarized images of Urdu text to extract ligatures, which are separated into primary ligatures and diacritics. Ligatures extracted from a set of documents are represented by profile and projection features and grouped into clusters using Dynamic Time Warping (DTW) as the (dis)similarity measure. A total of 250 clusters of frequent Urdu ligatures are considered in our study. These clusters serve as training data to train a separate right-to-left Hidden Markov Model (HMM) for each ligature. Ligatures (main body as well as diacritics) of the query word are recognized by their respective HMMs. Using position information, diacritics are associated with their corresponding ligatures, which are then validated against a dictionary. The Unicode representation of the complete word is finally written to a text file. The proposed system, evaluated on 100 query words, achieved promising results at both ligature-level and word-level recognition.	en_US
dc.language.iso	en	en_US
dc.publisher	Software Engineering, Bahria University Engineering School Islamabad	en_US
dc.relation.ispartofseries	MS SE;T-0688
dc.subject	Software Engineering	en_US
dc.title	Optical character recognition system for Urdu words in Nastaliq Font (T-0688) (MFN 4017)	en_US
dc.type	MS Thesis	en_US
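
The abstract mentions projection features compared with Dynamic Time Warping (DTW) as the (dis)similarity measure. The following is a minimal illustrative sketch of that idea, not the thesis implementation: the function names `vertical_projection` and `dtw_distance`, the absolute-difference local cost, and the representation of a binarized ligature as a list of 0/1 rows are all assumptions made here for illustration.

```python
from typing import List, Sequence


def vertical_projection(image: Sequence[Sequence[int]]) -> List[int]:
    """Count foreground (1) pixels in each column of a binary image.

    Assumption: the image is a list of equal-length rows of 0/1 values.
    """
    return [sum(column) for column in zip(*image)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-programming DTW between two 1-D feature sequences.

    Assumption: local cost is the absolute difference |a[i] - b[j]|;
    the thesis may use a different cost or path constraints.
    """
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # prev[j] is the best cumulative cost of aligning the rows processed
    # so far with b[:j]; prev[0] = 0 seeds the empty alignment.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    # Two toy "ligature" images of different widths (hypothetical data).
    narrow = [[0, 1, 1],
              [1, 1, 0]]
    wide = [[0, 1, 1, 1],
            [1, 1, 1, 0]]
    p_narrow = vertical_projection(narrow)
    p_wide = vertical_projection(wide)
    print(p_narrow, p_wide, dtw_distance(p_narrow, p_wide))
```

Because DTW allows one sequence to be stretched against the other, profiles of different lengths can be compared directly, which is one plausible reason it suits ligatures rendered at different widths.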