Information Retrieval System for Digitized Urdu Documents (T-0680) (MFN 4242)

Zakir Hussain, 01-244112-027

dc.contributor.author	Zakir Hussain, 01-244112-027
dc.date.accessioned	2017-07-20T07:17:42Z
dc.date.available	2017-07-20T07:17:42Z
dc.date.issued	2014
dc.identifier.uri	http://hdl.handle.net/123456789/2857
dc.description	Supervised by Dr. Imran Siddiqi	en_US
dc.description.abstract	The amount of digital information around us has grown remarkably during the last two decades, and almost every type of information can be accessed within a few clicks. Like other sources, paper documents have also been digitized, giving readers rapid access to them. Digitization of documents and books is only effective if it is complemented by a search mechanism that allows users to retrieve the desired content. This need has led to extensive research on Optical Character Recognition (OCR) systems, which convert document images into text and thereby enable search and retrieval. Although OCR has been an established research area for many years, for many scripts OCR systems are either non-existent or in the early stages of research. In some cases, recognition of text is very challenging due to the complexity of the script. To address these issues, word spotting has emerged as an attractive alternative to traditional OCR systems. Word spotting retrieves the documents containing occurrences of a given query word by matching the shapes of words, without any knowledge of their semantics. This work presents a word spotting based indexing and retrieval system for digitized Urdu documents. A document image containing Urdu text is segmented into ligatures, and each ligature is represented by a set of features. The ligatures are then grouped into clusters, and an artificial neural network is trained to discriminate between the different ligature classes. For indexing, a document is segmented into ligatures and each ligature is classified into one of the ligature classes. An index file is maintained for each cluster, storing all the occurrences (locations) of the ligature in a given document. During the retrieval phase, a query word presented to the system is segmented into ligatures and each ligature is matched against the existing clusters. For each ligature in the query word, the documents containing occurrences of that ligature are retrieved using the index file. Finally, the ligatures are merged into words and the retrieved documents are presented to the user. The developed system was used to index 35 Urdu documents containing more than 7,000 ligatures. Evaluations carried out on a total of 100 query words yielded a precision of about 87% and a recall of 93%.	en_US
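The indexing and retrieval steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that segmentation and neural-network classification have already reduced each document to a sequence of ligature cluster IDs, it keeps the per-cluster index in an in-memory dictionary rather than in index files, and it assumes that "merging ligatures into words" means requiring the query ligatures to occur at consecutive positions. The names `build_index`, `search`, and `precision_recall` are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each cluster ID to the (doc_id, position) pairs where it occurs.

    `documents` is assumed to be {doc_id: [cluster_id, ...]}, i.e. the output
    of a ligature classifier that is not shown here.
    """
    index = defaultdict(list)
    for doc_id, cluster_ids in documents.items():
        for position, cluster_id in enumerate(cluster_ids):
            index[cluster_id].append((doc_id, position))
    return index


def search(index, query_clusters):
    """Return {doc_id: [start positions]} for consecutive matches of the query.

    Assumption: a word match is a run of the query's ligature clusters at
    consecutive positions within one document.
    """
    if not query_clusters:
        return {}
    # Occurrences of the first ligature are the candidate word starts.
    candidates = set(index.get(query_clusters[0], []))
    for offset, cluster_id in enumerate(query_clusters[1:], start=1):
        occurrences = set(index.get(cluster_id, []))
        # Keep a candidate only if the next query ligature follows it.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in occurrences}
    hits = defaultdict(list)
    for doc_id, position in sorted(candidates):
        hits[doc_id].append(position)
    return dict(hits)


def precision_recall(retrieved, relevant):
    """Standard precision and recall over sets of document IDs."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: three documents already converted to cluster-ID sequences.
documents = {"doc1": [3, 7, 2, 3, 7], "doc2": [7, 3, 5], "doc3": [1, 3, 7, 9]}
index = build_index(documents)
hits = search(index, [3, 7])  # query word made of clusters 3 then 7
print(hits)
print(precision_recall(set(hits), {"doc1"}))
```

In this toy example the query matches twice in `doc1` and once in `doc3`, while `doc2` is rejected because its ligatures appear in the wrong order. Keeping positions in the index, rather than only document IDs, is what makes that ordering check possible.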
dc.language.iso	en	en_US
dc.publisher	Software Engineering, Bahria University Engineering School Islamabad	en_US
dc.relation.ispartofseries	MS SE;T-0680
dc.subject	Software Engineering	en_US
dc.title	Information Retrieval System for Digitized Urdu Documents (T-0680) (MFN 4242)	en_US
dc.type	MS Thesis	en_US