Abstract:
Clustering is a pivotal step in any Optical Character Recognition (OCR) or word spotting system. It serves as the basis for classifying and indexing words or characters, depending on the level of segmentation. Researchers have applied a variety of clustering methodologies to document images in Latin-based scripts. However, for the Urdu language, whose script belongs to the same family as Arabic and Persian, clustering-based indexing systems have not been extensively researched. In this paper, we present a comprehensive study of several well-known clustering techniques applied to printed Urdu document images. The images are segmented into ligatures or partial words, which are then grouped using different clustering methods. The performance of these methods is evaluated using the Calinski-Harabasz, Davies-Bouldin, and Dunn indices.
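
For readers unfamiliar with the three validity indices named above, the following is a minimal sketch of their standard textbook definitions in plain Python. It is not the implementation used in this study: the function names, the Euclidean distance, and the toy 2-D points standing in for ligature feature vectors are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def group_by_label(points, labels):
    """Return the clusters as lists of points, one list per distinct label."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return list(clusters.values())


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion; higher is better."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * math.dist(centroid(c), overall) ** 2 for c in clusters)
    within = sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    """Mean worst-case ratio of cluster scatter to centroid separation; lower is better."""
    clusters = group_by_label(points, labels)
    cents = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance of its points to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


def dunn(points, labels):
    """Smallest inter-cluster distance over largest cluster diameter; higher is better.

    Assumes at least one cluster holds two distinct points, so the diameter is non-zero.
    """
    clusters = group_by_label(points, labels)
    min_separation = min(math.dist(p, q)
                         for i, a in enumerate(clusters)
                         for b in clusters[i + 1:]
                         for p in a for q in b)
    max_diameter = max(math.dist(p, q) for c in clusters for p in c for q in c)
    return min_separation / max_diameter


# Toy data (hypothetical): two well-separated groups of 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good = [0, 0, 0, 0, 1, 1, 1, 1]   # labels that match the two groups
bad = [0, 1, 0, 1, 0, 1, 0, 1]    # labels that mix the two groups

for name, index in [("Calinski-Harabasz", calinski_harabasz),
                    ("Davies-Bouldin", davies_bouldin),
                    ("Dunn", dunn)]:
    print(name, index(points, good), index(points, bad))
```

On this toy data, the labeling that matches the two groups scores higher on the Calinski-Harabasz and Dunn indices and lower on the Davies-Bouldin index than the mixed labeling, which is the direction in which each index is read when comparing clustering methods.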