Abstract:
Stemming is one of the most important pre-processing steps in text mining, and it improves the performance of information retrieval (IR) systems. It is equally important in other research areas such as natural language processing (NLP) and text categorization. The main objective of stemming is to reduce the many grammatical forms of a word, for example forms that vary by part of speech, gender, or tense, to their stem or root form. Because of the rich morphological structure of the Urdu language, developing an Urdu stemmer for an information retrieval system is a challenging task. In this paper, we propose an effective rule-based stemming method for Urdu that copes with the challenges of Urdu morphology. The proposed stemmer generates the stems of Urdu words as well as borrowed words (words from other languages such as Arabic, Persian, and Turkish). The proposed method is compared with an existing Urdu stemming technique, the Light Weight Stemmer for Urdu Language, to demonstrate that it outperforms this competitor.