Framework for Classification of Urdu News based on their headlines (T-0683) (MFN 4230)

Kashif Ahmed, 01-244121-004

URI: http://hdl.handle.net/123456789/2866

Date: 2014

Abstract:

Automatic text classification has gained immense importance because of its many applications in data mining and information technology. It plays a vital role in tasks such as spam filtering, news classification, and noise reduction. A large body of work exists on classifying text, especially at the document level, in languages such as English (news classification) and Persian (text classification), but classification of short Urdu text or Urdu news headlines has not been carried out so far. To classify Urdu text, several pre-processing steps, such as stop-word removal, tokenization, and stemming, are of prime importance. After the required pre-processing, the desired features are selected and then classified using existing text classification methods such as SVM and Naive Bayes. In our proposed work, we have developed a system that classifies Urdu news headlines into one of a set of pre-defined classes. A systematic, module-based approach is proposed. In the first module, we perform basic pre-processing steps on the training data: splitting headlines into segments by tokenization, cleaning the data of diacritics and meaningless words through a text sanitization process, removing stop words using existing stop-word lists for Urdu, and stemming words using an existing generic stemming technique for Urdu. In the second module, an SVM-based model is learned from a feature vector generated by combining all words from each class after applying a threshold value. In the third and final module, unseen news headlines are pre-processed, and the headlines of the test data are assigned to the pre-defined classes using the maximum index of the feature vector: the word with the maximum index value in the feature vector is classified into that word's particular class. Experimental evaluation and results of our proposed system are presented in tabular form. To demonstrate the effectiveness of our proposed system, a competitor analysis has been carried out by deploying the competitor system on our self-generated datasets.
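
The pre-processing module and the thresholded word list described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the tiny stop-word set, the identity stemmer, and the default threshold of 2 are all hypothetical, and the SVM training step is left out.

```python
import unicodedata
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only; the thesis relies on
# existing Urdu stop-word lists that are not reproduced here.
STOP_WORDS = {"کی", "کے", "میں", "سے", "کا", "نے", "پر", "اور"}


def sanitize(token):
    """Remove combining marks (diacritics) and punctuation from one token."""
    return "".join(
        ch for ch in token
        if unicodedata.category(ch)[0] not in ("M", "P")
    )


def preprocess(headline, stem=lambda word: word):
    """Tokenize, sanitize, drop stop words, then stem.

    `stem` defaults to the identity function (an assumption); a real Urdu
    stemmer would be passed in here.
    """
    tokens = headline.split()                      # whitespace tokenization
    cleaned = (sanitize(token) for token in tokens)
    return [stem(t) for t in cleaned if t and t not in STOP_WORDS]


def build_vocabulary(labelled_headlines, threshold=2):
    """Combine the words of every class that occur at least `threshold`
    times within that class into one sorted word list."""
    per_class = {}
    for label, headline in labelled_headlines:
        per_class.setdefault(label, Counter()).update(preprocess(headline))
    kept = {
        word
        for counts in per_class.values()
        for word, n in counts.items()
        if n >= threshold
    }
    return sorted(kept)


def to_vector(headline, vocabulary):
    """Count how often each vocabulary word occurs in one headline."""
    counts = Counter(preprocess(headline))
    return [counts[word] for word in vocabulary]
```

Vectors produced by `to_vector` for the training headlines could then be passed, together with their class labels, to any SVM implementation; which implementation and which kernel the thesis used is not stated in the abstract.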