Abstract:
Automatic text classification has gained immense importance because of its many applications in
data mining and information technology. It plays a vital role in fields such as spam filtering,
news classification, and noise reduction. A large body of work exists on classifying text,
especially at the document level, in languages such as English (e.g., English news
classification) and Persian, but work on classifying short Urdu text or Urdu news
headlines has not been carried out so far. In order to classify Urdu text data, many
preprocessing steps, e.g., stop-word removal, tokenization, and stemming, are of prime
consideration. After the required preprocessing is performed, the desired features are selected and
then classified using existing text classification methods such as SVM and Naive Bayes.
In our proposed work, we have developed a system that classifies Urdu news headlines
into one of the predefined classes. A systematic, module-based approach is proposed. In the
first module, we perform basic preprocessing steps on the training data. This comprises
splitting headlines into segments by tokenization, cleaning the data of diacritics and
meaningless words through a text sanitization process, removing stop words using existing stop-word
lists for the Urdu language, and stemming words using an existing generic stemming
technique for the Urdu language. In the second module, an SVM-based model is learned using a
feature vector generated by combining all words from each class after applying a threshold value.
In the third and last module, unseen news headlines are preprocessed, and the Urdu headlines of the
test data are classified into the predefined classes by utilizing the maximum index of the feature
vector. The word with the maximum index value in the feature vector is classified to that word's
particular class. The experimental evaluation and results of our proposed system are presented in
tabular form. To demonstrate the effectiveness of our proposed system, a competitor analysis has
been performed by deploying the competitor system on our self-generated datasets.
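
As an illustration of the first module and of the thresholded vocabulary described above, the following is a minimal Python sketch, not the system's actual implementation. The function names, the tiny stop-word set, and the `min_count` threshold are all hypothetical. The stemming step is omitted, and the SVM training itself is not shown. Sanitization is approximated by keeping only alphabetic characters, which drops combining diacritics, digits, and punctuation.

```python
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only; a real system
# would load one of the existing Urdu stop-word lists instead.
STOP_WORDS = {"نے", "کی", "کے", "کا", "کو", "میں", "سے"}


def tokenize(headline):
    """Split a headline into whitespace-separated segments."""
    return headline.split()


def sanitize(token):
    """Keep only alphabetic characters.

    Combining diacritics, digits, and punctuation are not alphabetic,
    so they are removed from the token.
    """
    return "".join(ch for ch in token if ch.isalpha())


def preprocess(headline):
    """Tokenize, sanitize, and remove stop words (stemming is omitted)."""
    tokens = (sanitize(t) for t in tokenize(headline))
    return [t for t in tokens if t and t not in STOP_WORDS]


def build_vocabulary(headlines_by_class, min_count=2):
    """Combine, across classes, the words that occur at least `min_count`
    times within a class. This is one assumed reading of the threshold step."""
    vocabulary = set()
    for headlines in headlines_by_class.values():
        counts = Counter(w for h in headlines for w in preprocess(h))
        vocabulary.update(w for w, c in counts.items() if c >= min_count)
    return sorted(vocabulary)


def to_feature_vector(headline, vocabulary):
    """Represent a headline as word counts over the combined vocabulary."""
    counts = Counter(preprocess(headline))
    return [counts[w] for w in vocabulary]
```

Vectors produced by `to_feature_vector` could then be passed to any SVM implementation for training and prediction. The choice of raw counts as feature values is an assumption of this sketch.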