Abstract:
Over the last few years, researchers have been increasingly drawn to sentiment analysis,
and especially to hate speech detection, because the proliferation of hate speech in
many languages has become a significant concern on social media.
Hate speech has a great impact on society, and the use of hateful words harms the dignity
of others. Detecting hate speech is important for preventing hateful words from
escalating into crimes. In this research, we develop a framework for hate speech
detection in the Pashto language. Because no related dataset was available, we created a
corpus from data collected from Twitter. Most research in this domain has been done for
other languages, where hate speech detection is already quite mature. For
morphologically rich languages, however, little work has been done, especially for
Pashto.
In this research, we collected tweets from Twitter related to ethnicity and religion.
The collected data was annotated manually, and each tweet was categorized as hate or
not hate by comparing it against offensive content.
To examine the impact of different features/attributes on hate speech detection, we
performed experiments with existing classifiers, i.e., SVM, Naïve Bayes, Decision Tree,
and KNN.
On the dataset of 500, SVM produced the highest result among all the classifiers, i.e., 74%.
On the dataset of 1500, KNN and Decision Tree produced the same result, i.e., 65.0%. On the
dataset of 2800, Decision Tree produced the highest result, i.e., 72%, and SVM produced 71.9%.
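
The kind of experiment described above (train a classifier on annotated tweets, then score its predictions) can be illustrated with a minimal sketch. The abstract does not state which features, preprocessing, or implementation were used, so everything below is an assumption for illustration: whitespace tokenization, bag-of-words counts, a hand-written multinomial Naïve Bayes with add-one smoothing, and a toy dataset of placeholder tokens (`insult_a`, `group_x`, and so on) standing in for real Pashto tweets. It covers only one of the four classifiers named above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the manual annotation.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy data: placeholder tokens, not real tweets.
train_texts = [
    "insult_a group_x",
    "group_x insult_b",
    "insult_a insult_b group_y",
    "greeting group_x",
    "news group_y today",
    "greeting friend today",
]
train_labels = ["hate", "hate", "hate", "not_hate", "not_hate", "not_hate"]
test_texts = ["insult_a group_y", "greeting today"]
test_labels = ["hate", "not_hate"]

model = NaiveBayes().fit(train_texts, train_labels)
preds = model.predict(test_texts)
print(preds, accuracy(test_labels, preds))
```

In practice, a comparison across SVM, Naïve Bayes, Decision Tree, and KNN would more likely rely on an established machine learning library and a held-out split of the annotated corpus. The sketch only shows the shape of the train, predict, and score loop.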