Abstract:
Hate speech, characterized by its cruelty, intentionality, and recurrent nature within online social networks, is a critical concern in contemporary literature. The widespread popularity of social media platforms such as Facebook and Twitter has given users unprecedented freedom to share content with few constraints. While numerous hate speech detection methods have been proposed for languages such as English, Arabic, Dutch, Hindi, and Marathi, Roman Urdu remains underexplored. This study addresses that gap by introducing a hate speech detection method tailored to Roman Urdu. We begin by crawling Twitter data and normalizing it with preprocessing techniques. We then employ twelve machine learning classifiers: Logistic Regression, Random Forest, Decision Tree, SVM, Multinomial Naïve Bayes, K-NN, Extra Trees Classifier, AdaBoost Classifier, Nearest Centroid, SGD Classifier, K-neighbors, and Gradient Boosting Classifier. The classifiers are trained on our balanced binary-class dataset of 8754 instances, labeled as bullied (1) or non-bullied (0). Features are extracted using TF-IDF. Each of the twelve classifiers is trained on this benchmark dataset and evaluated using accuracy, precision, recall, and F1 score. We additionally apply 5-fold cross-validation to validate classifier performance, and we observe that K-fold variation influences the results. Among the classifiers evaluated, the AdaBoost Classifier performs best, achieving an accuracy of 0.75 in the primary implementation and a closely matching 0.751 under K-fold cross-validation.
The AdaBoost Classifier's consistent accuracy across both evaluations indicates stable predictive performance and makes it a promising tool for hate speech detection in Roman Urdu.
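As a rough illustration of two steps named above, TF-IDF feature extraction and 5-fold splitting, the following minimal Python sketch uses only the standard library. It is not the authors' implementation. The function names `tfidf` and `k_fold_indices`, the whitespace tokenizer, the particular TF-IDF variant (term frequency times log(N / df)), and the contiguous, unshuffled folds are all assumptions made for illustration; the abstract does not specify these details, and in practice a machine learning library would typically handle both steps.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / doc_length) * log(N / df).
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


if __name__ == "__main__":
    # Placeholder strings, not drawn from the study's dataset.
    sample_docs = ["word1 word2 word3", "word1 word4", "word5 word2 word2"]
    for vector in tfidf(sample_docs):
        print(vector)
    # Fold sizes for a dataset of 8754 instances split five ways.
    print([len(test) for _, test in k_fold_indices(8754, k=5)])
```

With 8754 instances, a five-way split of this kind yields four test folds of 1751 instances and one of 1750, so each classifier is trained on roughly 80% of the data in every round. A real pipeline would normally shuffle or stratify the instances before splitting so that each fold preserves the class balance.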