Abstract:
Urdu is spoken by almost 30 million people around the world, yet it has still not
received due attention. It is also used in regions where the number of broadband
users is increasing rapidly. People in Pakistan tend to express their feelings towards
an entity in Urdu, Roman Urdu or English. Several tools and techniques are
available for performing sentiment analysis, but as the volume of data keeps growing,
so does the uncertainty in that data.
In this research, we present a system that performs sentiment analysis on Urdu,
Roman Urdu and English, the three languages most widely used for comments on
Pakistani products.
To extract the sentiment expressed in the text, we wanted a model that could handle
comments or reviews written in any of the three languages. People often write their
comments in Roman Urdu, which is Urdu written in Roman script. The challenge with
Roman Urdu is that it has no defined structure, lexicon or grammar. Therefore, to
process the comments, we converted all reviews and comments to English and then
applied sentiment analysis models to the translated text.
We used multiple classifiers for this purpose: Naïve Bayes, Random Forest and
SVM. Finally, we discuss the performance of the classifiers and conclude that SVM
outperformed the others.
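
As an illustration of the classification step, the following is a minimal sketch of a
multinomial Naïve Bayes classifier with add-one smoothing, applied to comments that
are assumed to be already translated into English. The training sentences, the labels
and the function names (tokenize, train_naive_bayes, predict) are hypothetical and
are not taken from the system described here. Random Forest and SVM are not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: comments assumed to be already translated to English.
TRAIN = [
    ("this phone is very good", "positive"),
    ("great quality and good price", "positive"),
    ("very bad battery", "negative"),
    ("poor quality and bad service", "negative"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict(model, "good quality phone"))
print(predict(model, "bad battery"))
```

In practice the same translated text would be passed to each classifier under
comparison, so that differences in performance reflect the classifiers rather than
the preprocessing.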