Abstract:
Analysing Twitter user data, more specifically public messages or tweets, can be very
useful in monitoring diseases and infections worldwide. Diseases, identified by
specific symptoms and signs and classified as medical conditions, if left unchecked,
can cause severe damage to an area’s resources and populace. Moreover, keeping
people aware of potential dangers or epidemics can reduce widespread panic and
prepare them for times of crisis. Detecting disease outbreaks at an early stage would
be an incredibly useful tool for the health sector, and it is now feasible because
people post about their problems on social media and in text messages, generating a
massive amount of data every day. First, our system collects health-related tweets
using the Twitter API
and filters, cleans and tags those tweets to create a functional, usable dataset.
Essentially, this means that only tweets mentioning diseases and containing proper
locations are added to our dataset. We used both the SVM and Naive Bayes algorithms
for data classification and tagging, which resulted in accuracy rates of 82% and 75%
respectively. The TF-IDF vectorizer was used for feature extraction in both of these
algorithms. For map plotting and visualization, we used a simple HTML/CSS page
with JavaScript to show our findings on a map and make the results more readable
and interactive. We used Flask, a third-party extensible web microframework for
Python, to build the final application. Our research shows that Twitter data
has many applications for public health research.
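
The filtering step described above (keeping only tweets that mention a disease and carry a usable location) might look roughly like the following. This is a minimal sketch, not the system's actual code: the keyword list, the function names, and the simplified tweet records with "text" and "location" fields are all assumptions made for illustration, and the real Twitter API payload is richer than this.

```python
import re

# Hypothetical keyword list; the abstract does not say which diseases are tracked.
DISEASE_KEYWORDS = {"flu", "dengue", "malaria", "cholera"}


def clean_text(text):
    """Lowercase the text, drop URLs, @mentions and '#' signs, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    return re.sub(r"\s+", " ", text).strip()


def tag_diseases(text):
    """Return the sorted disease keywords that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text))
    return sorted(words & DISEASE_KEYWORDS)


def build_dataset(raw_tweets):
    """Keep only tweets that mention a disease and have a non-empty location."""
    dataset = []
    for tweet in raw_tweets:
        location = (tweet.get("location") or "").strip()
        cleaned = clean_text(tweet.get("text", ""))
        diseases = tag_diseases(cleaned)
        if diseases and location:
            dataset.append(
                {"text": cleaned, "location": location, "diseases": diseases}
            )
    return dataset


# Made-up sample records, used only to exercise the filter.
sample = [
    {"text": "Down with the #flu again http://t.co/x", "location": "Dhaka"},
    {"text": "Malaria cases rising @who", "location": ""},
    {"text": "Great weather today", "location": "London"},
]
dataset = build_dataset(sample)
```

With these sample records only the first tweet survives: the second has no location and the third mentions no disease. A production filter would likely need a larger vocabulary and proper location validation, which this sketch does not attempt.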