| dc.description.abstract |
Forecasting political events from a variety of information sources has become common practice in the era of Big Data. Various studies have found that analyzing data collected from multiple sources is effective for predicting election outcomes. It overcomes the limitations of an individual data source, which can be biased or may not be a representative sample of society. This study presents a data fusion based approach to predicting election results that uses data from a variety of sources, including historical election results, opinion polls, and Twitter as the social media data source. A forecasting model for the practical and timely prediction of election results is developed using a dataset prepared from the fusion of multiple data sources. The forecasting model is composed of two main processes: first, the preparation of the dataset from multiple sources, and second, an approach to analyse the fused dataset. A data fusion pipeline is provided for the preparation of the dataset. First, the historical vote share for the candidates of each party in a district was computed and fed into the model using a feed-forward approach. Second, the parameters from the other sources were computed in real time prior to each election. These parameters included a survey score from the aggregation of various opinion polls and a popularity score for each party from the tweets dataset. The predictive power of the fused dataset was analyzed using machine learning algorithms to forecast the vote share of each candidate at the constituency level. Our study hypothesized that fusing more data sources improves the accuracy and reliability of election predictions. The proposed model was evaluated on the fused dataset with state-of-the-art machine learning algorithms, using multiple linear regression (MLR) as the baseline model. The baseline model performed well enough to support the hypothesis of the study.
Other advanced regression algorithms, including ensemble learning, were then applied: StackingCVRegressor, a stacking of RidgeCV, SVR, and XGBoost optimized with SVR as the meta-regressor. The stacked regression model outperformed MLR, RidgeCV, SVR, and XGBoost, reducing RMSE by 7% in comparison with the baseline MLR. The proposed model was applied to predict the results of the Pakistan General Elections 2018 at the first attempt. The prediction results show the reliability of our data fusion process. The analysis shows that fusing more data sources gave better results than an individual data source or a fusion of fewer data sources. Our model predicts the vote share at the national level as well as the constituency level. It predicted an overall 85% vote share for all the political parties at the national level. Our model correctly predicted 177 out of 270 seats for the major political parties, which is 27 seats more than a recent similar study. The proposed model also outperforms the opinion polls and results from Twitter statistics. However, it could not predict the winning candidates where social media data was not available. |
en_US |