Abstract:
In recent years, the rapid expansion of Virtual Learning Environments (VLEs) and online education platforms has significantly transformed higher education, while also introducing new challenges such as increased student attrition and disengagement. Predicting student performance within these digital environments is essential for enabling timely interventions and improving academic outcomes. This thesis addresses the prediction of high-risk students using the Open University Learning Analytics Dataset (OULAD), a comprehensive benchmark comprising demographic, assessment, and detailed engagement data for over 32,000 students. To address gaps in the literature, particularly the lack of systematic comparisons across multiple feature selection techniques, this study develops a machine learning pipeline that benchmarks three distinct feature selection strategies: Particle Swarm Optimization (PSO), Gini-based importance, and Factor Analysis of Mixed Data (FAMD). Four primary classifiers, namely Random Forest (RF), Gradient Boosting (GB), Logistic Regression (LR), and Artificial Neural Networks (ANN), were evaluated alongside ensemble and stacked models under stratified cross-validation. The results show that PSO reduced the feature space by approximately 40% while maintaining high predictive performance (F1 = 0.937, AUC = 0.980). Gini importance identified a compact set of 10 critical features that achieved comparable metrics, whereas FAMD, although efficient, showed slightly lower predictive strength. Overall, the best outcomes were achieved with ensemble and stacked architectures, which reached accuracies approaching 94% and AUC scores near 0.981. These results are comparable to advanced deep learning benchmarks while offering improved interpretability.
This work contributes a balanced, explainable framework for educational early warning systems, bridging methodological gaps in comparative feature selection and offering practical insights for scalable deployment in online learning contexts. Unlike prior studies that focused on a single feature selection method or on deep learning models, this thesis introduces a comparative framework that integrates multiple feature selection methods (PSO, Gini importance, and FAMD) with ensemble machine learning pipelines. This systematic benchmarking approach provides new insights into the trade-offs between accuracy, interpretability, and scalability, making it a novel contribution to educational predictive analytics.