WEB APPLICATION VULNERABILITY ANALYSIS USING MACHINE LEARNING

Marrium Mehmood, 01-241201-009

URI: http://hdl.handle.net/123456789/14455

Date: 2022

Abstract:

The substantial growth of web-based applications has changed business perspectives, and web applications have become almost essential to every business that conducts online transactions. This has paved the way for cyber-attacks in which attackers exploit vulnerabilities in online systems to gain illegal access. To protect web applications, different types of detection systems based on machine learning have been proposed. Most of these systems detect only a few, outdated attacks, because this area of study lacks a good labelled dataset containing modern attacks. This thesis proposes a payload-based web attack detection system that covers modern attacks as listed by OWASP and NIST. The proposed system detects web attacks by analyzing the request payload. It is designed in two stages: pre-processing and processing. The pre-processing stage consists of dataset creation, feature extraction and feature selection. To evaluate the system experimentally, we combined our payload-based dataset with the publicly available HTTP Param dataset. We implemented an automatic feature extraction technique that uses a TF-IDF vectorizer to extract features from the payload and enhance performance. Three types of n-grams (unigram, bigram and trigram) are used separately and the results are analyzed. We implemented four feature selection techniques to obtain the best feature subset: correlation-based feature selection (CFS), mutual information, random forest importance and Principal Component Analysis (PCA). The processing stage comprises multiple classifiers, implemented in order to compare their results and performance. In particular, we implemented four machine learning classifiers (decision tree, random forest, logistic regression and K-Nearest Neighbor), fed different feature subsets to them as input and analyzed the results. We performed multiclass classification of attacks and evaluated the results using multiple evaluation measures. In addition, the results of the best-performing model are validated using 10-fold cross-validation. Our system detects eight vulnerabilities: SQL injection, Cross-Site Scripting, XML external entities, command injection, open redirect, carriage return and line feed (CRLF) injection, path traversal and file inclusion. It also identifies normal requests. The comparative analysis of the experimental results demonstrates that embedded methods applied to bigram-extracted features with the random forest classifier achieve the highest scores: 99.48% accuracy, 98.66% precision, 96.50% recall and 97.41% F1-score.
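
As an illustration of the TF-IDF n-gram feature extraction described in the abstract, the following is a minimal Python sketch and not the thesis implementation. The tokenizer, the helper names (tokenize, ngrams, tfidf), the smoothed IDF formula and the sample payloads are all assumptions made for this example; the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(payload):
    """Hypothetical tokenizer: lowercase the payload, then split it into
    alphanumeric runs and single punctuation characters."""
    return re.findall(r"[a-z0-9_]+|[^\sa-z0-9_]", payload.lower())


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(payloads, n):
    """Compute one sparse TF-IDF vector (a dict) per payload.

    Assumed weighting: term frequency normalized by document length,
    multiplied by a smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    """
    docs = [Counter(ngrams(tokenize(p), n)) for p in payloads]

    # Document frequency: in how many payloads each n-gram occurs.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    total = len(docs)
    vectors = []
    for doc in docs:
        length = sum(doc.values())
        vectors.append({
            term: (count / length) * (math.log((1 + total) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


if __name__ == "__main__":
    # Illustrative payloads only; they are not taken from the thesis dataset.
    samples = [
        "id=1' OR '1'='1",              # SQL injection style
        "q=<script>alert(1)</script>",  # cross-site scripting style
        "file=../../etc/passwd",        # path traversal style
        "q=shoes&page=2",               # normal request
    ]
    for n in (1, 2, 3):  # unigram, bigram, trigram, each used separately
        vectors = tfidf(samples, n)
        print(n, [len(v) for v in vectors])
```

Each resulting dictionary maps an n-gram to its weight for one payload. In a setup like the one the abstract describes, such vectors would then pass through feature selection and into a classifier.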