Abstract:
The substantial growth of web-based applications has changed how businesses operate, and
web applications have become almost essential to every business for online transactions.
This growth has paved the way for cyber-attacks in which attackers exploit vulnerabilities
in online systems to use those systems illegally. To protect web applications,
various detection systems based on machine learning have been proposed. Most of these
systems detect only a few, outdated attacks because this area of study lacks a good labelled
dataset containing modern attacks. This thesis proposes a payload-based web attack detection system
that covers modern attacks as identified by OWASP and NIST. The proposed system detects
web attacks by analyzing the payload. The system is designed in two stages: pre-processing
and processing. The pre-processing step consists of dataset creation, feature extraction and
feature selection. To evaluate the system experimentally, we used the publicly available
HTTP Param dataset in addition to our own payload-based dataset. We implemented an
automatic feature extraction technique that uses a TF-IDF vectorizer to extract features
from the payload and improve performance. Three types of n-grams (unigrams, bigrams and
trigrams) are used separately, and the results are analyzed. We implemented four feature
selection techniques to obtain the best feature subset: correlation-based feature selection
(CFS), mutual information, random forest importance and principal component analysis (PCA). The
processing step implements multiple classifiers in order to compare their results and
performance. In particular, we implemented four machine learning classifiers (decision
tree, random forest, logistic regression and k-nearest neighbors), fed them the different
feature subsets as input and analyzed the results. We performed multiclass classification
of attacks and evaluated the results using multiple evaluation measures. In addition, the
results of the best-performing model are cross-validated using 10-fold cross-validation.
Our system detects eight types of vulnerability: SQL injection, cross-site scripting,
XML external entities, command injection, open redirect, carriage return and line feed
(CRLF) injection, path traversal and file inclusion; it also recognizes normal requests. The
comparative analysis of the experimental results shows that embedded feature selection
methods applied to bigram features, combined with a random forest classifier, achieve the
highest scores: 99.48% accuracy, 98.66% precision, 96.50% recall and 97.41% F1-score.
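
As an illustration of the feature extraction step described above, the following is a minimal sketch of n-gram TF-IDF weighting over payload strings. It is not the thesis implementation. It assumes character-level n-grams (the abstract does not say whether n-grams are taken over characters or tokens) and uses one common smoothed IDF variant. The function names and the toy payloads are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def char_ngrams(payload, n):
    """Return the overlapping character n-grams of a payload string."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def tfidf_vectors(payloads, n=2):
    """Map each payload to a dict {ngram: TF-IDF weight}.

    Assumption: TF is the relative frequency of the n-gram in the payload,
    and IDF is the smoothed form ln((1 + N) / (1 + df)) + 1.
    """
    counts = [Counter(char_ngrams(p, n)) for p in payloads]

    # Document frequency: number of payloads that contain each n-gram.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    total = len(payloads)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1  # avoid division by zero for short payloads
        vectors.append({
            gram: (k / length) * (math.log((1 + total) / (1 + df[gram])) + 1.0)
            for gram, k in c.items()
        })
    return vectors


# Toy payloads, invented for illustration only.
payloads = [
    "id=1' OR '1'='1",
    "q=<script>alert(1)</script>",
    "page=../../etc/passwd",
    "id=42",
]
vectors = tfidf_vectors(payloads, n=2)
```

Setting n to 1, 2 or 3 gives the unigram, bigram and trigram variants. N-grams that occur in few payloads receive a higher weight than n-grams shared by many payloads, and the resulting sparse vectors are what a feature selection step and a classifier would then consume.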