Abstract:
The Quran is the sacred book of Muslims and a source of guidance for them. It discusses topics related to both religion and worldly affairs, by which Quranic verses can be categorized. Some topics in the Quran are discussed extensively whereas others are covered only briefly; the dataset is therefore imbalanced. Hence, the data is first balanced by creating synthetic samples of the minority class. Several verses use the same words to depict different concepts, whereas others use different words to convey similar meanings; it is therefore important to classify verses on the basis of their context. Previously, Tafsir and Hadith data have been used to better capture this context, which makes the classification of Quranic verses dependent on additional corpora. Other techniques, such as Word2Vec and GloVe word embeddings, have also been used, but they have the limitation of ignoring rare words and word position during classification. This study aims to classify the verses according to their topics by considering the context of words using Bidirectional Encoder Representations from Transformers (BERT). While creating the representation of a word, BERT reads all of its neighboring words and assigns the representation accordingly. It produces a 3-dimensional embedding tensor, assigning a 768-dimensional representation to each token. Furthermore, to ensure that the classifier retains the most important parts of the input sequence, deep learning classifiers with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers are used for classification. Once the BERT cased and uncased word embeddings of the text data are created, they are fed to three neural network (NN) models: an NN with LSTM, which achieved F1-scores of 0.87 for uncased and 0.86 for cased embeddings; an NN with GRU, which achieved F1-scores of 0.91 for uncased and 0.90 for cased embeddings; and a fine-tuned BERT model, which achieved an F1-score of 0.93 for both base-uncased and base-cased.
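The abstract does not name the balancing method, but "creating synthetic samples of the minority class" is commonly done with SMOTE-style interpolation between a minority sample and one of its minority-class neighbours. A minimal sketch in plain Python, assuming that approach (the function name and toy feature vectors below are illustrative, not from the paper):

```python
import random

def smote_like_oversample(minority, n_new, seed=0):
    """Create synthetic minority-class samples by interpolating between
    a real sample and its nearest minority-class neighbour (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbour by squared Euclidean distance, excluding base itself.
        neighbour = min(
            (s for s in minority if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy 2-D feature vectors standing in for verse embeddings.
minority = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]]
new_samples = smote_like_oversample(minority, n_new=4)
```

Each synthetic point lies on the segment between two real minority samples, so the oversampled class stays inside the region the original data occupies.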
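The gating mechanism that lets a GRU retain the most important parts of the input sequence can be sketched without any deep learning framework. The sketch below runs a single GRU cell over a token sequence with randomly initialised, untrained weights; the dimensions are kept tiny for readability, whereas in the study each token would arrive as a 768-dimensional BERT vector:

```python
import math
import random

def gru_sequence(tokens, hidden_size, seed=0):
    """Run a single-layer GRU over a sequence of token embeddings and
    return the final hidden state (a summary of the whole sequence)."""
    rng = random.Random(seed)
    input_size = len(tokens[0])

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    def matvec(m, v):
        return [sum(w * x for w, x in zip(row, v)) for row in m]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Randomly initialised gate weights; training is omitted in this sketch.
    Wz, Uz = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
    Wr, Ur = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
    Wh, Uh = mat(hidden_size, input_size), mat(hidden_size, hidden_size)

    h = [0.0] * hidden_size
    for x in tokens:
        z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h))]  # update gate
        r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h))]  # reset gate
        h_tilde = [math.tanh(a + b) for a, b in
                   zip(matvec(Wh, x),
                       matvec(Uh, [ri * hi for ri, hi in zip(r, h)]))]
        # The update gate decides how much old state to keep vs. overwrite,
        # which is how the network "remembers" the important parts.
        h = [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]
    return h

# Five tokens with 8-dimensional embeddings (768 in the actual study).
tokens = [[random.Random(i).uniform(-1, 1) for _ in range(8)] for i in range(5)]
final_state = gru_sequence(tokens, hidden_size=4)
```

In the paper's classifiers this final state (or the full hidden sequence) would feed a dense output layer that predicts the verse topic; here it simply illustrates the update/reset gating that distinguishes a GRU from a plain recurrent cell.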