Abstract:
The advancement of technology has led to an increasing number of items becoming digital. Courts now generate large amounts of unstructured data each day, including legal data. Digitizing this content can benefit court petitioners, attorneys, and law students in a variety of ways. Legal judgment prediction addresses this need by making it easy to retrieve relevant information from a large body of text. Researchers base such predictions on case outcomes, similarities to criminal cases, case texts, and related writings. However, legal data differs from ordinary data in vocabulary, language use, and other respects, and legal facts carry strong semantic dependencies within the text that must be preserved when predicting judgments. To capture these semantic dependencies, we selected specialized NLP techniques. With these problems in mind, the proposed work constructs a legal dataset with two parts (two files): testing data comprising 70 files and training data comprising 2,878 files. For training we used data from AILA 2019 (Indian Supreme Court cases), and for testing we used judgments of the Supreme Court of Pakistan. Because this data was unstructured, we first labeled it and divided it into multiple categories (columns); the dataset was appropriately labeled, categorized, and segmented based on the given information, yielding a structured format. The resulting columns are: court name, petition number, title, date, facts, issue, decision and holdings, separate opinions, analysis, and results. We used Power BI, Tableau, and Jupyter Notebook (Python). For prediction we employed machine learning and natural language processing (NLP) methods, adopting a hybrid approach that combines the XGBoost classifier, SVM, Random Forest, Decision Tree, linear regression, and Multinomial Naive Bayes with TF-IDF and Word2Vec as word-embedding techniques. Among all the applied models, the gradient boosting classifier achieved the best accuracy. We applied TF-IDF to the data first, followed by TF-IDF with n-grams, which yielded accuracies between 0.689 and 0.77. To increase accuracy and capture richer semantic meaning, we then employed the Word2Vec model for word embedding, which yielded accuracies between 0.80 and 0.86 across all applied classifiers. We report results using accuracy, F1-score, precision, and recall. A limitation of the proposed study is that we did not categorize the judgments by case type (criminal cases, test cases, etc.); this is left for future work.
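As a rough illustration of the hybrid pipeline summarized above, the sketch below builds TF-IDF features (with n-grams) and averaged Word2Vec document vectors, then evaluates several of the named classifiers with accuracy, precision, recall, and F1-score. It is a minimal sketch, assuming scikit-learn, xgboost, and gensim are available; the file name "train.csv", the column names "facts" and "decision", and the single random train/test split are illustrative assumptions, not the authors' actual schema or the separate AILA/Pakistan train and test sets described in the paper.

```python
# Minimal sketch of the hybrid TF-IDF / Word2Vec + classifier pipeline.
# File name and column names below are illustrative assumptions.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

df = pd.read_csv("train.csv")                      # labeled, structured legal data
X_text = df["facts"].astype(str)
y = LabelEncoder().fit_transform(df["decision"].astype(str))
X_tr, X_te, y_tr, y_te = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Feature set 1: TF-IDF with n-grams (unigrams + bigrams).
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)
Xtr_tfidf = tfidf.fit_transform(X_tr)
Xte_tfidf = tfidf.transform(X_te)

# Feature set 2: Word2Vec embeddings, averaged per document.
tokens_tr = [t.lower().split() for t in X_tr]
w2v = Word2Vec(sentences=tokens_tr, vector_size=100, window=5, min_count=2)

def doc_vector(text, model):
    """Average the Word2Vec vectors of the in-vocabulary tokens."""
    words = [w for w in text.lower().split() if w in model.wv]
    return np.mean(model.wv[words], axis=0) if words else np.zeros(model.vector_size)

Xtr_w2v = np.vstack([doc_vector(t, w2v) for t in X_tr])
Xte_w2v = np.vstack([doc_vector(t, w2v) for t in X_te])

feature_sets = {
    "TF-IDF (1-2 grams)": (Xtr_tfidf, Xte_tfidf),
    "Word2Vec (averaged)": (Xtr_w2v, Xte_w2v),
}
classifiers = {
    "XGBoost": XGBClassifier,
    "SVM": LinearSVC,
    "Random Forest": RandomForestClassifier,
    "Decision Tree": DecisionTreeClassifier,
    "Multinomial NB": MultinomialNB,
}

for feat_name, (Xtr, Xte) in feature_sets.items():
    for clf_name, Clf in classifiers.items():
        if clf_name == "Multinomial NB" and feat_name.startswith("Word2Vec"):
            continue  # MultinomialNB requires non-negative features
        clf = Clf()
        clf.fit(Xtr, y_tr)
        pred = clf.predict(Xte)
        prec, rec, f1, _ = precision_recall_fscore_support(y_te, pred, average="weighted")
        print(f"{feat_name} + {clf_name}: acc={accuracy_score(y_te, pred):.3f} "
              f"prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

The loop reports the same four metrics used in the paper for each feature/classifier pair, which makes it easy to compare the TF-IDF n-gram features against the averaged Word2Vec embeddings in the way the reported 0.689-0.77 versus 0.80-0.86 accuracy ranges suggest.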