Data Mining and Text Classification

Zarif, Qureshi Afnan Reg # 11959; Jamil, Muhammad Sufyan Reg # 11998

dc.contributor.author	Zarif, Qureshi Afnan Reg # 11959
dc.contributor.author	Jamil, Muhammad Sufyan Reg # 11998
dc.date.accessioned	2017-06-20T05:13:11Z
dc.date.available	2017-06-20T05:13:11Z
dc.date.issued	2011-11
dc.identifier.uri	http://hdl.handle.net/123456789/1857
dc.description	Supervised by Rauf Ahmed Shams Malick	en_US
dc.description.abstract	Over the past decades there has been enormous growth in the data available on the Web and in digital libraries, technical documentation, medical records, and similar sources. This growth creates ambiguity in the data and confuses content management, which makes content classification an essential need on the Web. Classification was, and often still is, carried out manually, which costs considerable effort, reduces productivity, and consumes valuable time. We began by surveying the work already done on this problem, studying the Information Retrieval (IR) and classification techniques proposed for such scenarios. The list was discouragingly long, though it soon clarified boundaries that had been fuzzy, and the hierarchy of techniques was full of complicated loops. We therefore focused on a selected set of proposals that were highlighted in discussion forums and reviews. Within the limits of the available time and literature, we were able to arrive at a workable solution. The system reduces effort and saves time by providing an automated means of content classification, identifying similar content and the preferred classes of text in large textual datasets. The general idea is to distinguish among the categories of textual datasets, which may take various forms (web pages are the prime target of this project). The project supports data categorization and better content management than before.
The main techniques used in this solution are: SVM (Support Vector Machine), for training the system on particular topics so that it can make sound classification decisions; TF-IDF (Term Frequency-Inverse Document Frequency), which supports the learning engine; a stemmer, which reduces document complexity by trimming words to their base or root form; the K-Means, STC, and Lingo algorithms, which are used to identify groups of similar documents; and Lucene, an open-source tool that indexes documents for fast content retrieval.	en_US
dc.language.iso	en_US	en_US
dc.publisher	Bahria University Karachi Campus	en_US
dc.title	Data Mining and Text Classification	en_US
dc.type	Thesis	en_US
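
The abstract names TF-IDF as the term weighting that feeds the learning engine. The following is a minimal sketch of that weighting only, not the thesis authors' implementation. The function name tf_idf, the toy documents, and the specific formulas (tf as count divided by document length, idf as log(N / df)) are assumptions chosen for illustration; the thesis may use a different variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulas (illustrative, not taken from the thesis):
    tf = term count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized and lower-cased.
docs = [
    "web page classification".split(),
    "web page indexing".split(),
    "support vector machine".split(),
]
weights = tf_idf(docs)
```

In this toy corpus a term that occurs in a single document ("classification") receives a higher weight than one shared by two documents ("web"), which is the property that makes the weighting useful as input to a classifier such as an SVM.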