Data Mining and Text Classification

dc.contributor.author Zarif, Qureshi Afnan Reg # 11959
dc.contributor.author Jamil, Muhammad Sufyan Reg # 11998
dc.date.accessioned 2017-06-20T05:13:11Z
dc.date.available 2017-06-20T05:13:11Z
dc.date.issued 2011-11
dc.identifier.uri http://hdl.handle.net/123456789/1857
dc.description Supervised by Rauf Ahmed Shams Malick en_US
dc.description.abstract Over the past decades there has been enormous growth in the data available on the Web and in digital libraries, technical documentation, medical records, and similar sources. This growth has made data ambiguous and content management confused, which makes organizing content an essential need on the Web. The job has been, and still is, carried out manually, at the cost of considerable effort, reduced productivity, and much valuable time. We began by reviewing the work already done on this problem and by studying the IR (Information Retrieval) and classification techniques proposed for such scenarios. It was a discouraging list, and it soon became clear that the boundaries between techniques were fuzzy and the hierarchy was full of complicated loops. We therefore decided to concentrate on a few selected proposals that had been highlighted in discussion forums and reviews. Within the limits of the available time and literature, we were able to arrive at a workable solution. The system reduces effort and saves time by classifying content intelligently: it reveals identical content and preferred classes of text in large textual datasets. The general idea is to distinguish among the different categories of textual data, which may come in various forms (Web pages are the prime target of this project). The project helps categorize data and improves content management.
The main techniques used in this solution are: SVM (Support Vector Machine), for training the system on particular topics so that it can make sound classification decisions; TF-IDF (Term Frequency-Inverse Document Frequency), which supports the learning engine; a stemmer, which reduces the complexity of documents by trimming words to their base/root form; K-Means, STC, and Lingo, algorithms used to identify identical classes of documents; and Lucene, an open-source tool that indexes documents for fast content retrieval. en_US
dc.language.iso en_US en_US
dc.publisher Bahria University Karachi Campus en_US
dc.title Data Mining and Text Classification en_US
dc.type Thesis en_US
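
As an illustration of the TF-IDF weighting named in the abstract, the following is a minimal sketch in Python. It is not the thesis's implementation, which the abstract does not describe in detail. The function names, the whitespace tokenizer, the unsmoothed log(N/df) weighting, the cosine-similarity comparison, and the toy documents are all assumptions made for this example; stemming, SVM training, and Lucene indexing are left out.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming in this sketch)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This unsmoothed form is an assumption; other variants exist.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, not data from the thesis.
docs = [
    "football match score goal",
    "football league goal keeper",
    "stock market share price",
]
vecs = tfidf_vectors(docs)
```

With these toy documents, cosine(vecs[0], vecs[1]) is positive because the two sports texts share terms, while cosine(vecs[0], vecs[2]) is zero because the sports and finance texts share none. Vectors like these are the kind of input a classifier such as an SVM or a clustering algorithm such as K-Means would typically consume.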