dc.contributor.author | Zarif, Qureshi Afnan Reg # 11959 | |
dc.contributor.author | Jamil, Muhammad Sufyan Reg # 11998 | |
dc.date.accessioned | 2017-06-20T05:13:11Z | |
dc.date.available | 2017-06-20T05:13:11Z | |
dc.date.issued | 2011-11 | |
dc.identifier.uri | http://hdl.handle.net/123456789/1857 | |
dc.description | Supervised by Rauf Ahmed Shams Malick | en_US |
dc.description.abstract | Over the past decades there has been incredible growth in the data available on the Web and in digital libraries, technical documentation, medical records, and similar sources. This growth has produced ambiguity among data and confused content management, which has made content classification an essential need on the Web. From the early days until now, this difficult job has largely been carried out manually, costing considerable effort, reducing productivity, and consuming valuable time. We began by reviewing the work already completed on this problem and by studying and building the IR (Information Retrieval) and classification techniques that have been proposed for such scenarios. It was a discouraging list: the boundaries between techniques were fuzzy and their hierarchy was full of complicated loops. We therefore decided to focus on a few selected proposals that were highlighted in discussion forums and reviews. Within the limits of the available time and literature, we were able to arrive at a workable solution. The system addresses this key issue, reducing effort and saving time by providing an intelligent means of content classification that reveals identical content and preferred classes of text in large textual datasets. The general idea is to distinguish among the different categories of textual datasets, which may exist in a variety of forms (web pages are the prime target of this project). The project supports categorization of data and better content management than before.
The main techniques used in this solution are: SVM (Support Vector Machine), for training the system on particular topics so that it can make sound classification decisions; TF-IDF (Term Frequency-Inverse Document Frequency), which supports the learning engine; a stemmer, which reduces the complexity of documents by trimming words to their base/root form; the K-Means, STC, and Lingo algorithms, which are used to identify identical classes of documents; and Lucene, an open-source tool for indexing documents for fast content retrieval. | en_US |
dc.language.iso | en_US | en_US |
dc.publisher | Bahria University Karachi Campus | en_US |
dc.title | Data Mining and Text Classification | en_US |
dc.type | Thesis | en_US |
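
The abstract names TF-IDF as the term weighting behind the learning engine. As a rough illustration only, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Python. It is not the thesis's implementation: the function names (`tokenize`, `tf_idf`, `cosine`) and the three sample documents are hypothetical, and the sketch does not reproduce the SVM, stemming, clustering, or Lucene indexing stages.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic words.
    return [w for w in text.lower().split() if w.isalpha()]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative documents (assumed, not taken from the thesis dataset).
docs = [
    "football match score goal",
    "football team goal league",
    "stock market price shares",
]
vectors = tf_idf(docs)
sim_sports = cosine(vectors[0], vectors[1])   # two sports documents
sim_mixed = cosine(vectors[0], vectors[2])    # sports vs. finance
```

In this sketch the two sports documents share weighted terms and so score a higher similarity than the sports and finance pair, which share none. Vectors of this kind are the sort of input a classifier such as an SVM or a clustering algorithm such as K-Means would typically consume.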