Abstract:
Over the past decades there has been tremendous growth in the data available on the Web,
in digital libraries, technical documentation, medical records and similar sources. This
growth gives rise to ambiguity among data and to confused content management, which has
made content classification an essential need on the Web. From the early days until now,
this difficult job has largely been carried out manually, at the cost of considerable
effort, reduced productivity and a high consumption of valuable time.
We began by surveying the work that has already been done on this problem, studying the
IR (Information Retrieval) and classification techniques that have been proposed for
such scenarios. It was a discouraging list: it soon became clear that the boundaries were
fuzzy and that the hierarchy was full of complicated loops. We therefore decided to focus
on a selection of proposals that were highlighted in different discussion forums and
reviews.
Fortunately, within the limits of the available time and literature, we were able to
arrive at a workable solution. The system addresses this key issue: it reduces effort and
saves valuable time by providing an intelligent means of content classification,
revealing identical contents and preferred classes of text in large textual datasets.
The general idea of the system is to distinguish among the different categories of
textual datasets, which may exist in a variety of forms (Web pages are the prime target
of this project). The project helps to categorize data and to manage content better than
before.
The main techniques used in this solution are SVM (Support Vector Machine), for training
the system on particular topics so that it can make sound classification decisions;
TF-IDF (Term Frequency-Inverse Document Frequency), which supports the learning engine;
a stemmer, which reduces the complexity of documents by trimming words to their base/root
form; the clustering algorithms K-Means, STC and Lingo, which are used to identify
identical classes of documents; and LUCENE, an open source tool that indexes documents
for fast content retrieval.
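
As an illustration of how stemming and TF-IDF weighting can work together to reveal
similar documents, the following is a minimal sketch in Python. It is not the system's
implementation: the function names, the toy suffix-stripping stemmer (a stand-in for a
real stemming algorithm) and the three sample documents are all assumptions made for
this example, and the SVM, clustering and LUCENE stages are not shown.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    # Hypothetical toy stemmer: strips a few common English suffixes.
    # A real system would use a proper stemming algorithm instead.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, keep alphabetic runs only, then stem each token.
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def tfidf_vectors(docs):
    # Build one sparse {term: weight} vector per document, where
    # weight = (term count / document length) * log(N / document frequency).
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sample documents, for illustration only.
docs = [
    "Indexing web pages for fast retrieval",
    "Fast retrieval of indexed web pages",
    "Training a classifier on medical records",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because "indexing" and "indexed" reduce to the same stem, the first two documents share
most of their weighted terms and score higher against each other than against the third,
which shares none. Vectors of this kind are the sort of input that a classifier or a
clustering algorithm would then consume.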