Abstract:
We live in the information age. So much information emerges on the internet that
it is next to impossible to go through all of it. This project focuses on
extracting “interesting” information from the web. As a first step, we assume that
newspapers report the most interesting information, and we therefore develop a system that
extracts interesting information from the internet using the news feeds of news
websites. The system is fully automated and relies on only a few input parameters.
The system takes an RSS feed from the specified sources and extracts the news titles
from it. Next, the system removes repeated and insignificant words
from each news title, and a tokenization module transforms the remaining keywords into tokens.
These tokens are combined into sequences of items in time order. A
sequence mining algorithm is then applied to extract the most interesting sequences, and a
detokenization process maps these sequences back to the most interesting news.
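
The pipeline described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation, and several parts are assumptions made for illustration: the titles and timestamps are taken to have been read from the RSS feeds already, the stop-word list is a small hypothetical one, each feed is treated as one time-ordered sequence whose items are the keyword sets of its titles, and a simple count of ordered keyword pairs stands in for the sequence mining algorithm, which the abstract does not name. All function and class names (keywords, Tokenizer, build_sequence, frequent_pairs) are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is"}


def keywords(title):
    """Lower-case a title, drop stop words and repeated words, keep order."""
    seen, kept = set(), []
    for word in title.lower().split():
        word = word.strip(".,:;!?\"'")
        if word and word not in STOP_WORDS and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


class Tokenizer:
    """Maps each keyword to an integer token and back (detokenization)."""

    def __init__(self):
        self.to_id = {}
        self.to_word = []

    def encode(self, word):
        if word not in self.to_id:
            self.to_id[word] = len(self.to_word)
            self.to_word.append(word)
        return self.to_id[word]

    def decode(self, token):
        return self.to_word[token]


def build_sequence(items, tokenizer):
    """Turn (timestamp, title) pairs into a time-ordered list of token sets."""
    ordered = sorted(items, key=lambda item: item[0])
    return [
        frozenset(tokenizer.encode(word) for word in keywords(title))
        for _, title in ordered
    ]


def frequent_pairs(sequences, min_support):
    """Stand-in miner: find ordered pairs (a, b) where token a appears in an
    earlier title than token b. Support is the number of sequences that
    contain the pair at least once."""
    support = Counter()
    for sequence in sequences:
        found = set()
        for earlier, later in combinations(sequence, 2):
            found.update((a, b) for a in earlier for b in later)
        support.update(found)
    return {pair: n for pair, n in support.items() if n >= min_support}


# Made-up example data: integer timestamps and invented titles.
feeds = {
    "feed_a": [
        (2, "Storm hits the coast"),
        (1, "Storm warning for the coast"),
        (3, "Flood in the city"),
    ],
    "feed_b": [
        (1, "Storm warning issued"),
        (2, "Flood damages city"),
    ],
}

tokenizer = Tokenizer()
sequences = [build_sequence(items, tokenizer) for items in feeds.values()]
mined = frequent_pairs(sequences, min_support=2)

# Detokenize the mined patterns back into readable keywords.
patterns = {
    (tokenizer.decode(a), tokenizer.decode(b)): n for (a, b), n in mined.items()
}
for pattern in sorted(patterns):
    print(pattern, patterns[pattern])
```

In this toy data, the only pairs that survive the support threshold are those that occur in time order in both feeds, which is the sense of "interesting" this sketch assumes. A full sequential pattern miner would also find longer patterns and would replace frequent_pairs without changing the other steps.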