Abstract:
News stories play a vital role in daily life, and it is important to preserve them for
future reference by online websites, government organizations and institutions. Many online
newspaper websites publish news stories continuously. However, stories published in the past
may disappear after a certain period of time, and old newspapers are not available on the
news websites. It is therefore important to preserve old newspapers for the long term, and
to preserve the news stories in a standardized format that can be accessed and read
using standard archive dissemination tools. An archive-building mechanism is created for preserving Urdu news stories, following standard preservation models. A preservation format is defined in XML that stores the story and its metadata. The metadata is divided into two types: explicit metadata and implicit metadata. Explicit metadata is available directly in HTML tags, such as the publish date, author name and title of the news story. Implicit metadata is extracted from the story text and consists of the words with their frequencies and part-of-speech (POS) tags. The POS tagging was done using the Urdu Summary Corpus and Software Tools. The lexicon of the Urdu Summary Corpus and Software Tools was improved by adding tags for 2300 English-to-Urdu transliterated words. Each transliterated word in the stories is given the POS tag 'LWEN' (Lone Words English). The preserved news stories are analyzed to calculate the number of transliterated words. The results show that transliterated words make up 9.5 percent of the words in the 600 archived stories.
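
The preservation record described above can be illustrated with a minimal sketch. It is not the schema or the tooling used in this work: the element names (`story`, `explicit_metadata`, `implicit_metadata`, `word`), the three-word lexicon `TRANSLITERATED`, the placeholder tag `UNK`, the whitespace tokenizer and the sample values are all assumptions made for illustration. The sketch also assumes that the share of transliterated words is counted over word occurrences (tokens), which the abstract does not specify.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mini-lexicon of English-to-Urdu transliterated words
# (police, team, match). The real lexicon holds about 2300 entries.
TRANSLITERATED = {"پولیس", "ٹیم", "میچ"}


def build_record(title, author, publish_date, body, lexicon=TRANSLITERATED):
    """Build an XML record holding the story plus explicit and implicit metadata."""
    story = ET.Element("story")

    # Explicit metadata: values assumed to be already read from the page's HTML tags.
    explicit = ET.SubElement(story, "explicit_metadata")
    ET.SubElement(explicit, "title").text = title
    ET.SubElement(explicit, "author").text = author
    ET.SubElement(explicit, "publish_date").text = publish_date

    # Implicit metadata: each distinct word with its frequency and a POS tag.
    # Whitespace splitting is a simplification of Urdu word segmentation.
    implicit = ET.SubElement(story, "implicit_metadata")
    for word, count in Counter(body.split()).items():
        # Only the LWEN tag comes from the source; "UNK" is a placeholder
        # standing in for the output of a real POS tagger.
        tag = "LWEN" if word in lexicon else "UNK"
        ET.SubElement(implicit, "word", frequency=str(count), pos=tag).text = word

    ET.SubElement(story, "text").text = body
    return story


def transliterated_share(story):
    """Percentage of word occurrences tagged LWEN in one record."""
    words = story.findall("./implicit_metadata/word")
    total = sum(int(w.get("frequency")) for w in words)
    lwen = sum(int(w.get("frequency")) for w in words if w.get("pos") == "LWEN")
    return 100.0 * lwen / total if total else 0.0


# Illustrative values only; not taken from the archive.
record = build_record(
    title="مثال",
    author="نامعلوم",
    publish_date="2020-01-01",
    body="پولیس نے ٹیم کو روکا",
)
print(ET.tostring(record, encoding="unicode"))
print(transliterated_share(record))
```

Keeping the per-word frequency and tag inside the record means that a corpus-level figure, such as the share of transliterated words across all archived stories, can be recomputed from the archive alone without re-running the tagger.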