Preserving Digital Urdu News Stories

Syed Mehtab Alam, 01-243142-009

URI: http://hdl.handle.net/123456789/3469

Date: 2017

Abstract:

News stories play a vital role in daily life, and it is important to preserve them for future reference by online websites, government organizations, and institutions. Many online newspaper websites publish news stories continuously, but stories published in the past may disappear after a certain period of time, and old newspapers are not available on the news websites. News stories should therefore be preserved for the long term in a standardized format that can later be accessed and read using standard archive dissemination tools. An archive-building mechanism was created for preserving Urdu news stories, following standard preservation models. A preservation format is defined in XML that stores each story together with its metadata. The metadata is divided into two types: explicit and implicit. Explicit metadata is directly available in the HTML tags of the page, such as the publish date, the author name, and the title of the news story. Implicit metadata is extracted from the story text itself and consists of the words in the story with their frequencies and part-of-speech (POS) tags. POS tagging was done using the Urdu Summary Corpus and Software Tools. The lexicon of the Urdu Summary Corpus and Software Tools was improved by adding tags for 2300 words transliterated from English into Urdu. Each transliterated word in a story is given the POS tag 'LWEN' (Lone Words English). The preserved news stories were then analyzed to count the transliterated words. The results show that 9.5 percent of the words in the 600 archived stories are transliterated.
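The record structure described in the abstract might be sketched as follows. This is a minimal illustration only, not the thesis's actual schema or tooling: the element names (`story`, `explicitMetadata`, `implicitMetadata`, `word`), the toy lexicon, the POS tags other than 'LWEN', and the whitespace tokenizer are all assumptions made for the example. The real system uses the Urdu Summary Corpus and Software Tools for tagging.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy lexicon (word -> POS tag). Only the 'LWEN' tag comes
# from the abstract; the other tags are placeholders for illustration.
TOY_LEXICON = {
    "ٹیم": "LWEN",   # "team", transliterated from English
    "میچ": "LWEN",   # "match", transliterated from English
    "نے": "PSP",
    "جیت": "VB",
    "لیا": "AUX",
}


def build_record(title, author, publish_date, body, lexicon):
    """Build one XML record holding the story plus both kinds of metadata."""
    story = ET.Element("story")

    # Explicit metadata: values assumed to have been read from HTML tags.
    explicit = ET.SubElement(story, "explicitMetadata")
    ET.SubElement(explicit, "title").text = title
    ET.SubElement(explicit, "author").text = author
    ET.SubElement(explicit, "publishDate").text = publish_date

    # Implicit metadata: each distinct word with its frequency and POS tag.
    # Whitespace splitting stands in for a real Urdu tokenizer.
    implicit = ET.SubElement(story, "implicitMetadata")
    for word, freq in Counter(body.split()).items():
        entry = ET.SubElement(
            implicit, "word",
            frequency=str(freq),
            pos=lexicon.get(word, "UNK"),  # unknown words get a fallback tag
        )
        entry.text = word

    ET.SubElement(story, "body").text = body
    return story


def transliterated_percentage(story):
    """Percentage of word occurrences in a record that are tagged 'LWEN'."""
    words = story.findall("./implicitMetadata/word")
    total = sum(int(w.get("frequency")) for w in words)
    loan = sum(int(w.get("frequency")) for w in words if w.get("pos") == "LWEN")
    return 100.0 * loan / total if total else 0.0


record = build_record(
    title="Sample title",
    author="Sample author",
    publish_date="2017-01-01",
    body="ٹیم نے میچ جیت لیا ٹیم",
    lexicon=TOY_LEXICON,
)
xml_text = ET.tostring(record, encoding="unicode")
share = transliterated_percentage(record)
```

Storing the frequency and POS tag as attributes of each `word` element keeps the implicit metadata queryable without re-tagging the story, so a corpus-level figure such as the share of transliterated words can be computed by summing over archived records.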