DSpace Repository

Preserving Digital Urdu News Stories


dc.contributor.author Syed Mehtab Alam, 01-243142-009
dc.date.accessioned 2017-08-02T06:13:49Z
dc.date.available 2017-08-02T06:13:49Z
dc.date.issued 2017
dc.identifier.uri http://hdl.handle.net/123456789/3469
dc.description Supervised by Dr. Arif Ur Rahman en_US
dc.description.abstract News stories play a vital role in daily life, and it is important to preserve them for future reference by online websites, government organizations and institutions. Many online newspaper websites publish news stories continuously. However, stories published in the past may disappear after a certain period of time, and old newspapers are not available on the news websites. It is therefore important to preserve news stories for the long term in a standardized format that can be accessed and read using standard archive dissemination tools. An archive building mechanism was created for preserving Urdu news stories, following standard preservation models. A preservation format is defined in XML which stores the story and its metadata. The metadata is categorized into two types: explicit metadata and implicit metadata. Explicit metadata is directly available in HTML tags, such as the publish date, author name and title of the news story. Implicit metadata is extracted from the story itself and contains the words with their frequencies and part-of-speech (POS) tags. The part-of-speech tagging was done using the Urdu Summary Corpus and Software Tools. The lexicon of the Urdu Summary Corpus and Software Tools was improved by adding tags for 2300 English-to-Urdu transliterated words. Each transliterated word in the stories is tagged with the POS tag 'LWEN' (Lone Words English). The preserved news stories were analyzed to calculate the number of transliterated words. The results show that transliterated words account for 9.5 percent of the words in the 600 archived stories. en_US
dc.language.iso en en_US
dc.publisher Bahria University Islamabad Campus en_US
dc.relation.ispartofseries MS (CS);T-5878
dc.subject Computer Science. en_US
dc.title Preserving Digital Urdu News Stories en_US
dc.type MS Thesis en_US
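
The preservation format described in the abstract (an XML record holding the story, explicit metadata taken from HTML tags, and implicit metadata made of word frequencies with POS tags) could look roughly like the sketch below. This is only an illustration under stated assumptions: the thesis schema is not given in this record, so every element and attribute name (`story`, `explicitMetadata`, `implicitMetadata`, `word`, `pos`, `frequency`) and both function names are hypothetical, and the `OTHER` tag is a placeholder rather than a tag from the Urdu Summary Corpus tagset. Only the `LWEN` tag comes from the abstract.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_story_record(title, author, publish_date, tagged_words):
    """Build an XML record from explicit metadata and (word, POS tag) pairs.

    All element and attribute names are hypothetical, not the thesis schema.
    """
    story = ET.Element("story")

    # Explicit metadata: values read directly from the page's HTML tags.
    explicit = ET.SubElement(story, "explicitMetadata")
    ET.SubElement(explicit, "title").text = title
    ET.SubElement(explicit, "author").text = author
    ET.SubElement(explicit, "publishDate").text = publish_date

    # Implicit metadata: each distinct (word, tag) pair with its frequency.
    implicit = ET.SubElement(story, "implicitMetadata")
    for (word, tag), freq in Counter(tagged_words).items():
        ET.SubElement(implicit, "word", pos=tag, frequency=str(freq)).text = word

    # The story text itself, rebuilt here from the tagged words.
    ET.SubElement(story, "text").text = " ".join(w for w, _ in tagged_words)
    return story


def transliterated_share(story):
    """Return the percentage of word occurrences tagged 'LWEN'."""
    words = story.findall("./implicitMetadata/word")
    total = sum(int(w.get("frequency")) for w in words)
    lwen = sum(int(w.get("frequency")) for w in words if w.get("pos") == "LWEN")
    return 100.0 * lwen / total if total else 0.0


# Made-up sample data; "OTHER" is a placeholder tag for non-transliterated words.
sample = [("ٹیم", "LWEN"), ("نے", "OTHER"), ("میچ", "LWEN"), ("جیتا", "OTHER")]
record = build_story_record("Sample title", "Sample author", "2017-01-01", sample)
print(ET.tostring(record, encoding="unicode"))
print(transliterated_share(record))
```

Storing the tag and frequency as attributes on each `word` element keeps the transliterated-word share computable from the archived record alone, without re-running the tagger.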

