Neural Machine Translation for Post-OCR Error Correction

dc.contributor.author Nehal Yasin, 01-249211-011
dc.date.accessioned 2023-05-24T07:40:03Z
dc.date.available 2023-05-24T07:40:03Z
dc.date.issued 2023
dc.identifier.uri http://hdl.handle.net/123456789/15537
dc.description Supervised by Dr. Imran Siddiqi en_US
dc.description.abstract Recognition of text in non-cursive scripts has received significant research attention in recent years. Thanks to advances in different areas of deep learning, robust end-to-end recognition systems have been developed. Most research on this subject targets the development of recognition engines; fewer efforts have been made to post-process the noisy output of Optical Character Recognition (OCR) to improve recognition rates. While recognition engines have matured significantly for most languages, the problem remains challenging for text in cursive scripts, where systems report high character recognition rates but relatively lower word recognition rates. This study targets post-processing of noisy OCR output for cursive text, using Urdu as a case study, and leverages a transformer-based Neural Machine Translation (NMT) framework to correct the OCR errors. More specifically, the noisy text is fed to the transformer as input and the correct transcription as target. The model is trained and evaluated on such pairs of text collected from news tickers in videos and from tweets by multiple news channels. A comprehensive experimental study shows a significant performance improvement from introducing the proposed post-processing step. The findings of this study are expected to be useful in developing applications on top of OCR for a low-resource language like Urdu. en_US
dc.language.iso en en_US
dc.publisher Computer Sciences en_US
dc.relation.ispartofseries MS (DS);T-01980
dc.subject Neural Machine en_US
dc.subject Post-OCR Error en_US
dc.title Neural Machine Translation for Post-OCR Error Correction en_US
dc.type MS Thesis en_US
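
Illustrative note (not part of the thesis record): the abstract's remark that OCR output can have a high character recognition rate yet a lower word recognition rate can be shown with a small sketch. The code below is an assumption-laden illustration, not the author's evaluation code: the function names `edit_distance` and `error_rate` are hypothetical, the sample strings are invented Latin-script stand-ins rather than Urdu data from the thesis, and plain Levenshtein distance is assumed as the error measure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


# Invented example pair: (noisy OCR output, correct transcription).
# In the setup the abstract describes, such pairs would serve as
# (input, target) for the transformer.
noisy = "the qu1ck brovvn fox"
clean = "the quick brown fox"

char_error = error_rate(clean, noisy)                  # character level
word_error = error_rate(clean.split(), noisy.split())  # word level

print(f"character error rate: {char_error:.3f}")
print(f"word error rate:      {word_error:.3f}")
```

In this invented pair, only three character edits separate the two strings, yet two of the four words are wrong, so the word-level error rate is much higher than the character-level one. That gap is what a post-processing step of the kind described in the abstract aims to close.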