Welcome to the Bahria University DSpace digital repository. DSpace is a digital service that collects, preserves, and distributes digital material. Repositories are important tools for preserving an organization's legacy; they facilitate digital preservation and scholarly communication.
dc.contributor.author | Nehal Yasin, 01-249211-011 | |
dc.date.accessioned | 2023-05-24T07:40:03Z | |
dc.date.available | 2023-05-24T07:40:03Z | |
dc.date.issued | 2023 | |
dc.identifier.uri | http://hdl.handle.net/123456789/15537 | |
dc.description | Supervised by Dr. Imran Siddiqi | en_US |
dc.description.abstract | Recognition of text in non-cursive scripts has received significant research attention in recent years. Thanks to recent advancements in different areas of deep (machine) learning, robust end-to-end recognition systems have been developed. While most research on this subject targets the development of recognition engines, fewer efforts have been made to post-process the noisy output of Optical Character Recognition (OCR) to improve recognition rates. This study investigates transformer-based Neural Machine Translation (NMT) for post-processing of noisy OCR output. While recognition engines have matured significantly for most languages, the problem remains challenging for text in cursive scripts, where character recognition rates are high but word recognition rates are relatively lower. This study targets post-processing of the noisy OCR output for cursive text (using Urdu as a case study) and leverages a transformer-based NMT framework to correct the OCR errors. More specifically, we feed the noisy text as input and the correct transcription as target to the transformer. The model is trained and evaluated on such pairs of text collected from news tickers in videos and tweets from multiple news channels. A comprehensive experimental study shows significant performance improvement from introducing the proposed post-processing step. The findings of this study are expected to be useful in developing various applications on top of OCR for a low-resource language like Urdu. | en_US |
dc.language.iso | en | en_US |
dc.publisher | Computer Sciences | en_US |
dc.relation.ispartofseries | MS (DS);T-01980 | |
dc.subject | Neural Machine | en_US |
dc.subject | Post-OCR Error | en_US |
dc.title | Neural Machine Translation for Post-OCR Error Correction | en_US |
dc.type | MS Thesis | en_US |