Neural Machine Translation for Post-OCR Error Correction

Nehal Yasin, 01-249211-011

dc.contributor.author	Nehal Yasin, 01-249211-011
dc.date.accessioned	2023-05-24T07:40:03Z
dc.date.available	2023-05-24T07:40:03Z
dc.date.issued	2023
dc.identifier.uri	http://hdl.handle.net/123456789/15537
dc.description	Supervised by Dr. Imran Siddiqi	en_US
dc.description.abstract	Recognition of text in non-cursive scripts has received significant research attention in recent years. Thanks to recent advances in deep learning, robust end-to-end recognition systems have been developed. While most research on this subject targets the development of recognition engines, fewer efforts have been made to post-process the noisy output of Optical Character Recognition (OCR) to improve recognition rates. This study investigates transformer-based Neural Machine Translation (NMT) for post-processing of noisy OCR output. While recognition engines have matured significantly for most languages, the problem remains challenging for text in cursive scripts, where systems report high character recognition rates but relatively lower word recognition rates. This study targets post-processing of the noisy OCR output for cursive text (using Urdu as a case study) and leverages a transformer-based NMT framework to correct the OCR errors. More specifically, we feed the noisy text as input and the correct transcription as target to the transformer. The model is trained and evaluated on such pairs of text collected from news tickers in videos and tweets from multiple news channels. A comprehensive experimental study shows significant performance improvement from introducing the proposed post-processing step. It is expected that the findings of this study will be useful in developing various applications on top of OCR for a low-resource language like Urdu.	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Sciences	en_US
dc.relation.ispartofseries	MS (DS);T-953
dc.subject	Neural Machine	en_US
dc.subject	Post-OCR Error	en_US
dc.title	Neural Machine Translation for Post-OCR Error Correction	en_US
dc.type	MS Thesis	en_US