DSpace Repository

Document Visual Question Answering


dc.contributor.author Eiza Batool, 01-249202-002
dc.date.accessioned 2022-12-21T10:29:22Z
dc.date.available 2022-12-21T10:29:22Z
dc.date.issued 2022
dc.identifier.uri http://hdl.handle.net/123456789/14476
dc.description Supervised by Dr. Imran Ahmed Siddiqi en_US
dc.description.abstract Visual Question Answering (VQA) is an emerging problem at the intersection of computer vision and natural language processing (NLP). It is the task of answering a question about an image by combining the visual elements of the image with inferences drawn from the textual question. In most cases, VQA models consider only visual features and ignore the textual content present in a given scene or image. For VQA on document images, however, both visual and textual information play a key role in finding appropriate answers to the posed questions. This research targets the problem of VQA in document images by exploiting both visual and textual information, leveraging recent advances in deep learning. The focus of this study is to answer questions defined on a document image. We propose a method that uses textual features alongside visual features to predict an answer. We use the DocVQA dataset, which includes 50k questions and answers over 12k+ document images. In our system, the model takes the question, the Optical Character Recognition (OCR) output, and the image as input, and a deep learning model processes this input to generate an answer. We use a pre-trained Inception v3 network to represent the image and Gated Recurrent Units (GRUs) to represent the question and the OCR text. Experimental results from our predictive models are evaluated using metrics such as the Average Normalized Levenshtein Similarity (ANLS) score. The Inception v3 model with OCR and attention performs well compared to the other models; the OCR and attention components play a vital role in enhancing performance. en_US
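The ANLS metric mentioned in the abstract is the standard evaluation measure for DocVQA: for each question, the predicted answer is scored against each reference answer by one minus the normalized Levenshtein distance, scores below a threshold (commonly 0.5) are zeroed, the best score per question is kept, and scores are averaged over questions. A minimal sketch of this computation (pure Python, no external libraries; function names are illustrative, not from the thesis):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: one predicted answer string per question.
    ground_truths: a list of acceptable reference answers per question.
    tau: similarity threshold below which a score is zeroed (0.5 in DocVQA).
    """
    total = 0.0
    for pred, refs in zip(predictions, ground_truths):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

For example, an exact match scores 1.0, a one-character slip scores slightly less, and an answer sharing too few characters with every reference scores 0.0.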
dc.language.iso en en_US
dc.publisher Computer Sciences en_US
dc.relation.ispartofseries MS (DS);T-1130
dc.subject Optical Character Recognition en_US
dc.subject Natural Language Processing en_US
dc.title Document Visual Question Answering en_US
dc.type MS Thesis en_US

