Abstract:
Visual Question Answering (VQA) is one of the most rapidly emerging problems at the intersection of computer vision and natural language processing (NLP). It is the task of answering a question about an image by combining visual elements of the image with inferences drawn from the textual question. In most cases, VQA models consider only visual features and ignore the textual content present in the scene or image. For VQA on document images, however, both visual and textual information play a key role in finding appropriate answers to the posed questions. This research targets the problem of VQA in document images by exploiting both visual and textual information, leveraging recent advancements in deep learning. The focus of this study is to answer questions defined on document images. We propose a method that uses textual features alongside visual features to predict an answer. We use the DocVQA dataset, which contains 50k questions and answers over 12k+ document images. In our system, the model takes the question, the Optical Character Recognition (OCR) output, and the image as input, and a deep learning model processes these inputs to generate an answer. We use a pre-trained Inception v3 network to represent the image and Gated Recurrent Units (GRUs) to represent the question and the OCR tokens. To build our VQA system, we experimented with different deep learning techniques, functions, and approaches. Experimental results are generated from our deep learning predictive models and evaluated using metrics such as the Average Normalized Levenshtein Similarity (ANLS) score. The Inception v3 model with OCR and attention performs well compared to the other models; the OCR and attention components play a vital role in enhancing performance.
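The ANLS metric mentioned above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: for each question, the predicted answer is scored against every ground-truth answer by 1 minus the normalized Levenshtein distance, scores whose distance exceeds a threshold (commonly tau = 0.5) are zeroed, and the per-question maxima are averaged over the dataset. Function names and the lowercase/whitespace normalization are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity (illustrative sketch).

    predictions: list of predicted answer strings, one per question.
    ground_truths: list of lists of acceptable answer strings per question.
    """
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            # Normalize edit distance by the longer string's length.
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            # Zero out answers that are too dissimilar (nl >= tau).
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / max(len(predictions), 1)
```

An exact match scores 1.0, a near-miss (e.g. a one-character OCR error) scores just below 1.0, and an answer more than half-wrong by edit distance scores 0.0, which is why ANLS is tolerant of minor OCR noise while still penalizing wrong answers.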