DSpace Repository

Deep Feature Learning for Visual Question Answering

dc.contributor.author Saman Yasmin, 01-243192-009
dc.date.accessioned 2022-01-17T11:39:07Z
dc.date.available 2022-01-17T11:39:07Z
dc.date.issued 2021
dc.identifier.uri http://hdl.handle.net/123456789/11660
dc.description Supervised by Dr. Imran Ahmed Siddiqi en_US
dc.description.abstract Visual Question Answering (VQA) is an emerging challenge for computer vision researchers: given an image and a question about it, a VQA model must relate the two in order to predict an answer. Such models are especially helpful for visually impaired people seeking information about their surroundings or about a particular image. Reading and understanding the text embedded in images, together with their visual content, is a major challenge in answer generation. Several researchers have proposed Artificial Intelligence (AI) systems that learn features from the question and the image to predict an accurate answer, but these systems do not learn the textual features of images, which play an important role in prediction. The textual features of an image give useful clues and information for answering. Such systems can also support space technologies, where asking questions helps clarify a particular situation, and computer vision experts can build AI systems for blind or visually impaired people so that they can ask questions about any scene or environment. Incorporating textual features allows a model to predict more accurate answers. We propose a method that uses textual features alongside visual features to predict an answer. The model detects text and objects in an image and builds a correlation mechanism to represent their features. It is trained as a deep learning model: we use a pre-trained CNN, a BiLSTM, and an OCR engine to retrieve and represent the images and questions. The model identifies and localizes text in the image so that it can be used jointly with the question. From the OCR output we derive word-to-vector and NER-tagging representations and combine them with the CNN and BiLSTM outputs, feeding the composite vector into fully-connected layers.
Experimental results are generated from our deep learning predictive models, and we evaluate the model using several metrics, including precision, recall, and F1-score. en_US
dc.language.iso en en_US
dc.publisher Computer Sciences BUIC en_US
dc.relation.ispartofseries MS (CS);T-9721
dc.subject Deep Feature Learning en_US
dc.subject Visual Question Answering en_US
dc.title Deep Feature Learning for Visual Question Answering en_US
dc.type MS Thesis en_US
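The abstract describes fusing CNN image features, BiLSTM question features, and OCR-derived text features (word vectors plus NER tags) into one composite vector fed to fully-connected layers. The sketch below illustrates that fusion step only, in plain NumPy; all dimensions, the random stand-in features, and the two-layer classifier head are assumptions for illustration, not the thesis's actual configuration.

```python
import numpy as np

# Hypothetical feature dimensions -- the thesis does not specify them.
IMG_DIM, QUES_DIM, OCR_DIM, NER_DIM, NUM_ANSWERS = 2048, 512, 300, 16, 1000

rng = np.random.default_rng(0)

def fuse_features(img_feat, ques_feat, ocr_feat, ner_feat):
    """Concatenate visual, question, and textual (OCR word-vector + NER-tag)
    features into one composite vector, as described in the abstract."""
    return np.concatenate([img_feat, ques_feat, ocr_feat, ner_feat])

def dense_relu(x, w, b):
    """One fully-connected layer with ReLU activation."""
    return np.maximum(0.0, w @ x + b)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random stand-ins for the real encoder outputs.
img_feat = rng.standard_normal(IMG_DIM)    # e.g. pre-trained CNN pooled features
ques_feat = rng.standard_normal(QUES_DIM)  # e.g. BiLSTM final hidden state
ocr_feat = rng.standard_normal(OCR_DIM)    # word2vec embedding of OCR tokens
ner_feat = rng.standard_normal(NER_DIM)    # NER-tag representation of OCR text

composite = fuse_features(img_feat, ques_feat, ocr_feat, ner_feat)

# Untrained fully-connected head producing an answer distribution.
w1 = rng.standard_normal((1024, composite.size)) * 0.01
b1 = np.zeros(1024)
w2 = rng.standard_normal((NUM_ANSWERS, 1024)) * 0.01
b2 = np.zeros(NUM_ANSWERS)

hidden = dense_relu(composite, w1, b1)
answer_probs = softmax(w2 @ hidden + b2)  # probability over candidate answers
```

In practice the weights would be learned end to end and the answer would be taken as `answer_probs.argmax()`; the point here is only the late-fusion concatenation of the four feature streams.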

