DSpace Repository

Deep Feature Learning for Visual Question Answering

dc.contributor.author Saman Yasmin, 01-243192-009
dc.date.accessioned 2022-01-17T11:39:07Z
dc.date.available 2022-01-17T11:39:07Z
dc.date.issued 2021
dc.identifier.uri http://hdl.handle.net/123456789/11660
dc.description Supervised by Dr. Imran Ahmed Siddiqi en_US
dc.description.abstract Visual Question Answering (VQA) is an emerging challenge for computer vision researchers: given an image and a question about it, a VQA model must relate the two in order to predict an answer. Such models are especially helpful for visually impaired people seeking information about their surroundings or about a particular image. Reading and understanding the text embedded in images, together with their visual content, is a major challenge in answer generation. Several researchers have proposed Artificial Intelligence (AI) systems that learn features from the question and the image to predict an accurate answer, but these systems do not learn the textual features of images, which play an important role in prediction. The textual features of an image give useful clues and information for answering. Such systems can also support space technologies, where asking questions helps clarify a particular situation, and computer vision experts can build AI systems for blind or visually impaired people so that they can ask questions about any scene or environment. Incorporating textual features allows a model to predict more accurate answers. We propose a method that uses textual features alongside visual features to predict an answer. The model detects text and objects in an image and builds a correlation mechanism to represent their features. It is trained as a deep learning model: we use a pre-trained CNN, a BiLSTM, and an OCR engine to retrieve and represent the images and questions. The model identifies and localizes text in the image so that it can be used jointly with the question. From the OCR output we derive word-to-vector and NER-tagging representations and combine them with the CNN and BiLSTM outputs, feeding the composite vector into fully-connected layers.
Experimental results are generated from our deep learning predictive models, and we evaluate the model using several metrics, including precision, recall, and F1-score. en_US
dc.language.iso en en_US
dc.publisher Computer Sciences BUIC en_US
dc.relation.ispartofseries MS (CS);T-9721
dc.subject Deep Feature Learning en_US
dc.subject Visual Question Answering en_US
dc.title Deep Feature Learning for Visual Question Answering en_US
dc.type MS Thesis en_US
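The abstract describes fusing CNN image features, BiLSTM question features, and OCR-derived text features (word vectors plus NER tags) into one composite vector fed to fully-connected layers. The sketch below illustrates that fusion step only, in plain NumPy; all dimensions, the random stand-in features, and the two-layer classifier head are assumptions for illustration, not the thesis's actual configuration.

```python
import numpy as np

# Hypothetical feature dimensions -- the thesis does not specify them.
IMG_DIM, QUES_DIM, OCR_DIM, NER_DIM, NUM_ANSWERS = 2048, 512, 300, 16, 1000

rng = np.random.default_rng(0)

def fuse_features(img_feat, ques_feat, ocr_feat, ner_feat):
    """Concatenate visual, question, and textual (OCR word-vector + NER-tag)
    features into one composite vector, as described in the abstract."""
    return np.concatenate([img_feat, ques_feat, ocr_feat, ner_feat])

def dense_relu(x, w, b):
    """One fully-connected layer with ReLU activation."""
    return np.maximum(0.0, w @ x + b)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random stand-ins for the real encoder outputs.
img_feat = rng.standard_normal(IMG_DIM)    # e.g. pre-trained CNN pooled features
ques_feat = rng.standard_normal(QUES_DIM)  # e.g. BiLSTM final hidden state
ocr_feat = rng.standard_normal(OCR_DIM)    # word2vec embedding of OCR tokens
ner_feat = rng.standard_normal(NER_DIM)    # NER-tag representation of OCR text

composite = fuse_features(img_feat, ques_feat, ocr_feat, ner_feat)

# Untrained fully-connected head producing an answer distribution.
w1 = rng.standard_normal((1024, composite.size)) * 0.01
b1 = np.zeros(1024)
w2 = rng.standard_normal((NUM_ANSWERS, 1024)) * 0.01
b2 = np.zeros(NUM_ANSWERS)

hidden = dense_relu(composite, w1, b1)
answer_probs = softmax(w2 @ hidden + b2)  # probability over candidate answers
```

In practice the weights would be learned end to end and the answer would be taken as `answer_probs.argmax()`; the point here is only the late-fusion concatenation of the four feature streams.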

