DSpace Repository

Pixels to Phrases: Bridging Vision and Language

dc.contributor.author Zahid Zaman, 01-249231-020
dc.date.accessioned 2026-02-25T05:14:45Z
dc.date.available 2026-02-25T05:14:45Z
dc.date.issued 2025
dc.identifier.uri http://hdl.handle.net/123456789/20725
dc.description Supervised by Dr. Arif Ur Rahman en_US
dc.description.abstract This thesis presents a context-aware image captioning framework that bridges the gap between visual perception and natural language generation by combining deep learning with structured semantic reasoning. While conventional captioning models primarily rely on convolutional neural networks (CNNs) to extract image features, they often fail to capture contextual and relational semantics, producing captions that are visually accurate yet semantically shallow. To address these limitations, this study proposes a hybrid architecture integrating three complementary components: (1) DenseNet-121, employed as a robust visual encoder for hierarchical feature extraction; (2) a lightweight Knowledge Graph (KG) [...] The framework was implemented and evaluated using the Flickr8k dataset under limited computational resources. Qualitative and quantitative results demonstrate that the hybrid model achieved notable performance gains over conventional baselines, with BLEU-4 improving from 18.2 to 20.9, METEOR from 16.5 to 18.3, and CIDEr from 47.8 to 53.5, reflecting enhanced semantic richness and contextual grounding in generated captions. The findings highlight the value of incorporating structured external knowledge into vision–language models, paving the way for future work on automated knowledge graph construction, cross-dataset generalization, and real-time caption generation. Overall, this research contributes to advancing interpretable and human-like visual description systems capable of producing contextually grounded and semantically meaningful captions. en_US
dc.language.iso en en_US
dc.publisher Computer Sciences en_US
dc.relation.ispartofseries MS (DS);T-3191
dc.subject Pixels to Phrases en_US
dc.subject Bridging Vision en_US
dc.subject Language en_US
dc.title Pixels to Phrases: Bridging Vision and Language en_US
dc.type MS Thesis en_US
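
Note on the reported scores: the abstract gives baseline and hybrid-model values for BLEU-4, METEOR, and CIDEr but not the size of each gain. The short Python sketch below derives the absolute and relative improvements from those six numbers only. The names `REPORTED_SCORES` and `metric_gains` are illustrative and do not come from the thesis, and the rounding to one decimal place is an assumption made for readability.

```python
# Scores copied from the abstract as (baseline, hybrid model).
# The dictionary and function names are hypothetical, chosen for this sketch.
REPORTED_SCORES = {
    "BLEU-4": (18.2, 20.9),
    "METEOR": (16.5, 18.3),
    "CIDEr": (47.8, 53.5),
}


def metric_gains(scores):
    """Return {metric: (absolute_gain, relative_gain_percent)}, rounded to 1 decimal."""
    gains = {}
    for name, (baseline, hybrid) in scores.items():
        absolute = hybrid - baseline            # gain in metric points
        relative = 100.0 * absolute / baseline  # gain as a percentage of the baseline
        gains[name] = (round(absolute, 1), round(relative, 1))
    return gains


if __name__ == "__main__":
    for name, (absolute, relative) in metric_gains(REPORTED_SCORES).items():
        print(f"{name}: +{absolute} points ({relative}% relative to baseline)")
```

Under these assumptions the gains come to roughly 2.7 points for BLEU-4, 1.8 for METEOR, and 5.7 for CIDEr, that is, relative improvements of about 11 to 15 percent over the baseline.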


