Pixels to Phrases: Bridging Vision and Language

Zahid Zaman, 01-249231-020

dc.contributor.author	Zahid Zaman, 01-249231-020
dc.date.accessioned	2026-02-25T05:14:45Z
dc.date.available	2026-02-25T05:14:45Z
dc.date.issued	2025
dc.identifier.uri	http://hdl.handle.net/123456789/20725
dc.description	Supervised by Dr. Arif Ur Rahman	en_US
dc.description.abstract	This thesis presents a context-aware image captioning framework that bridges the gap between visual perception and natural language generation by combining deep learning with structured semantic reasoning. While conventional captioning models primarily rely on convolutional neural networks (CNNs) to extract image features, they often fail to capture contextual and relational semantics, producing captions that are visually accurate yet semantically shallow. To address these limitations, this study proposes a hybrid architecture integrating three complementary components: (1) DenseNet-121, employed as a robust visual encoder for hierarchical feature extraction; (2) a lightweight Knowledge Graph (KG) The framework was implemented and evaluated using the Flickr8k dataset under limited computational resources. Qualitative and quantitative results demonstrate that the hybrid model achieved notable performance gains over conventional baselines, with BLEU-4 improving from 18.2 to 20.9, METEOR from 16.5 to 18.3, and CIDEr from 47.8 to 53.5, reflecting enhanced semantic richness and contextual grounding in generated captions. The findings highlight the value of incorporating structured external knowledge into vision–language models, paving the way for future work on automated knowledge graph construction, cross-dataset generalization, and real-time caption generation. Overall, this research contributes to advancing interpretable and human-like visual description systems capable of producing contextually grounded and semantically meaningful captions.	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Sciences	en_US
dc.relation.ispartofseries	MS (DS);T-3191
dc.subject	Pixels to Phrases	en_US
dc.subject	Bridging Vision	en_US
dc.subject	Language	en_US
dc.title	Pixels to Phrases: Bridging Vision and Language	en_US
dc.type	MS Thesis	en_US