Abstract:
This thesis presents a context-aware image captioning framework that bridges the gap between visual perception and natural language generation by combining deep learning with structured semantic reasoning. While conventional captioning models primarily rely on convolutional neural networks (CNNs) to extract image features, they often fail to capture contextual and relational semantics, producing captions that are visually accurate yet semantically shallow. To address these limitations, this study proposes a hybrid architecture integrating three complementary components, including (1) DenseNet-121, employed as a robust visual encoder for hierarchical feature extraction, and (2) a lightweight Knowledge Graph (KG) that injects structured semantic context into the caption generation process. The framework was implemented and evaluated using the Flickr8k dataset under limited computational resources. Qualitative and quantitative results demonstrate that the hybrid model achieved notable performance gains over conventional baselines, with BLEU-4 improving from 18.2 to 20.9, METEOR from 16.5 to 18.3, and CIDEr from 47.8 to 53.5, reflecting enhanced semantic richness and contextual grounding in the generated captions. The findings highlight the value of incorporating structured external knowledge into vision–language models, paving the way for future work on automated knowledge graph construction, cross-dataset generalization, and real-time caption generation. Overall, this research contributes to advancing interpretable and human-like visual description systems capable of producing contextually grounded and semantically meaningful captions.
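To make the proposed architecture concrete, the sketch below shows one plausible way to wire a DenseNet-121 visual encoder, a precomputed knowledge-graph context vector, and an LSTM caption decoder together in Keras. The layer sizes, the fusion-by-concatenation strategy, and the names (e.g. KG_DIM, kg_context) are illustrative assumptions rather than the exact design evaluated in this thesis.

```python
# Minimal sketch of the hybrid captioning idea described in the abstract:
# DenseNet-121 visual features fused with a knowledge-graph context vector
# before an LSTM decoder predicts the next caption token. Dimensions and the
# concatenation-based fusion are assumptions, not the thesis's implementation.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

VOCAB_SIZE = 8000   # assumed caption vocabulary size (Flickr8k-scale)
MAX_LEN = 34        # assumed maximum caption length in tokens
KG_DIM = 128        # assumed dimensionality of the KG context embedding

# Visual encoder: DenseNet-121 with global average pooling (1024-d features).
cnn = DenseNet121(include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False

image_in = layers.Input(shape=(224, 224, 3), name="image")
kg_in = layers.Input(shape=(KG_DIM,), name="kg_context")      # precomputed KG embedding
caption_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="caption_tokens")

visual = layers.Dense(256, activation="relu")(cnn(image_in))
context = layers.Dense(256, activation="relu")(kg_in)
fused = layers.Concatenate()([visual, context])                # simple fusion by concatenation
fused = layers.Dense(256, activation="relu")(fused)

# Language decoder: token embedding + LSTM initialized with the fused image/KG state.
emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_in)
dec = layers.LSTM(256)(emb, initial_state=[fused, fused])
out = layers.Dense(VOCAB_SIZE, activation="softmax")(layers.Add()([dec, fused]))

model = Model([image_in, kg_in, caption_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

In this sketch the model is trained to predict the next caption word from a partial caption plus the fused image/KG state; at inference time a caption would be generated token by token (e.g. greedy or beam search) and scored with BLEU, METEOR, and CIDEr as reported above.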