Vision Transformer Based Image Captioning for the Visually Impaired
Conference paper
Qazi, N., Dewaji, I. and Khan, N. 2025. Vision Transformer Based Image Captioning for the Visually Impaired. 14th International Conference on Human Interaction and Emerging Technologies: Artificial Intelligence & Future Applications, IHIET-FS 2025, June 10-12, 2025, University of East London, London, United Kingdom. AHFE International. https://doi.org/10.54941/ahfe1005964
Authors | Qazi, N., Dewaji, I. and Khan, N. |
---|---|
Type | Conference paper |
Abstract | Digital accessibility remains a central concern in Human-Computer Interaction (HCI), particularly for visually impaired individuals who depend on assistive technologies to interpret visual content. While image captioning systems have shown notable progress in high-resource languages, languages such as Indonesian, despite having a large speaker base, continue to be underserved. This disparity stems from the lack of annotated datasets and models that account for linguistic and cultural nuances, thereby limiting equitable access to visual information for Indonesian-speaking users. To address this gap, we present a bilingual image captioning framework aimed at improving digital accessibility for visually impaired users in the Indonesian-speaking community. We propose an end-to-end system that integrates a neural machine translation component with three deep learning-based captioning architectures: CNN-RNN, Vision Transformer with GPT-2 (ViT-GPT2), and Generative Adversarial Networks (GANs). The Flickr30k dataset was translated into Indonesian using leading machine translation models, with Google Translate achieving the highest scores across BLEU, METEOR, and ROUGE metrics. These translated captions served as training data for evaluating the image captioning models. Experimental results demonstrate that the ViT-GPT2 model outperforms the others, achieving the highest BLEU (0.2599) and ROUGE (0.3004) scores, reflecting its effectiveness in generating accurate and contextually rich captions. This work advances inclusive AI by developing culturally adaptive captioning models for underrepresented languages. By generating culturally and linguistically relevant captions for visually impaired users, the framework advances Human-Computer Interaction through more accessible and inclusive user-system communication. 
Beyond its technical contributions, this research addresses key challenges in Human-Computer Interaction (HCI) by enabling inclusive, multilingual assistive technologies. It supports the evolution of Next-Generation Work environments by equipping visually impaired individuals with tools to independently interpret visual information, an increasingly essential capability in AI-rich, visually oriented digital workspaces. In future work, the framework will be enhanced through multimodal pretraining and the integration of culturally enriched datasets, aiming to improve semantic accuracy and broaden its applicability to a wider range of linguistic communities. |
Year | 2025 |
Conference | 14th International Conference on Human Interaction and Emerging Technologies: Artificial Intelligence & Future Applications, IHIET-FS 2025, June 10-12, 2025, University of East London, London, United Kingdom. |
Publisher | AHFE International |
Accepted author manuscript | File access level: Registered users only |
Publisher's version | File access level: Anyone |
Publication dates | |
Online | 16 Jun 2025 |
Publication process dates | |
Deposited | 17 Jun 2025 |
Journal citation | 196, pp. 153-162 |
ISSN | 2771-0718 |
Book title | Human Interaction and Emerging Technologies (IHIET-FS 2025): Future Systems and Artificial Intelligence Applications |
Book editor | Ahram, T., Arewa, A. and Ghorashi, S. |
ISBN | 978-1-964867-72-4 |
Digital Object Identifier (DOI) | https://doi.org/10.54941/ahfe1005964 |
Web address (URL) of conference proceedings | https://openaccess.cms-conferences.org/publications/book/978-1-964867-72-4 |
Copyright holder | © 2025 The Authors |
https://repository.uel.ac.uk/item/8zv9y