Vision Transformer Based Image Captioning for the Visually Impaired

Conference paper


Qazi, N., Dewaji, I. and Khan, N. 2025. Vision Transformer Based Image Captioning for the Visually Impaired. 14th International Conference on Human Interaction and Emerging Technologies: Artificial Intelligence & Future Applications, IHIET-FS 2025, June 10-12, 2025, University of East London, London, United Kingdom. AHFE International. https://doi.org/10.54941/ahfe1005964
Authors: Qazi, N., Dewaji, I. and Khan, N.
Type: Conference paper
Abstract

Digital accessibility remains a central concern in Human-Computer Interaction (HCI), particularly for visually impaired individuals who depend on assistive technologies to interpret visual content. While image captioning systems have shown notable progress in high-resource languages, languages such as Indonesian, despite having a large speaker base, continue to be underserved. This disparity stems from the lack of annotated datasets and models that account for linguistic and cultural nuances, thereby limiting equitable access to visual information for Indonesian-speaking users. To address this gap, we present a bilingual image captioning framework aimed at improving digital accessibility for visually impaired users in the Indonesian-speaking community. We propose an end-to-end system that integrates a neural machine translation component with three deep learning-based captioning architectures: CNN-RNN, Vision Transformer with GPT-2 (ViT-GPT2), and Generative Adversarial Networks (GANs). The Flickr30k dataset was translated into Indonesian using leading machine translation models, with Google Translate achieving the highest scores across BLEU, METEOR, and ROUGE metrics. These translated captions were then used to train and evaluate the image captioning models. Experimental results demonstrate that the ViT-GPT2 model outperforms the others, achieving the highest BLEU (0.2599) and ROUGE (0.3004) scores, reflecting its effectiveness in generating accurate and contextually rich captions. This work advances inclusive AI by developing culturally adaptive captioning models for underrepresented languages, and it advances HCI by making user-system communication more accessible and inclusive through culturally and linguistically relevant captions for visually impaired users. Beyond these technical contributions, the research supports the evolution of Next-Generation Work environments by equipping visually impaired individuals with multilingual assistive tools to independently interpret visual information, an increasingly essential capability in AI-rich, visually oriented digital workspaces. In future work, the framework will be enhanced through multimodal pretraining and the integration of culturally enriched datasets, aiming to improve semantic accuracy and broaden its applicability to a wider range of linguistic communities.
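
Illustrative sketch (not part of the publication record): the ViT-GPT2 architecture evaluated in the paper pairs a Vision Transformer image encoder with a GPT-2 text decoder. The snippet below shows, under stated assumptions, how such a model can generate a caption and how a candidate caption can be scored with sentence-level BLEU, one of the metrics reported above. The checkpoint name, image path, and Indonesian reference tokens are illustrative stand-ins, not the authors' fine-tuned model or data.

# A minimal sketch of ViT-GPT2 captioning plus BLEU scoring, assuming the
# Hugging Face transformers and NLTK libraries are installed. The public
# English checkpoint below is a stand-in; the paper instead fine-tunes on
# Flickr30k captions machine-translated into Indonesian.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

checkpoint = "nlpconnect/vit-gpt2-image-captioning"  # illustrative ViT-GPT2 checkpoint
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
processor = ViTImageProcessor.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def generate_caption(image_path: str) -> str:
    """Encode the image with the ViT encoder and decode a caption with GPT-2."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Score a generated caption against reference captions with sentence-level BLEU.
hypothesis = generate_caption("example.jpg").split()              # hypothetical image file
references = [["dua", "anak", "bermain", "bola", "di", "taman"]]  # illustrative Indonesian reference
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")

In a full pipeline one would fine-tune the encoder-decoder on the translated Flickr30k captions and report corpus-level BLEU, METEOR, and ROUGE, as the paper does.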

Year: 2025
Conference: 14th International Conference on Human Interaction and Emerging Technologies: Artificial Intelligence & Future Applications, IHIET-FS 2025, June 10-12, 2025, University of East London, London, United Kingdom
Publisher: AHFE International
Accepted author manuscript
File access level: Registered users only
Publisher's version
File access level: Anyone
Publication dates
Online: 16 Jun 2025
Publication process dates
Deposited: 17 Jun 2025
Journal citation: 196, pp. 153-162
ISSN: 2771-0718
Book title: Human Interaction and Emerging Technologies (IHIET-FS 2025): Future Systems and Artificial Intelligence Applications
Book editors: Ahram, T., Arewa, A. and Ghorashi, S.
ISBN: 978-1-964867-72-4
Digital Object Identifier (DOI): https://doi.org/10.54941/ahfe1005964
Web address (URL) of conference proceedings: https://openaccess.cms-conferences.org/publications/book/978-1-964867-72-4
Copyright holder: © 2025 The Authors
Permalink: https://repository.uel.ac.uk/item/8zv9y

Download files


Publisher's version
978-1-964867-72-4_15.pdf
License: CC BY-NC-SA 4.0
File access level: Anyone



Related outputs

Unveiling the Power of Hybrid Balancing Techniques and Ensemble Stacked and Blended Classifiers for Enhanced Churn Prediction
Gaikwad, K., Berardinelli, N. and Qazi, N. 2024. Unveiling the Power of Hybrid Balancing Techniques and Ensemble Stacked and Blended Classifiers for Enhanced Churn Prediction. 16th Asian Conference on Intelligent Information and Database Systems, UAE, 15 - 18 Apr 2024. Springer. https://doi.org/10.1007/978-981-97-5937-8_20
A reinforcement learning recommender system using bi-clustering and Markov Decision Process
Iftikhar, A., Ghazanfar, M. A., Ayub, M., Alahmari, S. A., Qazi, N. and Wall, J. 2024. A reinforcement learning recommender system using bi-clustering and Markov Decision Process. Expert Systems with Applications. 237, Art. 121541. https://doi.org/10.1016/j.eswa.2023.121541
Shifting the Weight: Applications of AI in Olympic Weightlifting
Bolarinwa, D., Qazi, N. and Ghazanfar, M. 2023. Shifting the Weight: Applications of AI in Olympic Weightlifting. PRDC 2023: 28th IEEE Pacific Rim International Symposium on Dependable Computing, Singapore, 24 - 27 Oct 2023. IEEE. https://doi.org/10.1109/PRDC59308.2023.00051
Global impact of COVID-19 on surgeons and team members (GlobalCOST): a cross-sectional study
Jaffry, Z., Raj, S., Sallam, A., Lyman, S., Negida, A., Yiu, C. F. A., Sobti, A., Bua, N., Field, R. E., Abdalla, H., Hammad, R., Qazi, N., Singh, B., Brennan, P. A., Hussein, A., Narvani, A., Jones, A., Imam, M. A. and The OrthoGlobe Collaborative 2022. Global impact of COVID-19 on surgeons and team members (GlobalCOST): a cross-sectional study. BMJ Open. 12 (8), p. e059873. https://doi.org/10.1136/bmjopen-2021-059873