Vision Transformer Based Image Captioning for the Visually Impaired

Conference paper


Qazi, N., Dewaji, I. and Khan, N. 2025. Vision Transformer Based Image Captioning for the Visually Impaired. 14th International Conference on Human Interaction and Emerging Technologies: Artificial Intelligence & Future Applications, IHIET-FS 2025, June 10-12, 2025, University of East London, London, United Kingdom. AHFE International. https://doi.org/10.54941/ahfe1005964
Authors: Qazi, N., Dewaji, I. and Khan, N.
Type: Conference paper
Abstract

Digital accessibility remains a central concern in Human-Computer Interaction (HCI), particularly for visually impaired individuals who depend on assistive technologies to interpret visual content. While image captioning systems have shown notable progress in high-resource languages, languages such as Indonesian, despite having a large speaker base, continue to be underserved. This disparity stems from the lack of annotated datasets and models that account for linguistic and cultural nuances, thereby limiting equitable access to visual information for Indonesian-speaking users. To address this gap, we present a bilingual image captioning framework aimed at improving digital accessibility for visually impaired users in the Indonesian-speaking community. We propose an end-to-end system that integrates a neural machine translation component with three deep learning-based captioning architectures: CNN-RNN, Vision Transformer with GPT-2 (ViT-GPT2), and Generative Adversarial Networks (GANs). The Flickr30k dataset was translated into Indonesian using leading machine translation models, with Google Translate achieving the highest scores across BLEU, METEOR, and ROUGE metrics. These translated captions were then used to train and evaluate the image captioning models. Experimental results demonstrate that the ViT-GPT2 model outperforms the others, achieving the highest BLEU (0.2599) and ROUGE (0.3004) scores, reflecting its effectiveness in generating accurate and contextually rich captions. This work advances inclusive AI by developing culturally adaptive captioning models for underrepresented languages, and it advances HCI by making user-system communication more accessible and inclusive through culturally and linguistically relevant captions for visually impaired users. Beyond these technical contributions, the research supports the evolution of Next-Generation Work environments by equipping visually impaired individuals with multilingual assistive tools to independently interpret visual information, an increasingly essential capability in AI-rich, visually oriented digital workspaces. In future work, the framework will be enhanced through multimodal pretraining and the integration of culturally enriched datasets, aiming to improve semantic accuracy and broaden its applicability to a wider range of linguistic communities.
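
Illustrative sketch (not part of the publication record): the ViT-GPT2 architecture evaluated in the paper pairs a Vision Transformer image encoder with a GPT-2 text decoder. The snippet below shows, under stated assumptions, how such a model can generate a caption and how a candidate caption can be scored with sentence-level BLEU, one of the metrics reported above. The checkpoint name, image path, and Indonesian reference tokens are illustrative stand-ins, not the authors' fine-tuned model or data.

# A minimal sketch of ViT-GPT2 captioning plus BLEU scoring, assuming the
# Hugging Face transformers and NLTK libraries are installed. The public
# English checkpoint below is a stand-in; the paper instead fine-tunes on
# Flickr30k captions machine-translated into Indonesian.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

checkpoint = "nlpconnect/vit-gpt2-image-captioning"  # illustrative ViT-GPT2 checkpoint
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
processor = ViTImageProcessor.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def generate_caption(image_path: str) -> str:
    """Encode the image with the ViT encoder and decode a caption with GPT-2."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Score a generated caption against reference captions with sentence-level BLEU.
hypothesis = generate_caption("example.jpg").split()              # hypothetical image file
references = [["dua", "anak", "bermain", "bola", "di", "taman"]]  # illustrative Indonesian reference
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")

In a full pipeline one would fine-tune the encoder-decoder on the translated Flickr30k captions and report corpus-level BLEU, METEOR, and ROUGE, as the paper does.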

Year: 2025
Conference: 14th International Conference on Human Interaction and Emerging Technologies: Artificial Intelligence & Future Applications, IHIET-FS 2025, June 10-12, 2025, University of East London, London, United Kingdom
Publisher: AHFE International
Accepted author manuscript
File access level: Registered users only
Publisher's version
File access level: Anyone
Publication dates
Online: 16 Jun 2025
Publication process dates
Deposited: 17 Jun 2025
Journal citation: 196, pp. 153-162
ISSN: 2771-0718
Book title: Human Interaction and Emerging Technologies (IHIET-FS 2025): Future Systems and Artificial Intelligence Applications
Book editors: Ahram, T., Arewa, A. and Ghorashi, S.
ISBN: 978-1-964867-72-4
Digital Object Identifier (DOI): https://doi.org/10.54941/ahfe1005964
Web address (URL) of conference proceedings: https://openaccess.cms-conferences.org/publications/book/978-1-964867-72-4
Copyright holder: © 2025 The Authors
Permalink: https://repository.uel.ac.uk/item/8zv9y

Download files


Publisher's version
978-1-964867-72-4_15.pdf
License: CC BY-NC-SA 4.0
File access level: Anyone



Related outputs

Unveiling the Power of Hybrid Balancing Techniques and Ensemble Stacked and Blended Classifiers for Enhanced Churn Prediction
Gaikwad, K., Berardinelli, N. and Qazi, N. 2024. Unveiling the Power of Hybrid Balancing Techniques and Ensemble Stacked and Blended Classifiers for Enhanced Churn Prediction. 16th Asian Conference on Intelligent Information and Database Systems, UAE, 15 - 18 Apr 2024. Springer. https://doi.org/10.1007/978-981-97-5937-8_20
A reinforcement learning recommender system using bi-clustering and Markov Decision Process
Iftikhar, A., Ghazanfar, M. A., Ayub, M., Alahmari, S. A., Qazi, N. and Wall, J. 2024. A reinforcement learning recommender system using bi-clustering and Markov Decision Process. Expert Systems with Applications. 237, Art. 121541. https://doi.org/10.1016/j.eswa.2023.121541
Shifting the Weight: Applications of AI in Olympic Weightlifting
Bolarinwa, D., Qazi, N. and Ghazanfar, M. 2023. Shifting the Weight: Applications of AI in Olympic Weightlifting. PRDC 2023: 28th IEEE Pacific Rim International Symposium on Dependable Computing, Singapore, 24 - 27 Oct 2023. IEEE. https://doi.org/10.1109/PRDC59308.2023.00051
Global impact of COVID-19 on surgeons and team members (GlobalCOST): a cross-sectional study
Jaffry, Z., Raj, S., Sallam, A., Lyman, S., Negida, A., Yiu, C. F. A., Sobti, A., Bua, N., Field, R. E., Abdalla, H., Hammad, R., Qazi, N., Singh, B., Brennan, P. A., Hussein, A., Narvani, A., Jones, A., Imam, M. A. and The OrthoGlobe Collaborative 2022. Global impact of COVID-19 on surgeons and team members (GlobalCOST): a cross-sectional study. BMJ Open. 12 (8), p. e059873. https://doi.org/10.1136/bmjopen-2021-059873