University of Bahrain
Scientific Journals

The Role of Transformer-based Image Captioning for Indoor Environment Visual Understanding

dc.contributor.author Hatta Fudholi, Dhomas
dc.contributor.author Abida N. Nayoan, Royan
dc.date.accessioned 2022-08-06T04:02:02Z
dc.date.available 2022-08-06T04:02:02Z
dc.date.issued 2022-08-06
dc.identifier.issn 2210-142X
dc.identifier.uri https://journal.uob.edu.bh:443/handle/123456789/4629
dc.description.abstract Image captioning has attracted extensive attention in the field of image understanding. It has two natural parts, image and language, and combines computer vision and NLP to generate captions. The goal is a model whose descriptions are as accurate as the ground-truth captions provided by humans. Image captioning can be applied in many scenarios, such as helping visually impaired people gain a better understanding of their surrounding environment through generated captions that can be converted to speech. In this paper, we present a novel image captioning approach in Bahasa Indonesia, using a Transformer, to enable visual understanding of indoor environments. We use our own modified MSCOCO dataset, covering ten indoor object categories: beds, sinks, chairs, couches, tables, televisions, refrigerators, house plants, ovens, and cellphones. We modified the annotations by creating three new captions in Bahasa Indonesia that include the object's name, color, position, size, characteristics, and close surroundings. We use a Transformer architecture, which we compare against a merge encoder-decoder architecture under different hyperparameter tunings. Both architectures use InceptionV3 to extract image features. Our experiments show that the Transformer model with a batch size of 64, 4 attention heads, and a dropout of 0.2 outperforms the other models, with a BLEU-1 score of 0.527565, BLEU-2 of 0.353696, BLEU-3 of 0.227728, BLEU-4 of 0.146192, METEOR of 0.184714, ROUGE-L of 0.377379, and CIDEr of 0.393117. Finally, the inference results show that the generated captions can convey indoor environment understanding. en_US
dc.language.iso en en_US
dc.publisher University of Bahrain en_US
dc.subject Image Captioning en_US
dc.subject Bahasa Indonesia en_US
dc.subject Transformer en_US
dc.subject Visual Understanding en_US
dc.subject Indoor Environment en_US
dc.title The Role of Transformer-based Image Captioning for Indoor Environment Visual Understanding en_US
dc.identifier.doi https://dx.doi.org/10.12785/ijcds/120138
dc.volume 12 en_US
dc.issue 1 en_US
dc.pagestart 479 en_US
dc.pageend 488 en_US
dc.contributor.authorcountry Indonesia en_US
dc.contributor.authoraffiliation Department of Informatics, Universitas Islam Indonesia, Yogyakarta en_US
dc.contributor.authoraffiliation Master Program in Informatics, Universitas Islam Indonesia, Yogyakarta en_US
dc.source.title International Journal of Computing and Digital Systems en_US
dc.abbreviatedsourcetitle IJCDS en_US
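
The abstract's best-performing configuration uses a Transformer with 4 attention heads over InceptionV3 image features. Below is a minimal NumPy sketch of the multi-head self-attention operation at the core of such a model; the random projection weights, dimensions, and toy feature matrix are hypothetical placeholders for illustration, not the paper's trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads=4, rng=None):
    """Scaled dot-product self-attention split across num_heads heads.

    x: (seq_len, d_model) feature sequence, e.g. flattened spatial
    features from an InceptionV3 backbone.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    # Hypothetical random projections; a trained model learns these weights.
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) * d_model**-0.5
                     for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Split into heads: (num_heads, seq_len, d_head)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Attention weights per head: (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ v  # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

# Toy feature sequence standing in for extracted image features
feats = np.random.default_rng(1).standard_normal((64, 256))
y = multi_head_attention(feats, num_heads=4)
print(y.shape)  # (64, 256)
```

In the full captioning model, this attention block is stacked inside encoder and decoder layers, with the decoder generating the Bahasa Indonesia caption token by token.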

