University of Bahrain
Scientific Journals

The Role of Transformer-based Image Captioning for Indoor Environment Visual Understanding

dc.contributor.author Hatta Fudholi, Dhomas
dc.contributor.author Abida N. Nayoan, Royan
dc.date.accessioned 2022-08-06T04:02:02Z
dc.date.available 2022-08-06T04:02:02Z
dc.date.issued 2022-08-06
dc.identifier.issn 2210-142X
dc.identifier.uri https://journal.uob.edu.bh:443/handle/123456789/4629
dc.description.abstract Image captioning has attracted extensive attention in the field of image understanding. It has two natural parts, image and language, and combines computer vision and NLP to generate captions. The goal is a model whose descriptions are as accurate as the ground-truth captions provided by humans. Image captioning can be applied in many scenarios, such as helping visually impaired people gain a better understanding of their surrounding environment through generated captions that can be converted to speech. In this paper, we present a novel image captioning approach in Bahasa Indonesia, using a Transformer, to enable visual understanding of indoor environments. We use our own modified MSCOCO dataset, covering ten indoor object categories: beds, sinks, chairs, couches, tables, televisions, refrigerators, house plants, ovens, and cellphones. We modified the annotations by creating three new captions in Bahasa Indonesia that include the object's name, color, position, size, characteristics, and close surroundings. We use a Transformer architecture, which we compare against a merge encoder-decoder architecture under different hyperparameter tunings. Both architectures use InceptionV3 to extract image features. Our experiments show that the Transformer model with a batch size of 64, 4 attention heads, and a dropout of 0.2 outperforms the other models, with a BLEU-1 score of 0.527565, BLEU-2 of 0.353696, BLEU-3 of 0.227728, BLEU-4 of 0.146192, METEOR of 0.184714, ROUGE-L of 0.377379, and CIDEr of 0.393117. Finally, the inference results show that the generated captions can convey indoor environment understanding. en_US
dc.language.iso en en_US
dc.publisher University of Bahrain en_US
dc.subject Image Captioning en_US
dc.subject Bahasa Indonesia en_US
dc.subject Transformer en_US
dc.subject Visual Understanding en_US
dc.subject Indoor Environment en_US
dc.title The Role of Transformer-based Image Captioning for Indoor Environment Visual Understanding en_US
dc.identifier.doi https://dx.doi.org/10.12785/ijcds/120138
dc.volume 12 en_US
dc.issue 1 en_US
dc.pagestart 479 en_US
dc.pageend 488 en_US
dc.contributor.authorcountry Indonesia en_US
dc.contributor.authoraffiliation Department of Informatics, Universitas Islam Indonesia, Yogyakarta en_US
dc.contributor.authoraffiliation Master Program in Informatics, Universitas Islam Indonesia, Yogyakarta en_US
dc.source.title International Journal of Computing and Digital Systems en_US
dc.abbreviatedsourcetitle IJCDS en_US
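
The abstract's best-performing configuration uses a Transformer with 4 attention heads over InceptionV3 image features. Below is a minimal NumPy sketch of the multi-head self-attention operation at the core of such a model; the random projection weights, dimensions, and toy feature matrix are hypothetical placeholders for illustration, not the paper's trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads=4, rng=None):
    """Scaled dot-product self-attention split across num_heads heads.

    x: (seq_len, d_model) feature sequence, e.g. flattened spatial
    features from an InceptionV3 backbone.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    # Hypothetical random projections; a trained model learns these weights.
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) * d_model**-0.5
                     for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Split into heads: (num_heads, seq_len, d_head)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Attention weights per head: (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ v  # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

# Toy feature sequence standing in for extracted image features
feats = np.random.default_rng(1).standard_normal((64, 256))
y = multi_head_attention(feats, num_heads=4)
print(y.shape)  # (64, 256)
```

In the full captioning model, this attention block is stacked inside encoder and decoder layers, with the decoder generating the Bahasa Indonesia caption token by token.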

