Image Captioning through Image Transformer

Sen He; Wentong Liao; Hamed R. Tavakoli; Michael Yang; Bodo Rosenhahn; Nicolas Pugeault

doi:10.1007/978-3-030-69538-5_10

Details

Original language	English
Title of host publication	Computer Vision – ACCV 2020
Subtitle of host publication	15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV
Editors	Hiroshi Ishikawa, Cheng-Lin Liu, Tomas Pajdla, Jianbo Shi
Pages	153-169
Number of pages	17
ISBN (electronic)	978-3-030-69538-5
Publication status	Published - 2021

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12625 LNCS
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.

Keywords

cs.CV

ASJC Scopus subject areas

Mathematics(all)
Theoretical Computer Science
Computer Science(all)
General Computer Science

Cite this

Image Captioning through Image Transformer. / He, Sen; Liao, Wentong; Tavakoli, Hamed R. et al.
Computer Vision – ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV. ed. / Hiroshi Ishikawa; Cheng-Lin Liu; Tomas Pajdla; Jianbo Shi. 2021. p. 153-169 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12625 LNCS).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research

He, S, Liao, W, Tavakoli, HR, Yang, M, Rosenhahn, B & Pugeault, N 2021, Image Captioning through Image Transformer. in H Ishikawa, C-L Liu, T Pajdla & J Shi (eds), Computer Vision – ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12625 LNCS, pp. 153-169. https://doi.org/10.1007/978-3-030-69538-5_10

He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B., & Pugeault, N. (2021). Image Captioning through Image Transformer. In H. Ishikawa, C.-L. Liu, T. Pajdla, & J. Shi (Eds.), Computer Vision – ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV (pp. 153-169). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12625 LNCS). https://doi.org/10.1007/978-3-030-69538-5_10

He S, Liao W, Tavakoli HR, Yang M, Rosenhahn B, Pugeault N. Image Captioning through Image Transformer. In Ishikawa H, Liu CL, Pajdla T, Shi J, editors, Computer Vision – ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV. 2021. p. 153-169. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). Epub 2021 Feb 25. doi: 10.1007/978-3-030-69538-5_10

He, Sen ; Liao, Wentong ; Tavakoli, Hamed R. et al. / Image Captioning through Image Transformer. Computer Vision – ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV. editor / Hiroshi Ishikawa ; Cheng-Lin Liu ; Tomas Pajdla ; Jianbo Shi. 2021. pp. 153-169 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

Download

@inproceedings{c5e858da74754c46ad197f9d8a6003bb,

title = "Image Captioning through Image Transformer",

abstract = "Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the \textit{transformer} architecture for image captioning. However, the structure between the \textit{semantic units} in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the \textbf{\textit{image transformer}}, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.",

keywords = "cs.CV",

author = "Sen He and Wentong Liao and Tavakoli, {Hamed R.} and Michael Yang and Bodo Rosenhahn and Nicolas Pugeault",

year = "2021",

doi = "10.1007/978-3-030-69538-5_10",

language = "English",

isbn = "978-3-030-69537-8",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

pages = "153--169",

editor = "Hiroshi Ishikawa and Cheng-Lin Liu and Tomas Pajdla and Shi, {Jianbo}",

booktitle = "Computer Vision – ACCV 2020",

}

Download

TY - GEN

T1 - Image Captioning through Image Transformer

AU - He, Sen

AU - Liao, Wentong

AU - Tavakoli, Hamed R.

AU - Yang, Michael

AU - Rosenhahn, Bodo

AU - Pugeault, Nicolas

PY - 2021

Y1 - 2021

N2 - Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.

AB - Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.

KW - cs.CV

UR - http://www.scopus.com/inward/record.url?scp=85103275378&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-69538-5_10

DO - 10.1007/978-3-030-69538-5_10

M3 - Conference contribution

SN - 978-3-030-69537-8

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 153

EP - 169

BT - Computer Vision – ACCV 2020

A2 - Ishikawa, Hiroshi

A2 - Liu, Cheng-Lin

A2 - Pajdla, Tomas

A2 - Shi, Jianbo

ER -

By the same author(s)

Utilizing Uncertainty in 2D Pose Detectors for Probabilistic 3D Human Mesh Recovery

Quantized Inverse Design for Photonic Integrated Circuits

Segment Any Object Model (SAOM): Real-To-Simulation Fine-Tuning Strategy For Multi-Class Multi-Instance Segmentation

Guest Editorial: Special Issue on Multimodal Learning

A flexible framework for large-scale FDTD simulations: open-source inverse design for 3D nanostructures
