
Text to Image Generation with Semantic-Spatial Aware GAN

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Authorship

External organisations

  • University of Twente

Details

Original language: English
Title of host publication: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 18166-18175
Number of pages: 10
ISBN (electronic): 978-1-6654-6946-3
ISBN (print): 978-1-6654-6947-0
Publication status: Published - 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (print): 1063-6919

Abstract

A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.
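
The abstract describes two mechanisms: a semantic-adaptive affine transformation of image feature maps conditioned on the text embedding, and a weakly-supervised mask map that restricts where that transformation is applied. The authors' implementation is available at the GitHub link above; what follows is only a minimal PyTorch sketch of that idea, with all names (SSABlock, text_dim, the mask head) chosen here for illustration rather than taken from the paper.

# Minimal sketch of the mechanism described in the abstract, not the authors' code:
# a residual block whose per-channel affine conditioning (gamma, beta) is predicted
# from the sentence embedding and applied only where a learned spatial mask is active.
import torch
import torch.nn as nn

class SSABlock(nn.Module):
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        # Per-channel scale and shift predicted from the text embedding
        self.to_gamma = nn.Linear(text_dim, channels)
        self.to_beta = nn.Linear(text_dim, channels)
        # Mask head: predicts where on the feature map the text condition applies
        self.to_mask = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim) sentence embedding
        b, c, _, _ = feat.shape
        gamma = self.to_gamma(text_emb).view(b, c, 1, 1)
        beta = self.to_beta(text_emb).view(b, c, 1, 1)
        mask = self.to_mask(feat)                      # (B, 1, H, W), values in [0, 1]
        # Text-conditioned affine transform, applied only in masked regions
        fused = feat * (1 + mask * gamma) + mask * beta
        return feat + self.act(self.conv(fused))       # residual connection

In the paper's framework, blocks of this kind are stacked in the generator and trained end-to-end together with the text encoder, so the mask head learns its spatial weighting from the current text-image fusion without explicit mask supervision.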

ASJC Scopus subject areas

Cite

Text to Image Generation with Semantic-Spatial Aware GAN. / Liao, Wentong; Hu, Kai; Yang, Michael Ying et al.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June).


Liao, W, Hu, K, Yang, MY & Rosenhahn, B 2022, Text to Image Generation with Semantic-Spatial Aware GAN. in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, Institute of Electrical and Electronics Engineers Inc., pp. 18166-18175. https://doi.org/10.1109/CVPR52688.2022.01765
Liao, W., Hu, K., Yang, M. Y., & Rosenhahn, B. (2022). Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 18166-18175). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR52688.2022.01765
Liao W, Hu K, Yang MY, Rosenhahn B. Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc. 2022. p. 18166-18175. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52688.2022.01765
Liao, Wentong ; Hu, Kai ; Yang, Michael Ying et al. / Text to Image Generation with Semantic-Spatial Aware GAN. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).
BibTeX
@inproceedings{1a43e4964fa449d19bcbed27698416e1,
title = "Text to Image Generation with Semantic-Spatial Aware GAN",
abstract = "A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image. ",
keywords = "cs.CV, cs.LG, Image and video synthesis and generation, Vision + language",
author = "Wentong Liao and Kai Hu and Yang, {Michael Ying} and Bodo Rosenhahn",
note = "Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany{\textquoteright}s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).",
year = "2022",
doi = "10.1109/CVPR52688.2022.01765",
language = "English",
isbn = "978-1-6654-6947-0",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "18166--18175",
booktitle = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
address = "United States",

}

RIS

TY - GEN

T1 - Text to Image Generation with Semantic-Spatial Aware GAN

AU - Liao, Wentong

AU - Hu, Kai

AU - Yang, Michael Ying

AU - Rosenhahn, Bodo

N1 - Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

PY - 2022

Y1 - 2022

N2 - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.

AB - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.

KW - cs.CV

KW - cs.LG

KW - Image and video synthesis and generation

KW - Vision + language

UR - http://www.scopus.com/inward/record.url?scp=85139192930&partnerID=8YFLogxK

U2 - 10.1109/CVPR52688.2022.01765

DO - 10.1109/CVPR52688.2022.01765

M3 - Conference contribution

SN - 978-1-6654-6947-0

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 18166

EP - 18175

BT - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

PB - Institute of Electrical and Electronics Engineers Inc.

ER -
