Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series

Research output: Contribution to journal › Conference article › peer review

Authors

  • Mohammadreza Heidarianbaei
  • Hubert Kanyamahanga
  • Mareike Dorozynski

Details

Original language: English
Pages (from-to): 169-177
Number of pages: 9
Journal: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume: X-3-2024
Publication status: Published - 4 Nov 2024
Event: 2024 Symposium on Beyond the Canopy: Technologies and Applications of Remote Sensing, Belem, Brazil
Duration: 4 Nov 2024 to 8 Nov 2024

Abstract

Semantic segmentation is essential in remote sensing, serving applications such as environmental monitoring and land cover classification. Recent advancements aim to collectively classify data from diverse sensors and epochs to improve predictive accuracy. With the availability of vast Satellite Image Time Series (SITS) data, supervised deep learning methods, such as Transformer models, become viable options. This paper introduces the Temporal Vision Transformer (ViT), designed to extract features from SITS. These features, capturing the temporal patterns of land cover classes, are integrated with features derived from aerial imagery to improve land cover classification. Drawing inspiration from the success of transformers in natural language processing (NLP), the Temporal ViT concurrently extracts spatial and temporal information from SITS data using tailored positional encoding strategies. The proposed approach fosters comprehensive feature learning across both domains, facilitating seamless integration of encoded data from SITS into aerial images. Furthermore, a training strategy is proposed that encourages the Temporal ViT to focus on classes whose appearance changes over the year. Extensive experiments indicate the enhanced classification performance of the Temporal ViT compared to existing state-of-the-art techniques for multi-modal land cover classification. Our model achieves a 3.8% increase in mean IoU compared to a network relying solely on aerial images.
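The abstract describes tokenizing SITS data and adding tailored positional encodings for both the spatial and the temporal dimension before transformer processing. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the patch size, the random linear projection, and the use of acquisition day-of-year as the temporal position are all illustrative assumptions.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Standard sinusoidal positional encoding for a 1-D list of positions."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]    # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs                                      # (N, dim/2)
    enc = np.zeros((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def encode_sits_tokens(sits, patch, days_of_year, dim):
    """Turn a SITS cube (T, H, W, C) into one transformer token sequence,
    adding a temporal encoding (acquisition day-of-year) and a spatial
    encoding (patch index) to each patch embedding."""
    T, H, W, C = sits.shape
    ph, pw = H // patch, W // patch
    # Cut each epoch into non-overlapping patches and flatten them.
    tokens = sits.reshape(T, ph, patch, pw, patch, C).transpose(0, 1, 3, 2, 4, 5)
    tokens = tokens.reshape(T, ph * pw, patch * patch * C)
    # A fixed random projection stands in for a learned patch embedding.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch * C, dim)) / np.sqrt(dim)
    tokens = tokens @ proj                                   # (T, N, dim)
    t_enc = sinusoidal_encoding(days_of_year, dim)           # (T, dim)
    s_enc = sinusoidal_encoding(np.arange(ph * pw), dim)     # (N, dim)
    tokens = tokens + t_enc[:, None, :] + s_enc[None, :, :]
    return tokens.reshape(T * ph * pw, dim)                  # flat token sequence
```

Because the temporal encoding is anchored to day-of-year rather than to the index within the series, irregularly sampled time series keep a consistent notion of season, which is what lets a model attend to classes whose appearance changes over the year.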

Keywords

    Land cover classification, Multi-Sensor remote sensing, Satellite image time series, Semantic segmentation, Vision transformer


Cite this

Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series. / Heidarianbaei, Mohammadreza; Kanyamahanga, Hubert; Dorozynski, Mareike.
In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. X-3-2024, 04.11.2024, p. 169-177.

Research output: Contribution to journal › Conference article › peer review

Heidarianbaei, M, Kanyamahanga, H & Dorozynski, M 2024, 'Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series', ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. X-3-2024, pp. 169-177. https://doi.org/10.5194/isprs-annals-X-3-2024-169-2024
Heidarianbaei, M., Kanyamahanga, H., & Dorozynski, M. (2024). Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, X-3-2024, 169-177. https://doi.org/10.5194/isprs-annals-X-3-2024-169-2024
Heidarianbaei M, Kanyamahanga H, Dorozynski M. Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. 2024 Nov 4;X-3-2024:169-177. doi: 10.5194/isprs-annals-X-3-2024-169-2024
Heidarianbaei, Mohammadreza; Kanyamahanga, Hubert; Dorozynski, Mareike. / Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series. In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. 2024; Vol. X-3-2024. pp. 169-177.
@article{1aceb0a535d646f39c8dcd732535d3b6,
title = "Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series",
abstract = "Semantic segmentation is essential in remote sensing, serving applications such as environmental monitoring and land cover classification. Recent advancements aim to collectively classify data from diverse sensors and epochs to improve predictive accuracy. With the availability of vast Satellite Image Time Series (SITS) data, supervised deep learning methods, such as Transformer models, become viable options. This paper introduces the Temporal Vision Transformer (ViT), designed to extract features from SITS. These features, capturing the temporal patterns of land cover classes, are integrated with features derived from aerial imagery to improve land cover classification. Drawing inspiration from the success of transformers in natural language processing (NLP), the Temporal ViT concurrently extracts spatial and temporal information from SITS data using tailored positional encoding strategies. The proposed approach fosters comprehensive feature learning across both domains, facilitating seamless integration of encoded data from SITS into aerial images. Furthermore, a training strategy is proposed that encourages the Temporal ViT to focus on classes whose appearance changes over the year. Extensive experiments indicate the enhanced classification performance of the Temporal ViT compared to existing state-of-the-art techniques for multi-modal land cover classification. Our model achieves a 3.8% increase in mean IoU compared to a network relying solely on aerial images.",
keywords = "Land cover classification, Multi-Sensor remote sensing, Satellite image time series, Semantic segmentation, Vision transformer",
author = "Mohammadreza Heidarianbaei and Hubert Kanyamahanga and Mareike Dorozynski",
note = "Publisher Copyright: {\textcopyright} Author(s) 2024.; 2024 Symposium on Beyond the Canopy: Technologies and Applications of Remote Sensing ; Conference date: 04-11-2024 Through 08-11-2024",
year = "2024",
month = nov,
day = "4",
doi = "10.5194/isprs-annals-X-3-2024-169-2024",
language = "English",
volume = "X-3-2024",
pages = "169--177",
journal = "ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences",
}


TY - JOUR

T1 - Temporal ViT-U-Net Tandem Model: Enhancing Multi-Sensor Land Cover Classification Through Transformer-Based Utilization of Satellite Image Time Series

T2 - 2024 Symposium on Beyond the Canopy: Technologies and Applications of Remote Sensing

AU - Heidarianbaei, Mohammadreza

AU - Kanyamahanga, Hubert

AU - Dorozynski, Mareike

N1 - Publisher Copyright: © Author(s) 2024.

PY - 2024/11/4

Y1 - 2024/11/4

N2 - Semantic segmentation is essential in remote sensing, serving applications such as environmental monitoring and land cover classification. Recent advancements aim to collectively classify data from diverse sensors and epochs to improve predictive accuracy. With the availability of vast Satellite Image Time Series (SITS) data, supervised deep learning methods, such as Transformer models, become viable options. This paper introduces the Temporal Vision Transformer (ViT), designed to extract features from SITS. These features, capturing the temporal patterns of land cover classes, are integrated with features derived from aerial imagery to improve land cover classification. Drawing inspiration from the success of transformers in natural language processing (NLP), the Temporal ViT concurrently extracts spatial and temporal information from SITS data using tailored positional encoding strategies. The proposed approach fosters comprehensive feature learning across both domains, facilitating seamless integration of encoded data from SITS into aerial images. Furthermore, a training strategy is proposed that encourages the Temporal ViT to focus on classes whose appearance changes over the year. Extensive experiments indicate the enhanced classification performance of the Temporal ViT compared to existing state-of-the-art techniques for multi-modal land cover classification. Our model achieves a 3.8% increase in mean IoU compared to a network relying solely on aerial images.

AB - Semantic segmentation is essential in remote sensing, serving applications such as environmental monitoring and land cover classification. Recent advancements aim to collectively classify data from diverse sensors and epochs to improve predictive accuracy. With the availability of vast Satellite Image Time Series (SITS) data, supervised deep learning methods, such as Transformer models, become viable options. This paper introduces the Temporal Vision Transformer (ViT), designed to extract features from SITS. These features, capturing the temporal patterns of land cover classes, are integrated with features derived from aerial imagery to improve land cover classification. Drawing inspiration from the success of transformers in natural language processing (NLP), the Temporal ViT concurrently extracts spatial and temporal information from SITS data using tailored positional encoding strategies. The proposed approach fosters comprehensive feature learning across both domains, facilitating seamless integration of encoded data from SITS into aerial images. Furthermore, a training strategy is proposed that encourages the Temporal ViT to focus on classes whose appearance changes over the year. Extensive experiments indicate the enhanced classification performance of the Temporal ViT compared to existing state-of-the-art techniques for multi-modal land cover classification. Our model achieves a 3.8% increase in mean IoU compared to a network relying solely on aerial images.

KW - Land cover classification

KW - Multi-Sensor remote sensing

KW - Satellite image time series

KW - Semantic segmentation

KW - Vision transformer

UR - http://www.scopus.com/inward/record.url?scp=85212389099&partnerID=8YFLogxK

U2 - 10.5194/isprs-annals-X-3-2024-169-2024

DO - 10.5194/isprs-annals-X-3-2024-169-2024

M3 - Conference article

AN - SCOPUS:85212389099

VL - X-3-2024

SP - 169

EP - 177

JO - ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

JF - ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

SN - 2194-9042

Y2 - 4 November 2024 through 8 November 2024

ER -