Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images

Publication: Contribution to journal · Article · Research · Peer-reviewed

Authors

External organisations

  • Robert Bosch GmbH

Details

Original language: English
Pages (from-to): 483-498
Number of pages: 16
Journal: PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science
Volume: 92
Issue number: 5
Early online date: 18 Sep 2024
Publication status: Published - Oct 2024

Abstract

An accurate 3D representation of the geometry and semantics of an environment forms the basis for a large variety of downstream tasks and is essential for autonomous driving tasks such as path planning and obstacle avoidance. The focus of this work is on 3D semantic occupancy prediction, i.e., the reconstruction of a scene as a voxel grid in which each voxel is assigned both an occupancy and a semantic label. We present a Convolutional Neural Network-based method that uses multiple color images from a surround-view setup with minimal overlap, together with the associated interior and exterior camera parameters, as input to reconstruct the observed environment as a 3D semantic occupancy map. To account for the ill-posed nature of reconstructing a 3D representation from monocular 2D images, the image information is integrated over time: under the assumption that the camera setup is moving, images from consecutive time steps are used to form a multi-view stereo setup. In exhaustive experiments, we investigate the challenges posed by dynamic objects and the possibilities of training the proposed method with either 3D or 2D reference data; the latter is motivated by the comparably higher costs of generating and annotating 3D ground truth data. Moreover, we present and investigate a novel self-supervised training scheme that does not require any geometric reference data but relies only on sparse semantic ground truth. An evaluation on the Occ3D dataset, including a comparison against current state-of-the-art self-supervised methods from the literature, demonstrates the potential of our self-supervised variant.

ASJC Scopus subject areas

Cite

Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images. / Abualhanud, S.; Erahan, E.; Mehltretter, M.
In: PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science, Vol. 92, No. 5, 10.2024, pp. 483-498.


Abualhanud, S, Erahan, E & Mehltretter, M 2024, 'Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images', PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science, vol. 92, no. 5, pp. 483-498. https://doi.org/10.1007/s41064-024-00308-9
Abualhanud, S., Erahan, E., & Mehltretter, M. (2024). Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images. PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science, 92(5), 483-498. https://doi.org/10.1007/s41064-024-00308-9
Abualhanud S, Erahan E, Mehltretter M. Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images. PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science. 2024 Oct;92(5):483-498. Epub 2024 Sep 18. doi: 10.1007/s41064-024-00308-9
Abualhanud, S. ; Erahan, E. ; Mehltretter, M. / Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images. In: PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science. 2024 ; Vol. 92, No. 5. pp. 483-498.
@article{2ef15dc9ef6c4e34a46c0661b9bdc1e3,
title = "Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images",
abstract = "An accurate 3D representation of the geometry and semantics of an environment forms the basis for a large variety of downstream tasks and is essential for autonomous driving tasks such as path planning and obstacle avoidance. The focus of this work is on 3D semantic occupancy prediction, i.e., the reconstruction of a scene as a voxel grid in which each voxel is assigned both an occupancy and a semantic label. We present a Convolutional Neural Network-based method that uses multiple color images from a surround-view setup with minimal overlap, together with the associated interior and exterior camera parameters, as input to reconstruct the observed environment as a 3D semantic occupancy map. To account for the ill-posed nature of reconstructing a 3D representation from monocular 2D images, the image information is integrated over time: under the assumption that the camera setup is moving, images from consecutive time steps are used to form a multi-view stereo setup. In exhaustive experiments, we investigate the challenges posed by dynamic objects and the possibilities of training the proposed method with either 3D or 2D reference data; the latter is motivated by the comparably higher costs of generating and annotating 3D ground truth data. Moreover, we present and investigate a novel self-supervised training scheme that does not require any geometric reference data but relies only on sparse semantic ground truth. An evaluation on the Occ3D dataset, including a comparison against current state-of-the-art self-supervised methods from the literature, demonstrates the potential of our self-supervised variant.",
keywords = "3D Occupancy Prediction, 3D Perception, NeRF, Semantic Scene Completion",
author = "S. Abualhanud and E. Erahan and M. Mehltretter",
note = "Publisher Copyright: {\textcopyright} The Author(s) 2024.",
year = "2024",
month = oct,
doi = "10.1007/s41064-024-00308-9",
language = "English",
volume = "92",
pages = "483--498",
number = "5",
journal = "PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science",
issn = "2512-2789",

}


TY - JOUR

T1 - Self-Supervised 3D Semantic Occupancy Prediction from Multi-View 2D Surround Images

AU - Abualhanud, S.

AU - Erahan, E.

AU - Mehltretter, M.

N1 - Publisher Copyright: © The Author(s) 2024.

PY - 2024/10

Y1 - 2024/10

N2 - An accurate 3D representation of the geometry and semantics of an environment forms the basis for a large variety of downstream tasks and is essential for autonomous driving tasks such as path planning and obstacle avoidance. The focus of this work is on 3D semantic occupancy prediction, i.e., the reconstruction of a scene as a voxel grid in which each voxel is assigned both an occupancy and a semantic label. We present a Convolutional Neural Network-based method that uses multiple color images from a surround-view setup with minimal overlap, together with the associated interior and exterior camera parameters, as input to reconstruct the observed environment as a 3D semantic occupancy map. To account for the ill-posed nature of reconstructing a 3D representation from monocular 2D images, the image information is integrated over time: under the assumption that the camera setup is moving, images from consecutive time steps are used to form a multi-view stereo setup. In exhaustive experiments, we investigate the challenges posed by dynamic objects and the possibilities of training the proposed method with either 3D or 2D reference data; the latter is motivated by the comparably higher costs of generating and annotating 3D ground truth data. Moreover, we present and investigate a novel self-supervised training scheme that does not require any geometric reference data but relies only on sparse semantic ground truth. An evaluation on the Occ3D dataset, including a comparison against current state-of-the-art self-supervised methods from the literature, demonstrates the potential of our self-supervised variant.

AB - An accurate 3D representation of the geometry and semantics of an environment forms the basis for a large variety of downstream tasks and is essential for autonomous driving tasks such as path planning and obstacle avoidance. The focus of this work is on 3D semantic occupancy prediction, i.e., the reconstruction of a scene as a voxel grid in which each voxel is assigned both an occupancy and a semantic label. We present a Convolutional Neural Network-based method that uses multiple color images from a surround-view setup with minimal overlap, together with the associated interior and exterior camera parameters, as input to reconstruct the observed environment as a 3D semantic occupancy map. To account for the ill-posed nature of reconstructing a 3D representation from monocular 2D images, the image information is integrated over time: under the assumption that the camera setup is moving, images from consecutive time steps are used to form a multi-view stereo setup. In exhaustive experiments, we investigate the challenges posed by dynamic objects and the possibilities of training the proposed method with either 3D or 2D reference data; the latter is motivated by the comparably higher costs of generating and annotating 3D ground truth data. Moreover, we present and investigate a novel self-supervised training scheme that does not require any geometric reference data but relies only on sparse semantic ground truth. An evaluation on the Occ3D dataset, including a comparison against current state-of-the-art self-supervised methods from the literature, demonstrates the potential of our self-supervised variant.

KW - 3D Occupancy Prediction

KW - 3D Perception

KW - NeRF

KW - Semantic Scene Completion

UR - http://www.scopus.com/inward/record.url?scp=85204175168&partnerID=8YFLogxK

U2 - 10.1007/s41064-024-00308-9

DO - 10.1007/s41064-024-00308-9

M3 - Article

AN - SCOPUS:85204175168

VL - 92

SP - 483

EP - 498

JO - PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science

JF - PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science

SN - 2512-2789

IS - 5

ER -
