Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization

Nils Poschadel; Stephan Preihs; Jürgen Peissig

doi:10.1186/s13636-025-00393-7

Details

Original language	English
Article number	7
Journal	Eurasip Journal on Audio, Speech, and Music Processing
Volume	2025
Issue number	1
Publication status	Published - 12 Feb 2025

Abstract

In this paper, a detailed investigation of deep learning-based speaker detection and localization (SDL) with higher-order Ambisonics signals is conducted. Different spherical harmonic (SH) input features such as the higher-order pseudointensity vector (HO-PIV), relative harmonic coefficients (RHCs), and the spatially-localized pseudointensity vector (SL-PIV), a feature proposed for the first time as an input feature for deep learning-based SDL, are examined using first- to fourth-order SH signals. The trained neural networks, optimized with a single loss function for the combined tasks of detection and localization, are then evaluated in detail for overall SDL performance as well as their performance in the sub-tasks of detection and, particularly, localization. The results are further analyzed in dependence on room reverberation, signal-to-interference ratio (SIR), as well as the number and distances between multiple simultaneously active speakers, utilizing both simulated and measured data. The findings indicate an overall improvement in SDL performance up to third-order Ambisonics for all investigated features, while using fourth-order signals does not yield any further improvement or sometimes even delivers worse results. Notably, the HO-PIV and the SL-PIV, both extensions of the first-order pseudointensity vector (FO-PIV), have proven to be suitable input features. In particular the newly proposed SL-PIV has been found to be the best of the investigated features on third- and fourth-order Ambisonics signals, especially in the most demanding scenarios on measured data, with multiple, closely located speakers and poor SIR.

Keywords

DOA, HOA, SDL, Spherical harmonics, SSL

ASJC Scopus subject areas

Physics and Astronomy(all)
Acoustics and Ultrasonics
Engineering(all)
Electrical and Electronic Engineering

Cite this

Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization. / Poschadel, Nils; Preihs, Stephan; Peissig, Jürgen.
In: Eurasip Journal on Audio, Speech, and Music Processing, Vol. 2025, No. 1, 7, 12.02.2025.

Research output: Contribution to journal › Article › Research › peer review

Poschadel, N, Preihs, S & Peissig, J 2025, 'Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization', Eurasip Journal on Audio, Speech, and Music Processing, vol. 2025, no. 1, 7. https://doi.org/10.1186/s13636-025-00393-7

Poschadel, N., Preihs, S., & Peissig, J. (2025). Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization. Eurasip Journal on Audio, Speech, and Music Processing, 2025(1), Article 7. https://doi.org/10.1186/s13636-025-00393-7

Poschadel N, Preihs S, Peissig J. Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization. Eurasip Journal on Audio, Speech, and Music Processing. 2025 Feb 12;2025(1):7. doi: 10.1186/s13636-025-00393-7

Poschadel, Nils ; Preihs, Stephan ; Peissig, Jürgen. / Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization. In: Eurasip Journal on Audio, Speech, and Music Processing. 2025 ; Vol. 2025, No. 1.


@article{d29cbd1b7e36418dae91bf8b2b451a97,
title = "Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization",

abstract = "In this paper, a detailed investigation of deep learning-based speaker detection and localization (SDL) with higher-order Ambisonics signals is conducted. Different spherical harmonic (SH) input features such as the higher-order pseudointensity vector (HO-PIV), relative harmonic coefficients (RHCs), and the spatially-localized pseudointensity vector (SL-PIV), a feature proposed for the first time as an input feature for deep learning-based SDL, are examined using first- to fourth-order SH signals. The trained neural networks, optimized with a single loss function for the combined tasks of detection and localization, are then evaluated in detail for overall SDL performance as well as their performance in the sub-tasks of detection and, particularly, localization. The results are further analyzed in dependence on room reverberation, signal-to-interference ratio (SIR), as well as the number and distances between multiple simultaneously active speakers, utilizing both simulated and measured data. The findings indicate an overall improvement in SDL performance up to third-order Ambisonics for all investigated features, while using fourth-order signals does not yield any further improvement or sometimes even delivers worse results. Notably, the HO-PIV and the SL-PIV, both extensions of the first-order pseudointensity vector (FO-PIV), have proven to be suitable input features. In particular the newly proposed SL-PIV has been found to be the best of the investigated features on third- and fourth-order Ambisonics signals, especially in the most demanding scenarios on measured data, with multiple, closely located speakers and poor SIR.",

keywords = "DOA, HOA, SDL, Spherical harmonics, SSL",
author = "Nils Poschadel and Stephan Preihs and J{\"u}rgen Peissig",
note = "Publisher Copyright: {\textcopyright} The Author(s) 2025.",
year = "2025",
month = feb,
day = "12",
doi = "10.1186/s13636-025-00393-7",
language = "English",
volume = "2025",
journal = "Eurasip Journal on Audio, Speech, and Music Processing",
issn = "1687-4714",
publisher = "Springer Publishing Company",
number = "1",
}


TY - JOUR
T1 - Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization
AU - Poschadel, Nils
AU - Preihs, Stephan
AU - Peissig, Jürgen
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/2/12
Y1 - 2025/2/12

N2 - In this paper, a detailed investigation of deep learning-based speaker detection and localization (SDL) with higher-order Ambisonics signals is conducted. Different spherical harmonic (SH) input features such as the higher-order pseudointensity vector (HO-PIV), relative harmonic coefficients (RHCs), and the spatially-localized pseudointensity vector (SL-PIV), a feature proposed for the first time as an input feature for deep learning-based SDL, are examined using first- to fourth-order SH signals. The trained neural networks, optimized with a single loss function for the combined tasks of detection and localization, are then evaluated in detail for overall SDL performance as well as their performance in the sub-tasks of detection and, particularly, localization. The results are further analyzed in dependence on room reverberation, signal-to-interference ratio (SIR), as well as the number and distances between multiple simultaneously active speakers, utilizing both simulated and measured data. The findings indicate an overall improvement in SDL performance up to third-order Ambisonics for all investigated features, while using fourth-order signals does not yield any further improvement or sometimes even delivers worse results. Notably, the HO-PIV and the SL-PIV, both extensions of the first-order pseudointensity vector (FO-PIV), have proven to be suitable input features. In particular the newly proposed SL-PIV has been found to be the best of the investigated features on third- and fourth-order Ambisonics signals, especially in the most demanding scenarios on measured data, with multiple, closely located speakers and poor SIR.

AB - In this paper, a detailed investigation of deep learning-based speaker detection and localization (SDL) with higher-order Ambisonics signals is conducted. Different spherical harmonic (SH) input features such as the higher-order pseudointensity vector (HO-PIV), relative harmonic coefficients (RHCs), and the spatially-localized pseudointensity vector (SL-PIV), a feature proposed for the first time as an input feature for deep learning-based SDL, are examined using first- to fourth-order SH signals. The trained neural networks, optimized with a single loss function for the combined tasks of detection and localization, are then evaluated in detail for overall SDL performance as well as their performance in the sub-tasks of detection and, particularly, localization. The results are further analyzed in dependence on room reverberation, signal-to-interference ratio (SIR), as well as the number and distances between multiple simultaneously active speakers, utilizing both simulated and measured data. The findings indicate an overall improvement in SDL performance up to third-order Ambisonics for all investigated features, while using fourth-order signals does not yield any further improvement or sometimes even delivers worse results. Notably, the HO-PIV and the SL-PIV, both extensions of the first-order pseudointensity vector (FO-PIV), have proven to be suitable input features. In particular the newly proposed SL-PIV has been found to be the best of the investigated features on third- and fourth-order Ambisonics signals, especially in the most demanding scenarios on measured data, with multiple, closely located speakers and poor SIR.

KW - DOA
KW - HOA
KW - SDL
KW - Spherical harmonics
KW - SSL
UR - http://www.scopus.com/inward/record.url?scp=85219752026&partnerID=8YFLogxK
U2 - 10.1186/s13636-025-00393-7
DO - 10.1186/s13636-025-00393-7
M3 - Article
AN - SCOPUS:85219752026
VL - 2025
JO - Eurasip Journal on Audio, Speech, and Music Processing
JF - Eurasip Journal on Audio, Speech, and Music Processing
SN - 1687-4714
IS - 1
M1 - 7
ER -

Research@Leibniz University
