Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers

Jan Karem Höhne; Timo Lenzner; Joshua Claassen

doi:10.1080/13645579.2024.2443633

Details

Original language	English
Journal	International Journal of Social Research Methodology
Early online date	1 Jan 2025
Publication status	E-pub ahead of print - 1 Jan 2025

Abstract

Advances in information and communication technology, coupled with the growing use of smartphones in web surveys, provide new avenues for collecting answers from respondents. Specifically, smartphone microphones make it possible to collect voice instead of text answers to open questions. Speech-to-text transcription through Automatic Speech Recognition (ASR) systems offers an efficient way to make voice answers accessible to text-as-data methods. However, there is little evidence on the transcription performance of ASR systems when it comes to voice answers. We therefore investigate the performance of two leading ASR systems, Google’s Cloud Speech-to-Text API and OpenAI’s Whisper, using voice answers to two open questions administered in a smartphone survey in Germany. The results indicate that Whisper produces more accurate transcriptions than Google’s API. Both systems produce similar errors, but these errors are more common for the Google API. However, the Google API is faster than both Whisper and human transcribers.

Keywords

Automatic speech recognition (ASR), built-in microphone, narrative questions, smartphone survey, transcription quality

ASJC Scopus subject areas

Social Sciences (all)
General Social Sciences

Cite this

Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers. / Höhne, Jan Karem; Lenzner, Timo; Claassen, Joshua.
In: International Journal of Social Research Methodology, 01.01.2025.

Research output: Contribution to journal › Article › Research › peer review

Höhne, JK, Lenzner, T & Claassen, J 2025, 'Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers', International Journal of Social Research Methodology. https://doi.org/10.1080/13645579.2024.2443633

Höhne, J. K., Lenzner, T., & Claassen, J. (2025). Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers. International Journal of Social Research Methodology. Advance online publication. https://doi.org/10.1080/13645579.2024.2443633

Höhne JK, Lenzner T, Claassen J. Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers. International Journal of Social Research Methodology. 2025 Jan 1. Epub 2025 Jan 1. doi: 10.1080/13645579.2024.2443633

Höhne, Jan Karem; Lenzner, Timo; Claassen, Joshua. / Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers. In: International Journal of Social Research Methodology. 2025.

BibTeX

@article{1f355341e8c6461b887f6e9fcdcfb3fb,

title = "Automatic speech-to-text transcription: evidence from a smartphone survey with voice answers",

abstract = "Advances in information and communication technology, coupled with a smartphone increase in web surveys, provide new avenues for collecting answers from respondents. Specifically, the microphones of smartphones facilitate the collection of voice instead of text answers to open questions. Speech-to-text transcriptions through Automatic Speech Recognition (ASR) systems pose an efficient way to make voice answers accessible to text-as-data methods. However, there is little evidence on the transcription performance of ASR systems when it comes to voice answers. We therefore investigate the performance of two leading ASR systems–Google{\textquoteright}s Cloud Speech-to-Text API and OpenAI{\textquoteright}s Whisper–using voice answers to two open questions administered in a smartphone survey in Germany. The results indicate that Whisper produces more accurate transcriptions than Google{\textquoteright}s API. Both systems produce similar errors, but these errors are more common for the Google API. However, the Google API is faster than both Whisper and human transcribers.",

keywords = "Automatic speech recognition (ASR), built-in microphone, narrative questions, smartphone survey, transcription quality",

author = "H{\"o}hne, Jan Karem and Lenzner, Timo and Claassen, Joshua",

note = "Publisher Copyright: {\textcopyright} 2024 The Author(s). Published by Informa UK Limited, trading as Taylor \& Francis Group.",

year = "2025",

month = jan,

day = "1",

doi = "10.1080/13645579.2024.2443633",

language = "English",

journal = "International Journal of Social Research Methodology",

issn = "1364-5579",

publisher = "Taylor and Francis Ltd.",

}

RIS

TY - JOUR

T1 - Automatic speech-to-text transcription

T2 - evidence from a smartphone survey with voice answers

AU - Höhne, Jan Karem

AU - Lenzner, Timo

AU - Claassen, Joshua

PY - 2025/1/1

Y1 - 2025/1/1

AB - Advances in information and communication technology, coupled with a smartphone increase in web surveys, provide new avenues for collecting answers from respondents. Specifically, the microphones of smartphones facilitate the collection of voice instead of text answers to open questions. Speech-to-text transcriptions through Automatic Speech Recognition (ASR) systems pose an efficient way to make voice answers accessible to text-as-data methods. However, there is little evidence on the transcription performance of ASR systems when it comes to voice answers. We therefore investigate the performance of two leading ASR systems–Google’s Cloud Speech-to-Text API and OpenAI’s Whisper–using voice answers to two open questions administered in a smartphone survey in Germany. The results indicate that Whisper produces more accurate transcriptions than Google’s API. Both systems produce similar errors, but these errors are more common for the Google API. However, the Google API is faster than both Whisper and human transcribers.

KW - Automatic speech recognition (ASR)

KW - built-in microphone

KW - narrative questions

KW - smartphone survey

KW - transcription quality

UR - http://www.scopus.com/inward/record.url?scp=85214406860&partnerID=8YFLogxK

U2 - 10.1080/13645579.2024.2443633

DO - 10.1080/13645579.2024.2443633

M3 - Article

AN - SCOPUS:85214406860

JO - International Journal of Social Research Methodology

JF - International Journal of Social Research Methodology

SN - 1364-5579

ER -
