Details
| Original language | English |
|---|---|
| Article number | 111448 |
| Journal | Journal of Systems and Software |
| Volume | 193 |
| Early online date | 21 Jul 2022 |
| Publication status | Published - Nov 2022 |
Abstract
Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow.
In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant’s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5% of the cases, we observe a huge difference between the single participant’s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.
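The two comparisons described in the abstract (median perception versus predefined label, and each participant's agreement with the labels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the toy ratings, and the encoding of polarity as -1 (negative), 0 (neutral), and 1 (positive).

```python
from statistics import median

# Hypothetical toy data, not from the study.
# Polarity encoding is an assumption: -1 negative, 0 neutral, 1 positive.
labels = {"s1": 1, "s2": -1, "s3": 0, "s4": 1}

# ratings[statement][i] is the rating given by participant i.
ratings = {
    "s1": [1, 1, 0],
    "s2": [-1, 0, -1],
    "s3": [1, 1, 0],
    "s4": [1, 0, 1],
}


def median_agreement(ratings, labels):
    """Share of statements whose median rating equals the predefined label."""
    hits = sum(1 for sid, label in labels.items() if median(ratings[sid]) == label)
    return hits / len(labels)


def participant_agreement(ratings, labels, participant):
    """Share of statements where one participant's rating equals the label."""
    hits = sum(
        1 for sid, label in labels.items() if ratings[sid][participant] == label
    )
    return hits / len(labels)


if __name__ == "__main__":
    print("median agreement:", median_agreement(ratings, labels))
    for p in range(3):
        print("participant", p, "agreement:", participant_agreement(ratings, labels, p))
```

In this toy data the median matches the label for three of four statements, while no individual participant matches every label, which mirrors the kind of gap between aggregate and individual agreement that the abstract reports.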
Keywords
- Sentiment analysis, Software projects, Polarity, Development team, Communication
ASJC Scopus subject areas
- Computer Science (all): Software
- Computer Science (all): Information Systems
- Computer Science (all): Hardware and Architecture
Cite this
In: Journal of Systems and Software, Vol. 193, 111448, 11.2022.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - On the Subjectivity of Emotions in Software Projects
T2 - How Reliable Are Pre-labeled Data Sets for Sentiment Analysis?
AU - Herrmann, Marc
AU - Obaidi, Martin
AU - Chazette, Larissa
AU - Klünder, Jil
N1 - Funding Information: This research was funded by the Leibniz University Hannover as a Leibniz Young Investigator Grant (Project ComContA, Project Number 85430128, 2020–2022).
PY - 2022/11
Y1 - 2022/11
N2 - Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow. In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant’s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5% of the cases, we observe a huge difference between the single participant’s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.
AB - Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow. In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant’s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5% of the cases, we observe a huge difference between the single participant’s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.
KW - Sentiment analysis
KW - Software projects
KW - Polarity
KW - Development team
KW - Communication
UR - http://www.scopus.com/inward/record.url?scp=85134891383&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2207.07954
DO - 10.48550/arXiv.2207.07954
M3 - Article
VL - 193
JO - Journal of Systems and Software
JF - Journal of Systems and Software
SN - 0164-1212
M1 - 111448
ER -