Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task

Publication: Contribution to book/report/anthology/conference proceedings; Abstract in conference proceedings; Research; Peer-reviewed

Details

Original language: English
Title of host publication: Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives
Subtitle: Book of Abstracts
Pages: 57
Number of pages: 1
Publication status: Published - 23 Aug 2024
Event: EARLI SIG 6&7 Biennial Conference 2024: Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives - University of Tübingen, Tübingen, Germany
Duration: 21 Aug 2024 to 23 Aug 2024
https://www.earli.org/sig-6-7-conference-2024

Abstract

Synthetic data generation is one way to mitigate data scarcity. We investigate the generation of synthetic text data by prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses of real student texts from biology education. Prompts were designed to generate positive and negative samples of intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed-methods approach for evaluating the dataset: investigating statistical commonalities and differences between synthetic and real data, and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that the ranges of text lengths and numbers of sentences are similar for synthetic and real data. Results for the similarity and lexical complexity of the texts are mixed. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns, though some generated samples are false. Generating positive samples worked better than generating negative samples. Because of generation errors, the synthetic data must be cleaned before it can be used as training data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.
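
The statistical comparison mentioned in the abstract (ranges of text length, number of sentences, and vocabulary size) could be sketched roughly as follows. The abstract does not say how these quantities were measured, so whitespace tokenization, sentence splitting on terminal punctuation, and the function name text_stats are assumptions made for illustration. The example texts are invented placeholders, not data from the study.

```python
import re


def text_stats(texts):
    """Compute simple descriptive statistics for a list of texts.

    Assumptions (not taken from the abstract): length is counted in
    whitespace-separated tokens, sentences end at '.', '!' or '?',
    and the vocabulary is the set of lowercased word tokens.
    """
    lengths = [len(t.split()) for t in texts]
    sentence_counts = [
        len([s for s in re.split(r"[.!?]+", t) if s.strip()]) for t in texts
    ]
    vocabulary = {w.lower() for t in texts for w in re.findall(r"\w+", t)}
    return {
        "length_range": (min(lengths), max(lengths)),
        "sentence_range": (min(sentence_counts), max(sentence_counts)),
        "vocabulary_size": len(vocabulary),
    }


# Hypothetical placeholder texts, for illustration only.
real_texts = [
    "Whales needed fins. So they grew them.",
    "The whale adapted to water.",
]
synthetic_texts = [
    "The whales wanted to swim better, so they changed their legs.",
]

for name, texts in [("real", real_texts), ("synthetic", synthetic_texts)]:
    print(name, text_stats(texts))
```

Comparing the two resulting dictionaries side by side gives only a first impression of overlap in ranges. It does not cover the similarity, lexical complexity, or annotation-based parts of the evaluation.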

Cite

Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. / Stanja, Judith; Hoppe, Anett; Dannemann, Sarah et al.
Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57.

Stanja, J, Hoppe, A, Dannemann, S & Krugel, J 2024, Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. in Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. p. 57, EARLI SIG 6&7 Biennial Conference 2024, Tübingen, Baden-Württemberg, Germany, 21 Aug. 2024. <https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf>
Stanja, J., Hoppe, A., Dannemann, S., & Krugel, J. (2024). Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. In Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts (p. 57). https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf
Stanja J, Hoppe A, Dannemann S, Krugel J. Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. In Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57
Stanja, Judith ; Hoppe, Anett ; Dannemann, Sarah et al. / Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57
@inbook{7c976e86ee3c487e9ae6836d957e5346,
title = "Generation and Evaluation of Synthetic Text Data for the Students{\textquoteright} Conceptions Identification Task",
abstract = "Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.",
author = "Judith Stanja and Anett Hoppe and Sarah Dannemann and Johannes Krugel",
year = "2024",
month = aug,
day = "23",
language = "English",
pages = "57",
booktitle = "Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives",
note = "EARLI SIG 6\&7 Biennial Conference 2024 ; Conference date: 21-08-2024 Through 23-08-2024",
url = "https://www.earli.org/sig-6-7-conference-2024",

}

TY - CHAP

T1 - Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task

AU - Stanja, Judith

AU - Hoppe, Anett

AU - Dannemann, Sarah

AU - Krugel, Johannes

PY - 2024/8/23

Y1 - 2024/8/23

AB - Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.

M3 - Conference abstract

SP - 57

BT - Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives

T2 - EARLI SIG 6&7 Biennial Conference 2024

Y2 - 21 August 2024 through 23 August 2024

ER -