Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task

Judith Stanja; Anett Hoppe; Sarah Dannemann; Johannes Krugel

Details

Original language	English
Title of host publication	Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives
Subtitle	Book of Abstracts
Pages	57
Number of pages	1
Publication status	Published - 23 Aug 2024
Event	EARLI SIG 6&7 Biennial Conference 2024: Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives - University of Tübingen, Tübingen, Germany. Duration: 21 Aug 2024 → 23 Aug 2024. https://www.earli.org/sig-6-7-conference-2024

Abstract

Synthetic data generation is one way to mitigate data scarcity. We investigate the generation of synthetic text data by prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses of real student texts carried out in biology education. Prompts were designed to generate positive and negative samples of intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed-methods approach for evaluating the dataset: investigating statistical commonalities and differences between synthetic and real data, and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that the ranges of text lengths and numbers of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of the texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns, though we also get false samples. Generating positive samples worked better than generating negative samples. Due to generation errors, the synthetic data must be cleaned before it can be used as training data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations between correct positive and negative samples. We identify open questions and further steps for future research.
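The statistical comparison and the agreement check described in the abstract can be illustrated with a minimal sketch. The abstract does not name the tokenizer, the sentence splitter, or the agreement coefficient that were used, so everything below is an assumption: the regex tokenizer, the naive sentence split, the choice of Cohen's kappa, and the function names `corpus_stats` and `cohen_kappa` are hypothetical, and the example texts and labels are invented toy data, not data from the study.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens from a deliberately simple regex (assumption)."""
    return re.findall(r"[a-zäöüß']+", text.lower())


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def corpus_stats(texts):
    """Ranges of text length and sentence count, plus vocabulary size."""
    token_counts = [len(tokenize(t)) for t in texts]
    sentence_counts = [len(split_sentences(t)) for t in texts]
    vocabulary = {tok for t in texts for tok in tokenize(t)}
    return {
        "tokens_min": min(token_counts),
        "tokens_max": max(token_counts),
        "sentences_min": min(sentence_counts),
        "sentences_max": max(sentence_counts),
        "vocabulary_size": len(vocabulary),
    }


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy data, for illustration only.
real_texts = [
    "Whales needed fins to swim. So they grew them.",
    "The whale adapted because it wanted to live in water.",
]
synthetic_texts = [
    "Whales developed flippers in order to swim better.",
    "The animals changed. They wanted to survive in the sea.",
]

print("real:", corpus_stats(real_texts))
print("synthetic:", corpus_stats(synthetic_texts))
print("kappa:", cohen_kappa(["pos", "pos", "neg", "neg"],
                            ["pos", "pos", "neg", "pos"]))
```

Comparing the two dictionaries side by side gives the kind of range comparison the abstract mentions. A real analysis would likely use a proper tokenizer and an agreement measure suited to the annotation scheme.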

Cite

Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. / Stanja, Judith; Hoppe, Anett; Dannemann, Sarah et al.
Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57.

Publication: Contribution to book/report/anthology/conference proceedings › Abstract in conference proceedings › Research › Peer-reviewed

Stanja, J, Hoppe, A, Dannemann, S & Krugel, J 2024, Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. in Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. p. 57, EARLI SIG 6&7 Biennial Conference 2024, Tübingen, Baden-Württemberg, Germany, 21 Aug 2024. <https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf>

Stanja, J., Hoppe, A., Dannemann, S., & Krugel, J. (2024). Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. In Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts (p. 57) https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf

Stanja J, Hoppe A, Dannemann S, Krugel J. Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. in Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57

Stanja, Judith ; Hoppe, Anett ; Dannemann, Sarah et al. / Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57


@inbook{7c976e86ee3c487e9ae6836d957e5346,

title = "Generation and Evaluation of Synthetic Text Data for the Students{\textquoteright} Conceptions Identification Task",

abstract = "Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.",

author = "Judith Stanja and Anett Hoppe and Sarah Dannemann and Johannes Krugel",

year = "2024",

month = aug,

day = "23",

language = "English",

pages = "57",

booktitle = "Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives",

note = "EARLI SIG 6\&7 Biennial Conference 2024 ; Conference date: 21-08-2024 Through 23-08-2024",

url = "https://www.earli.org/sig-6-7-conference-2024",

}


TY - CHAP

T1 - Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task

AU - Stanja, Judith

AU - Hoppe, Anett

AU - Dannemann, Sarah

AU - Krugel, Johannes

PY - 2024/8/23

Y1 - 2024/8/23

N2 - Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.

AB - Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.

M3 - Conference abstract

SP - 57

BT - Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives

T2 - EARLI SIG 6&7 Biennial Conference 2024

Y2 - 21 August 2024 through 23 August 2024

ER -

By the same authors

Block-Based Programming Learning Tool for ML and AI Education (Work in Progress)

Information Encoding in Computer Science Education Using the Cup Song

Formative assessment strategies for students' conceptions—The potential of learning analytics

Information Encoding Modeling in Computer Science Education

A Hybrid Course System on Artificial Intelligence: Insights Into the Didactic Support of Instructors
