A comparison of strategies for generating artificial replicates in RNA-seq experiments

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Babak Saremi
  • Frederic Gusmag
  • Ottmar Distl
  • Frank Schaarschmidt
  • Julia Metzger
  • Stefanie Becker
  • Klaus Jung

Research Organisations

External Research Organisations

  • University of Veterinary Medicine of Hannover, Foundation

Details

Original language: English
Article number: 7170
Journal: Scientific reports
Volume: 12
Issue number: 1
Early online date: 3 May 2022
Publication status: Published - Dec 2022

Abstract

Due to the overall high costs, technical replicates are usually omitted in RNA-seq experiments, but several methods exist to generate them artificially. Bootstrapping reads from FASTQ files has recently been used in the context of other NGS analyses and can be used to generate artificial technical replicates. Bootstrapping samples from the columns of the expression matrix has already been used for DNA microarray data and generates a new artificial replicate of the whole experiment. Mixing data of individual samples has been used for data augmentation in machine learning. The aim of this comparison is to evaluate which of these strategies is best suited to study the reproducibility of differential expression and gene-set enrichment analysis in an RNA-seq experiment. To study the approaches under controlled conditions, we performed a new RNA-seq experiment on gene expression changes upon virus infection compared to untreated control samples. In order to compare the approaches for artificial replicates, each of the samples was sequenced twice, i.e. as true technical replicates, and differential expression analysis and GO term enrichment analysis were conducted separately for the two resulting data sets. Although we observed a high correlation between the results from the two replicates, there are still many genes and GO terms that would be selected from one replicate but not from the other. Cluster analyses showed that artificial replicates generated by bootstrapping reads produce p values and fold changes that are close to those obtained from the true data sets. Results generated from artificial replicates with the approaches of column bootstrap or mixing observations were less similar to the results from the true replicates. Furthermore, the overlap of results among replicates generated by column bootstrap or mixing observations was much stronger than among the true replicates.
Artificial technical replicates generated by bootstrapping sequencing reads from FASTQ files are better suited to study the reproducibility of results from differential expression and GO term enrichment analysis in RNA-seq experiments than column bootstrap or mixing observations. However, FASTQ-bootstrapping is computationally more expensive than the other two approaches. FASTQ-bootstrapping may also be applicable to other applications of high-throughput sequencing.
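
The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory FASTQ parsing, the resampling of columns within each experimental group, and the use of a weighted average for mixing are all assumptions made for illustration only.

```python
import random


def parse_fastq(text):
    """Split FASTQ text into records of four lines (header, sequence, '+', quality)."""
    lines = text.strip().split("\n")
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def bootstrap_reads(records, rng):
    """Read-level bootstrap: draw reads with replacement, keeping the library size."""
    return [rng.choice(records) for _ in records]


def bootstrap_columns(matrix, groups, rng):
    """Column bootstrap: draw samples (columns) with replacement.

    Assumption: columns are resampled within each group so that the
    group sizes of the experiment are preserved.
    """
    chosen = []
    for label in sorted(set(groups)):
        idx = [j for j, g in enumerate(groups) if g == label]
        chosen.extend(rng.choice(idx) for _ in idx)
    new_matrix = [[row[j] for j in chosen] for row in matrix]
    new_groups = [groups[j] for j in chosen]
    return new_matrix, new_groups


def mix_samples(col_a, col_b, weight):
    """Mixing: weighted average of two samples' expression profiles (assumed form)."""
    return [weight * a + (1 - weight) * b for a, b in zip(col_a, col_b)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable

    fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII"
    reads = parse_fastq(fastq)
    print(bootstrap_reads(reads, rng))

    # Rows are genes, columns are samples (hypothetical counts).
    counts = [[10, 12, 30, 28], [5, 7, 6, 4]]
    groups = ["control", "control", "infected", "infected"]
    print(bootstrap_columns(counts, groups, rng))

    print(mix_samples([10, 0], [0, 10], 0.5))
```

In this sketch the read-level bootstrap works on the raw reads, so every artificial replicate would still have to pass through alignment and counting, which is consistent with the abstract's remark that FASTQ-bootstrapping is computationally more expensive. The other two functions operate directly on an existing expression matrix.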

Cite this

A comparison of strategies for generating artificial replicates in RNA-seq experiments. / Saremi, Babak; Gusmag, Frederic; Distl, Ottmar et al.
In: Scientific reports, Vol. 12, No. 1, 7170, 12.2022.

Saremi B, Gusmag F, Distl O, Schaarschmidt F, Metzger J, Becker S et al. A comparison of strategies for generating artificial replicates in RNA-seq experiments. Scientific reports. 2022 Dec;12(1):7170. Epub 2022 May 3. doi: 10.1038/s41598-022-11302-9
Saremi, Babak ; Gusmag, Frederic ; Distl, Ottmar et al. / A comparison of strategies for generating artificial replicates in RNA-seq experiments. In: Scientific reports. 2022 ; Vol. 12, No. 1.
@article{267d083f28324bb1b95e3912c8ab8edf,
title = "A comparison of strategies for generating artificial replicates in RNA-seq experiments",
author = "Babak Saremi and Frederic Gusmag and Ottmar Distl and Frank Schaarschmidt and Julia Metzger and Stefanie Becker and Klaus Jung",
note = "Funding Information: Open Access funding enabled and organized by Projekt DEAL. This project received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [398066876/GRK 2485/1]. Funding Information: We thank Heike Klippert-Hasberg (Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover) for technical assistance in the sequencing experiment.",
year = "2022",
month = dec,
doi = "10.1038/s41598-022-11302-9",
language = "English",
volume = "12",
journal = "Scientific reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
number = "1",

}

TY - JOUR

T1 - A comparison of strategies for generating artificial replicates in RNA-seq experiments

AU - Saremi, Babak

AU - Gusmag, Frederic

AU - Distl, Ottmar

AU - Schaarschmidt, Frank

AU - Metzger, Julia

AU - Becker, Stefanie

AU - Jung, Klaus

N1 - Funding Information: Open Access funding enabled and organized by Projekt DEAL. This project received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [398066876/GRK 2485/1]. Funding Information: We thank Heike Klippert-Hasberg (Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover) for technical assistance in the sequencing experiment.

PY - 2022/12

Y1 - 2022/12

UR - http://www.scopus.com/inward/record.url?scp=85129336434&partnerID=8YFLogxK

U2 - 10.1038/s41598-022-11302-9

DO - 10.1038/s41598-022-11302-9

M3 - Article

C2 - 35505053

AN - SCOPUS:85129336434

VL - 12

JO - Scientific reports

JF - Scientific reports

SN - 2045-2322

IS - 1

M1 - 7170

ER -
