Transcriptomics data analysis with R

Introduction to RNA-Seq

Introduction to RNA-Seq analysis in R

Dipartimento di Biomedicina e Prevenzione

Marco Chiapello, Revelo Datalab

2023-03-31

The history of sequencing

Key moments for DNA sequencing:

1953: The structure of the DNA double helix was discovered

James Watson, Francis Crick and Rosalind Franklin

1964: Robert Holley became the first person to sequence a tRNA molecule

Robert W. Holley

1972: Paul Berg developed the first technology that permitted the isolation of defined DNA fragments

Paul Berg

1973: Walter Gilbert published the first nucleotide sequence

Walter Gilbert

1977: Walter Gilbert and Allan Maxam published 'DNA sequencing by chemical degradation'.

1977: Frederick Sanger was the first to sequence the complete DNA genome of a bacteriophage, phi X174

1983: The polymerase chain reaction (PCR), a DNA amplification technique, was developed

Kary Mullis

1986: Leroy Hood announced the invention of the first semi-automated DNA sequencing machine

Leroy Hood

1987: Applied Biosystems produced the first automated sequencing machine, called ABI370.

Applied Biosystems 370A Prototype Automated DNA Gene Sequencer

1990: The Human Genome Project formally began

Early days: a DNA-sequencing lab in 1994

1998: ‘Method of nucleic acid amplification’ was developed

Patent: https://patents.google.com/patent/WO1998044151A1/en

2000: A ‘rough draft’ of the human genome was finished by the Human Genome Project

Wadman, M. ‘Rough draft’ of human genome wins researchers’ backing. Nature 393, 399–400 (1998). https://doi.org/10.1038/30790

A timeline illustrating the milestones in major genome assembly achievements [Formenti et al., 2020]

History of NGS platforms

First generation NGS

The technology was based on:

the chain-termination method
the chemical degradation method

Chemical degradation method

Chain-termination method

Advantages

Gold standard method for accurate detection of single nucleotide variants and small insertions/deletions
Cost effective where single samples need to be tested very urgently
Less reliant on computational tools than NGS
Can sequence longer fragments (up to approximately 1000 bp) than short-read NGS

Limitations

Limited throughput
Not cost effective for sequencing many genes in parallel
Can require a larger amount of input DNA than NGS
Sanger methods can only sequence short pieces of DNA, about 300 to 1000 base pairs.
The quality of a Sanger sequence is often not very good in the first 15 to 40 bases because that is where the primer binds.
Sequence quality degrades after 700 to 900 bases

First generation NGS

The main advantages of first-generation NGS technologies are:

Have a good overall sequence output
A high accuracy of 99.99%
Preparing nucleic acids of the ideal size is relatively easy

Second generation NGS

Important

Second-generation NGS machines immediately began to drive the 'genomics revolution' by massively increasing throughput through the parallelization of many reactions

Second generation NGS

Second-generation sequencing platforms:

SOLiD: Sequencing by Oligonucleotide Ligation and Detection
454 GS FLX+: It uses pyrosequencing chemistry
NextSeq 550Dx: Sequence by synthesis

Sequence by Ligation

Considered to be one of the most accurate second-generation sequencing technologies
Its drawbacks: a single run can take up to seven days, and the read length is short (35 bp)
Thermo Fisher Scientific shut down all SOLiD sequencing platforms in 2016

The chemistry relies on fluorescently labeled probes called 8-mers
8-mers are short ssDNA fragments where:
- the first and second nucleotides are complementary to the next two nucleotides of the elongating DNA strand
- the third to fifth nucleotides are degenerate
- the sixth to eighth nucleotides are inosine bases, and the eighth inosine base is attached to a characteristic fluorescent dye

A. Primer and 8-mers are added; if the first two nucleotides of an 8-mer are complementary to the template, DNA ligase ligates the 8-mer to the primer.
B. Unbound 8-mers are washed away, and a laser excites the fluorescent tag on the ligated 8-mer, which emits a detectable light signal that is captured.
C. Steps A and B are repeated until the end of the DNA fragment is reached.
D. The newly synthesized DNA strand is melted off, and steps A to C are repeated using primers of length N-1, N-2, and N-3.

Pyrosequencing

Long read lengths
High reagent cost
High error rate for homopolymers

Sequence by synthesis

Sequence by synthesis - History

1997: Evolution of a Novel Approach to Sequencing

Shankar Balasubramanian and David Klenerman

1998: Formation of Solexa

2004: Molecular Clustering Technology Integration

Cluster generation (also known as “bridge amplification”)

2005: phiX-174 Genome Sequencing

2005: Integration of Lynx Therapeutics

2007: Illumina Acquires Solexa

Sequence by synthesis - Process

Third generation NGS

Important

Third-generation methods allow direct sequencing of single DNA molecules

Third generation NGS

Third-generation NGS platforms:

Single-molecule real-time sequencing
Nanopore sequencing

Single-molecule real-time (SMRT) sequencing

Nanopore sequencing

RNASeq pipeline

flowchart TB
  subgraph vitro [In vitro]
    direction TB
      A(<font size=6>Isolate total RNA) --> B(<font size=6>Enrich a specific type of RNA)
      B --> C(<font size=6>Prepare the RNA sequencing library)
  end
  subgraph silico [In silico]
    direction TB
      D(<font size=6>Sequencing) --> E(<font size=6>Quantify expression)
      E --> F(<font size=6>Differential Expression analysis)
      F --> G(<font size=6>Biological conclusions)
  end
  vitro --> silico
  classDef className fill:#D1D1D1,stroke:#333,stroke-width:1px
  class A,B,C,D,E,F,G className;

Isolate total RNA

Quality control

RNA integrity is the major factor affecting the quality of sequencing data
Protocols of RNA-Seq libraries require samples with high-quality RNA
Prior to the construction of the libraries, it is necessary to assess the RNA integrity number (RIN)
The RIN measurement is based on a machine learning algorithm
A higher RIN value indicates a higher degree of RNA integrity (range 1–10)
RIN does not directly measure the integrity of mRNA, which is the main material used in the construction of the libraries

Enrich a specific type of RNA

mRNA
rRNAs and tRNAs (involved in mRNA translation)
Small nuclear RNAs (involved in splicing)
Small nucleolar RNAs (involved in the modification of rRNAs)
microRNA (regulate gene expression at the posttranscriptional level)
Long noncoding RNAs (chromatin remodelling, transcriptional control and posttranscriptional processing)

Enrich a specific type of RNA

Two options for mRNA enrichment

mRNA enrichment – Selectively enriching for poly(A)-tailed transcripts
RNA depletion – Selectively depleting abundant/off-target transcripts

Prepare the RNA sequencing library

Important

A sequencing library is essentially a pool of RNA fragments with adapters attached

Prepare the RNA sequencing library Step 1

Fragmentation

Breaking of RNA strands into pieces

Prepare the RNA sequencing library Step 2

Attachment of adapters

Adapters are short, chemically synthesised oligonucleotides that can be attached to the ends of DNA molecules
They can carry barcodes (indexes) that identify which sample each fragment originally came from

Prepare the RNA sequencing library Step 3

Library quantification

Important

This is key to obtaining high quality NGS data

It estimates the amount of nucleic acid molecules in a sample that are ready to be sequenced
It is important to obtain the highest possible level of complexity in an NGS library
Variety of techniques: UV absorption, intercalating dyes, hydrolysis probes and droplet digital emulsion PCR

Sequencing Step 1

BCL to FASTQ conversion

Illumina sequencing technology uses cluster generation and sequencing by synthesis chemistry to sequence millions or billions of clusters on a flow cell
During sequencing, for each cluster, base calls are made and stored in the form of individual base call (or BCL) files
When sequencing completes, the base calls in the BCL files must be converted into sequence data

Sequencing Step 2

Pre-process

Assess the quality of sequencing data
Demultiplex by index or barcode
Remove adapter sequences
Trim reads by quality
Discard reads by quality/ambiguity
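
As an illustration of trimming reads by quality, the sketch below decodes the quality string of a single FASTQ record and trims the low-quality 3' end. The record itself, the threshold `min_q`, and the simple "keep up to the last good base" rule are assumptions made for this example. Dedicated trimming tools usually apply more robust rules, such as sliding windows. Phred+33 encoding is assumed.

```r
# One hypothetical FASTQ record: header, sequence, separator, quality string
record <- c("@read1", "ACGTACGTAC", "+", "IIIIIIII#!")
seq_bases <- record[2]

# Phred+33 encoding: quality score = ASCII code of the character minus 33
qual <- utf8ToInt(record[4]) - 33L

# Assumed threshold: bases with quality below 20 are considered unreliable
min_q <- 20L

# Keep everything up to the last base that reaches the threshold
# (0 means that no base passes and the whole read is discarded)
keep <- max(c(0L, which(qual >= min_q)))
trimmed <- substr(seq_bases, 1, keep)

print(qual)
print(trimmed)
```

In this toy record the last two bases have very low quality scores, so they are removed and the first eight bases are kept.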

Sequencing Step 3

Read mapping

Mapping reads to a reference genome or transcriptome
We need to retrieve a reference genome or transcriptome
- Download from public databases
- Assemble de novo

Sequencing Step 4

Read mapping

Fast (splice-unaware) aligners to a reference transcriptome
Splice-aware aligners to a reference genome
Quasi-mappers (alignment-free mappers) to a reference transcriptome

Quantify expression

Normalized expression units

Normalized gene expression units provide consistent and comparable measures to compare and visualize gene expression counts within and across samples
Normalized gene expression units are necessary to remove technical biases such as sequencing depth, RNA composition, and gene length in sequenced data
- greater sequencing depth produces more read counts for a gene expressed at the same level
- differences in gene length generate unequal read counts for genes expressed at the same level (the longer the gene, the higher the read count)

Quantify expression Normalized expression units

Reads per million

Definition

RPM (also known as CPM) is a basic gene expression unit that normalizes only for sequencing depth (depth-normalized counts)
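
A minimal sketch of the calculation in base R: each count is divided by the total number of counts in its sample and multiplied by one million. The count matrix is a made-up toy example in which `sample2` was sequenced at twice the depth of `sample1`.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples
counts <- matrix(
  c(10,  20,
    30,  60,
    60, 120),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# Library size = total number of counts per sample
lib_size <- colSums(counts)

# CPM: divide each column by its library size, then scale to one million
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

print(cpm)
```

After normalization the two samples have identical values, because the only difference between them was sequencing depth.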

Quantify expression Normalized expression units

Reads per kilobase of transcript per million mapped reads

Definition

RPKM normalizes for both gene (transcript) length and library size (sequencing depth)
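
A minimal sketch of the calculation in base R: counts are divided by the library size in millions and by the gene length in kilobases. The counts and the gene lengths in `gene_length_bp` are hypothetical values chosen so that read counts are exactly proportional to gene length.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples
counts <- matrix(
  c(100,  200,
    200,  400,
    700, 1400),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# Assumed gene lengths in base pairs, in the same order as the rows
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 7000)

# Step 1: depth normalization (reads per million mapped reads)
rpm <- sweep(counts, 2, colSums(counts) / 1e6, "/")

# Step 2: length normalization (per kilobase); the length vector is
# recycled down each column, so it lines up with the genes in the rows
rpkm <- rpm / (gene_length_bp / 1e3)

print(rpkm)
```

In this toy data all genes end up with the same RPKM in both samples, since the differences in raw counts were entirely explained by gene length and sequencing depth.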

Quantify expression Normalized expression units

Transcripts per million

Definition

TPM is a measurement of the proportion of transcripts in your pool of RNA
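
A minimal sketch of the calculation in base R: counts are first divided by gene length in kilobases, and the resulting rates are then rescaled so that each sample sums to one million. The counts and the gene lengths in `gene_length_bp` are hypothetical.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples
counts <- matrix(
  c(100,  50,
    200, 600,
    700, 350),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# Assumed gene lengths in base pairs, in the same order as the rows
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 7000)

# Step 1: reads per kilobase (length normalization comes first)
rpk <- counts / (gene_length_bp / 1e3)

# Step 2: rescale each sample so that its values sum to one million
tpm <- sweep(rpk, 2, colSums(rpk), "/") * 1e6

print(tpm)
print(colSums(tpm))
```

Because every sample sums to the same total, a TPM value can be read as the share of the transcript pool attributed to that gene in that sample.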

Differential expression analysis

Important

A basic task in the analysis of count data from RNA-seq is the detection of differentially expressed genes.
It identifies differences in gene expression between two or more conditions, in support of specific hypothesis-driven studies.
Integrating and visualizing the results of differential gene expression (DGE) analysis can facilitate downstream studies.
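
To make the idea concrete, here is a deliberately simplified sketch in base R. It simulates log2 expression values for two conditions, runs a Welch t-test per gene, and corrects the p-values for multiple testing with the Benjamini-Hochberg method. All data are simulated and all names are hypothetical. A per-gene t-test is only a stand-in used here for illustration: RNA-seq count data are normally analysed with dedicated count-based methods.

```r
set.seed(1)

n_genes <- 200   # number of simulated genes
n_rep   <- 4     # biological replicates per condition
group   <- factor(rep(c("control", "treated"), each = n_rep))

# Simulated log2 expression values (genes in rows, samples in columns)
log_expr <- matrix(
  rnorm(n_genes * 2 * n_rep, mean = 8, sd = 0.5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0(group, "_", rep(seq_len(n_rep), 2)))
)

# Make the first 10 genes truly differentially expressed (+2 log2 units)
log_expr[1:10, group == "treated"] <- log_expr[1:10, group == "treated"] + 2

# Per-gene Welch t-test between the two conditions
p_value <- apply(log_expr, 1, function(x) {
  t.test(x[group == "treated"], x[group == "control"])$p.value
})

# Log2 fold change = difference of the group means on the log2 scale
log2_fc <- rowMeans(log_expr[, group == "treated"]) -
  rowMeans(log_expr[, group == "control"])

# Benjamini-Hochberg correction controls the false discovery rate
results <- data.frame(log2_fc, p_value,
                      padj = p.adjust(p_value, method = "BH"))

head(results[order(results$padj), ])
```

The table lists one row per gene with its log2 fold change, raw p-value, and adjusted p-value, which is the typical shape of a DGE result passed on to downstream analysis and visualization.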

Biological conclusion

How many samples do I need?

Power analysis

Type I error: controlled by the α value. Often set to 0.01 (1%) or 0.001 (0.1%) in RNA-seq experiments.
Type II error: controlled by the β value; (1−β) is the power of the analysis. Power is typically set to 70% or 80%, meaning that 70% or 80% of the truly differentially expressed genes are expected to be detected. The number of biological replicates this requires might be hard to reach in practice for RNA-seq experiments.
Effect size: this is a parameter you set. For instance, if you want to detect genes whose means differ between treatments by 2 standard deviations, then the effect size is equal to 2.
Sample size: the quantity you want to calculate.

Let’s say we want:

Type I error of 5% (α = 0.05)
Type II error of 20% (β = 0.2, power = 1 − β = 0.8)
Effect size of 2 (d = 2)

library("pwr")

pwr.t.test(d = 2,
           power = .8,
           sig.level = .05,
           type = "two.sample",
           alternative = "two.sided")


     Two-sample t test power calculation 

              n = 5.089995
              d = 2
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Let’s say we want:

Effect size of 1. (d=1)

library("pwr")

pwr.t.test(d = 1,
           power = .8,
           sig.level = .05,
           type = "two.sample",
           alternative = "two.sided")


     Two-sample t test power calculation 

              n = 16.71472
              d = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
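
To see how quickly the required number of replicates grows as the effect size shrinks, the same calculation can be repeated over several effect sizes. The sketch below uses `power.t.test()` from base R (package `stats`) rather than `pwr`, so that it runs without extra packages. With `sd = 1`, its `delta` argument plays the role of the standardized effect size `d`. The chosen effect sizes are arbitrary examples.

```r
# Effect sizes to compare (arbitrary example values)
effect_sizes <- c(0.5, 1, 2)

# Required sample size per group for 80% power at alpha = 0.05
n_per_group <- sapply(effect_sizes, function(d) {
  power.t.test(delta = d, sd = 1,
               power = 0.8, sig.level = 0.05,
               type = "two.sample",
               alternative = "two.sided")$n
})

# Round up: a fraction of a replicate is not possible
data.frame(effect_size = effect_sizes,
           replicates_per_group = ceiling(n_per_group))
```

Halving the effect size roughly quadruples the number of replicates needed in each group.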

Power analysis for RNA-seq

General Considerations

RNA-seq experiments often suffer from a low statistical power

Low power can lead to a lack of reproducibility of the research findings

The number of replicates is one of the critical parameters related to the power of an analysis

Replicates

Klaus B., EMBO J (2015) 34: 2727-2730

Do we need technical replicates?

Important

With current RNA-Seq technologies, technical variation is much lower than biological variation, and technical replicates are unnecessary

Do we need biological replicates?

YES

Important

Biological replicates are absolutely essential for differential expression analysis

For differential expression analysis, the more biological replicates, the better the estimates of biological variation and the more precise our estimates of the mean expression levels

Biological replicates are of greater importance than sequencing depth

Liu, Y., et al., Bioinformatics (2014) 30(3): 301–304

Replicates are almost always preferred to greater sequencing depth for bulk RNA-Seq

Important

However, guidelines depend on the experiment performed and the desired analysis.

Here are some examples:

General gene-level differential expression
- ENCODE guidelines suggest 30 million SE reads per sample (stranded).
- 15 million reads per sample is often sufficient, if there are a good number of replicates (>3).
- Spend money on more biological replicates, if possible.
- Generally recommended to have read length >= 50 bp

Gene-level differential expression with detection of lowly-expressed genes
- Similarly benefits from replicates more than sequencing depth.
- Sequence deeper, with at least 30-60 million reads depending on the level of expression (start with 30 million with a good number of replicates).
- Generally recommended to have read length >= 50 bp

Isoform-level differential expression
- For known isoforms, a depth of at least 30 million reads per sample and paired-end reads are suggested.
- For novel isoforms, more depth is needed (> 60 million reads per sample).
- Choose biological replicates over paired/deeper sequencing.
- Generally recommended to have read length >= 50 bp, but longer is better as the reads will be more likely to cross exon junctions
- Perform careful QC of RNA quality. Use high-quality preparation methods and restrict the analysis to samples with a high RIN.

Principles of a good experimental design

Randomization

This should remove undesired and sometimes unknown bias coming from an unidentified source of variation (e.g. different temperatures in the same greenhouse).

Replication

By repeating the same minimal experiment more than once, you can estimate the error due to the experimenter manipulation and your treatment effect.

Blocking

Blocking can help to reduce the variability left unexplained by one's model. If your group consists of male and female individuals, there is a chance that they will not respond in the same way to a given treatment. Therefore, "sex" should be a blocking factor in this case.
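
Randomization and blocking can be combined by assigning treatments at random within each block. The sketch below does this in base R for a hypothetical experiment with 12 samples, using "sex" as the blocking factor. The sample names, group sizes, and treatment labels are assumptions made for illustration.

```r
set.seed(42)  # fixed seed so that the assignment is reproducible

# Hypothetical sample sheet: 6 female and 6 male individuals
design <- data.frame(
  sample_id = paste0("S", 1:12),
  sex       = rep(c("female", "male"), each = 6)
)

# Randomly assign treatments within each block, half to each treatment
design$treatment <- NA_character_
for (block in unique(design$sex)) {
  idx <- which(design$sex == block)
  labels <- rep(c("control", "treated"), length.out = length(idx))
  design$treatment[idx] <- sample(labels)  # shuffle labels within the block
}

# Each sex should contribute equally to each treatment
table(design$sex, design$treatment)
```

Because the assignment is balanced within each sex, the treatment effect is not confounded with the blocking factor.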

Applications

Diagnostics and disease profiling

Transcriptional start sites
Uncovered alternative promoter usage
Novel splicing alterations

Human and pathogen transcriptomes

Quantifying gene expression changes
Identifying novel virulence factors
Predicting antibiotic resistance
Unveiling host-pathogen immune interactions

Responses to environment

Transcriptomics allows for the identification of genes and pathways that respond to biotic and abiotic environmental stresses
The nontargeted nature of transcriptomics allows for the identification of novel transcriptional networks in complex systems
RNA-virus identification for pathogen containment

Gene function annotation

All transcriptomic techniques have been particularly useful in identifying the functions of genes and identifying those responsible for particular phenotypes
Assembly of RNA-Seq reads is not dependent on a reference genome, which makes it ideal for gene expression studies of non-model organisms
RNA-Seq can also be used to identify previously unknown protein coding regions in existing sequenced genomes

Noncoding RNA

RNA-Seq is also applicable to noncoding RNAs, which are not translated into a protein but instead have direct functions
Many of these noncoding RNAs affect disease states, including cancer, cardiovascular, and neurological diseases

Questions?