RNA-Seq

Last updated: Nov 15th, 2021

Quality Control

Codes
  • run single sample

    fastqc <seq1.fastq seq2.fastq ... seqN.fastq>

  • run multiple samples

    multiqc </path/to/*_fastqc.zip>


  • References
  • FastQC
  • MultiQC
  • Alignment

    Codes
  • Create a genome index

    STAR --runThreadN <NumberOfThreads> \
         --runMode genomeGenerate \
         --genomeDir </path/to/STARgenomeDir> \
         --genomeFastaFiles </path/to/genome/fasta> \
         --sjdbGTFfile </path/to/annotations.gtf> \
         --sjdbOverhang <ReadLength-1> \
         --genomeSAindexNbases <indexNbases>

    <indexNbases> = min(14, log2(GenomeLength)/2 - 1)
  • Map reads to the genome

    STAR --twopassMode Basic \
         --runThreadN <NumberOfThreads> \
         --outSAMtype BAM Unsorted \
         --outFilterMultimapNmax 1 \
         --chimSegmentMin 20 \
         --quantMode TranscriptomeSAM \
         --genomeDir </path/to/genomeDir> \
         --sjdbGTFfile </path/to/reads> \
         --outFileNamePrefix </path/to/output/dir/prefix> \
         --readFilesIn </path/to/fastq_file> \
         --readFilesCommand <uncompressedMethod>

    <uncompressedMethod> (optional): "zcat" if input is "fastq.gz"; "bunzip2 -c" if input is "fastq.bz2";

  • References
  • STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21.
  • STAR GitHub
  • Quantification

    Codes
  • Create a genome index

    rsem-prepare-reference -p <NumberOfThreads> \
                           --gtf </path/to/annotations.gtf> \
                           </path/to/genome/fasta> \
                           </path/to/RSEMgenomeDir>

  • Estimate gene-level abundance

    rsem-calculate-expression -p <NumberOfThreads> \
                              --quiet \
                              --bam \
                              --no-bam-output \
                              [--paired-end] \
                              </path/to/Aligned.toTranscriptome.out.bam> \
                              </path/to/RSEMindexDir> \
                              <output_prefix>


  • References
  • RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323.
  • RSEM GitHub
  • Normalization

    Codes
  • Creates a DGEList object from a table of counts (rows=features, columns=samples)

    #R code
    library(edgeR)
    all.data <- DGEList(counts = count.table, group = group)

  • Filtering genes that are not expressed based on the count threshold & the repeat threshold

    all.data <- all.data[rowSums(cpm(all.data) > count_threshold) >= rep_threshold, , keep.lib.sizes=FALSE]

  • Normalising gene expression distributions

    all.data <- calcNormFactors(all.data)

  • Log-Transform the gene count

    lcpm <- cpm(all.data, log=TRUE)


  • References
  • edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40.
  • edgeR R package
  • Dimension Reduction

    Codes
  • PCA reduction

    #R code
    library(factoextra)
    pca <- PCA(t(lcpm), graph = FALSE)

  • Get eigenvalue

    fviz_eig(pca, addlabels = TRUE)

  • Get variances

    get_pca_var(pca)

  • Log-Transform the gene count

    fviz_pca_ind(pca, geom = c("point"), habillage = group)


  • References
  • factoextra R package
  • Differential Gene Expression

    Codes
  • To set up the design matrix

    #R code
    library(edgeR)
    triplicate Examples
    grp <- factor(c("ctrl", "ctrl", "ctrl", "test", "test", “test”), levels = c(ctrl, test), labels = c(ctrl, test))
    design <- model.matrix(~grp)
    colnames(design) <- levels(grp)
    rownames(design) <- colnames(all.data)
    all.data <- estimateDisp(all.data, design, robust=TRUE)

  • Exact test

    et <- exactTest(all.data)
    is.de <- decideTestsDGE(et, p.value = 0.05, lfc = 1, adjust.method = “fdr”)
    degsum <- summary(is.de)

  • Differential gene list

    degtable <- et$table
    degtable$Pvalue <- p.adjust(method="fdr",p=et$table$PValue)
    degtable$DGEtest <- is.de@.Data


  • References
  • edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40.
  • edgeR R package
  • Heatmap

    Codes
  • Draw Heatmap from log CPM (example)

    GOItable <- lcpm[deglist,]
    pheatmap(GOItable, cex=1, scale = "row", show_rownames = FALSE)


  • References
  • pheatmap R package
  • GSEA (Gene Set Enrichment Analysis)

    Examples

    IMB Bioinformatics Core Pipeline: bbsc_portal
    STAR --twopassMode Basic --outSAMtype BAM Unsorted --outFilterMultimapNmax 1 --chimSegmentMin 20 --quantMode TranscriptomeSAM --genomeDir path_to_ref_genome/star_index --sjdbGTFfile path_to_ref_genome/ref.gtf --outFileNamePrefix star_rsem/sample --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz --runThreadN 16 --readFilesCommand zcat
    
    rsem-calculate-expression --quiet -p 8 --bam --no-bam-output --paired-end star_rsem/sampleAligned.toTranscriptome.out.bam path_to_ref_genome/rsem_index/ref sample
                                            
    samtools sort -@ 8 -o sorted_bam/sample.bam star_rsem/sampleAligned.out.bam