BioinfoDocs - Tutorials

Quality Control

To use FastQC to conduct the quality control for high-throughput sequence data.

Codes

run single sample

fastqc <seq1.fastq seq2.fastq ... seqN.fastq>

run multiple samples

multiqc </path/to/*_fastqc.zip>

References

FastQC

MultiQC

Alignment

To utilize "STAR" to align sequencing reads to the reference genome

Codes

Create a genome index

STAR --runThreadN <NumberOfThreads> \ --runMode genomeGenerate \ --genomeDir </path/to/STARgenomeDir> \ --genomeFastaFiles </path/to/genome/fasta> \ --sjdbGTFfile </path/to/annotations.gtf> \ --sjdbOverhang <ReadLength-1> \ --genomeSAindexNbases <indexNbases>

<indexNbases> = min(14, log2(GenomeLength)/2 - 1)

Map reads to the genome

STAR --twopassMode Basic \ --runThreadN <NumberOfThreads> \ --outSAMtype BAM Unsorted \ --outFilterMultimapNmax 1 \ --chimSegmentMin 20 \ --quantMode TranscriptomeSAM \ --genomeDir </path/to/genomeDir> \ --sjdbGTFfile </path/to/reads> \ --outFileNamePrefix </path/to/output/dir/prefix> \ --readFilesIn </path/to/fastq_file> \ --readFilesCommand <uncompressedMethod>

<uncompressedMethod> (optional): "zcat" if input is "fastq.gz"; "bunzip2 -c" if input is "fastq.bz2";

References

STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21.

STAR GitHub

Quantification

To utilize "RSEM" to quantify gene and isoform abundances

Codes

Create a genome index

rsem-prepare-reference -p <NumberOfThreads> \ --gtf </path/to/annotations.gtf> \ </path/to/genome/fasta> \ </path/to/RSEMgenomeDir>

Estimate gene-level abundance

rsem-calculate-expression -p <NumberOfThreads> \ --quiet \ --bam \ --no-bam-output \ [--paired-end] \ </path/to/Aligned.toTranscriptome.out.bam> \ </path/to/RSEMindexDir> \ <output_prefix>

References

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323.

RSEM GitHub

Normalization

To utilize "edgeR" to normalize the gene counts

Codes

Creates a DGEList object from a table of counts (rows=features, columns=samples)

#R code library(edgeR) all.data <- DGEList(counts = count.table, group = group)

Filtering genes that are not expressed based on the count threshold & the repeat threshold

all.data <- all.data[rowSums(cpm(all.data) > count_threshold) >= rep_threshold, , keep.lib.sizes=FALSE]

Normalising gene expression distributions

all.data <- calcNormFactors(all.data)

Log-Transform the gene count

lcpm <- cpm(all.data, log=TRUE)

References

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40.

edgeR R package

Dimension Reduction

To use R package to conduct the Principal Component Analysis

Codes

PCA reduction

#R code library(factoextra) pca <- PCA(t(lcpm), graph = FALSE)

Get eigenvalue

fviz_eig(pca, addlabels = TRUE)

Get variances

get_pca_var(pca)

Log-Transform the gene count

fviz_pca_ind(pca, geom = c("point"), habillage = group)

References

factoextra R package

Differential Gene Expression

To use "edgeR" R package to conduct the Differential Expression Gene test

Codes

To set up the design matrix

#R code library(edgeR) triplicate Examples grp <- factor(c("ctrl", "ctrl", "ctrl", "test", "test", “test”), levels = c(ctrl, test), labels = c(ctrl, test)) design <- model.matrix(~grp) colnames(design) <- levels(grp) rownames(design) <- colnames(all.data) all.data <- estimateDisp(all.data, design, robust=TRUE)

Exact test

et <- exactTest(all.data) is.de <- decideTestsDGE(et, p.value = 0.05, lfc = 1, adjust.method = “fdr”) degsum <- summary(is.de)

Differential gene list

degtable <- et$table degtable$Pvalue <- p.adjust(method="fdr",p=et$table$PValue) degtable$DGEtest <- is.de@.Data

References

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40.

edgeR R package

Heatmap

To use "pheatmap" to draw clustered heatmaps

Codes

Draw Heatmap from log CPM (example)

GOItable <- lcpm[deglist,] pheatmap(GOItable, cex=1, scale = "row", show_rownames = FALSE)

References

pheatmap R package

GSEA (Gene Set Enrichment Analysis)

To use GSEA & MSigDB to conduct the Gene Set Enrichment Analysis

References:

GSEA

GSEA_R

Pathway Commons

HOW TO PERFORM GSEA - A tutorial on gene set enrichment analysis for RNA-seq (Video)

Examples

IMB Bioinformatics Core Pipeline: bbsc_portal

STAR --twopassMode Basic --outSAMtype BAM Unsorted --outFilterMultimapNmax 1 --chimSegmentMin 20 --quantMode TranscriptomeSAM --genomeDir path_to_ref_genome/star_index --sjdbGTFfile path_to_ref_genome/ref.gtf --outFileNamePrefix star_rsem/sample --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz --runThreadN 16 --readFilesCommand zcat

rsem-calculate-expression --quiet -p 8 --bam --no-bam-output --paired-end star_rsem/sampleAligned.toTranscriptome.out.bam path_to_ref_genome/rsem_index/ref sample
                                        
samtools sort -@ 8 -o sorted_bam/sample.bam star_rsem/sampleAligned.out.bam