Quality control is critical to the interpretation of RNA-seq and ChIP-seq data, and it is an essential step in any bioinformatics analysis. Here I review some of my favourite tools for evaluating RNA-seq and ChIP-seq data:
ChIPQC offers several functions for calculating quality metrics for ChIP-seq data. ChIPQC can be run on a single dataset (a single BAM file) or on a complete ChIP-seq experiment (i.e. with several replicates, conditions and cell lines/individuals and their associated controls). It can also generate a summary HTML quality report of the ChIP-seq experiment. ChIPQC implements functions to retrieve the number of reads, the number of reads overlapping blacklists (or other custom lists), the estimated fragment length, the cross-correlation profile and other well-known ChIP enrichment metrics proposed by the ENCODE consortium, such as the FRiP, NSC, and RSC (see also my previous post on ChIPQC).
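To make the FRiP metric (fraction of reads in peaks) concrete, here is a minimal Python sketch. It is not ChIPQC's implementation: it assumes a single chromosome, represents each read by one coordinate, and the function names `merge_intervals` and `frip` are made up for illustration.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: one integer coordinate per read (assumption: e.g. the 5' end)
    peaks: half-open (start, end) intervals on the same chromosome
    """
    if not read_positions:
        return 0.0
    merged = merge_intervals(peaks)
    starts = [s for s, _ in merged]
    in_peaks = 0
    for pos in read_positions:
        i = bisect_right(starts, pos) - 1  # last peak starting at or before pos
        if i >= 0 and pos < merged[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)

# Toy data: 4 of the 8 reads fall inside a peak.
reads = [5, 12, 18, 40, 55, 90, 120, 121]
peaks = [(10, 20), (15, 30), (100, 130)]
print(frip(reads, peaks))  # 0.5
```

Merging the peaks first means a read sitting in two overlapping peaks is still counted once.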
GreyListChIP is an R package that can be used to create a “greylist” of genomic positions with anomalously high signal in input samples (see my previous post on GreyListChIP). This greylist can then be used to filter out potential artifact regions prior to peak calling (to improve peak caller performance).
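The filtering step itself is a plain interval-overlap test. The sketch below only illustrates that idea and is not GreyListChIP's API: the tuple layout `(chrom, start, end)` and the function names are assumptions.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def drop_greylisted(reads, greylist):
    """Keep only reads that do not overlap any greylisted region."""
    return [r for r in reads if not any(overlaps(r, g) for g in greylist)]

# Toy data: one read overlaps the greylisted region on chr1.
reads = [("chr1", 100, 150), ("chr1", 500, 550), ("chr2", 100, 150)]
greylist = [("chr1", 120, 300)]
print(drop_greylisted(reads, greylist))
# [('chr1', 500, 550), ('chr2', 100, 150)]
```

This brute-force version compares every read against every region, which is fine for a toy example; real data would call for sorted intervals or an interval tree.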
BlackOPs can be used to characterize mappability of RNA-Seq reads and create a “blacklist” of genomic positions of mismapped reads. This blacklist can then be used to filter potential false positives from variant or RNA editing calls.
htSeqTools is a very useful and popular package for quality control, visualization and processing of high-throughput sequencing data (ChIP-seq, RNA-seq etc.). htSeqTools implements functions to compare read coverage across samples using Multi-Dimensional Scaling (MDS) plots (analogous to PCA), detect inefficient immuno-precipitation or over-amplification artifacts, identify and test for genomic regions with a large accumulation of reads, and visualize coverage profiles.
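One generic way to put a number on how unevenly reads are spread across the genome is a Gini coefficient of binned coverage: 0 means perfectly even coverage, and values close to 1 mean the reads pile up in a few bins. The sketch below illustrates that idea only; whether it matches what htSeqTools computes internally is an assumption on my part, and `gini` is a made-up name.

```python
def gini(coverage):
    """Gini coefficient of a list of non-negative per-bin read counts."""
    values = sorted(coverage)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values with 1-based ranks.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([5, 5, 5, 5]))    # 0.0  (perfectly even coverage)
print(gini([0, 0, 0, 10]))   # 0.75 (all reads in one of four bins)
```

One would expect an input sample to score lower than a well-enriched ChIP sample, so a ChIP sample that looks as flat as its input is a hint that the immuno-precipitation may not have worked.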
similaRpeak is an R package with a really smart idea. It offers a number of metrics to estimate the level of similarity between two ChIP-Seq profiles. The level of similarity between two ChIP-Seq profiles is estimated for a specific region. The vignette shows an example workflow of how to get the coverage vectors (profiles) for specific regions of your ChIP-seq samples and then how to compute the similarity metrics with similaRpeak.
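To give a flavour of what such metrics can look like, here is a small Python sketch that compares two coverage vectors over the same region. The metric names and definitions are my own simplified illustrations, loosely in the spirit of the package; the exact metrics in similaRpeak may well differ.

```python
def _ratio(a, b):
    """a / b, or None when the denominator is zero."""
    return a / b if b else None

def profile_similarity(p1, p2):
    """Simple similarity metrics between two equal-length coverage profiles."""
    if len(p1) != len(p2) or not p1:
        raise ValueError("profiles must be non-empty and of equal length")
    max1, max2 = max(p1), max(p2)
    intersect = sum(min(a, b) for a, b in zip(p1, p2))
    union = sum(max(a, b) for a, b in zip(p1, p2))
    return {
        "ratio_area": _ratio(sum(p1), sum(p2)),           # total signal
        "ratio_max": _ratio(max1, max2),                  # peak heights
        "diff_pos_max": p1.index(max1) - p2.index(max2),  # summit shift (bins)
        "ratio_intersect": _ratio(intersect, union),      # shared signal
    }

# Two toy profiles over the same five bins.
print(profile_similarity([0, 2, 6, 2, 0], [0, 1, 3, 4, 0]))
# {'ratio_area': 1.25, 'ratio_max': 1.5, 'diff_pos_max': -1, 'ratio_intersect': 0.5}
```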
ChIPseeker is an R package that implements functions to retrieve the genes nearest to each peak and to annotate the genomic region each peak falls in. It also implements statistical methods to estimate the significance of overlap among ChIP peak data sets. A nice feature of ChIPseeker is that it allows users to incorporate datasets deposited in the GEO database for comparison with their own dataset.
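The nearest-gene lookup boils down to a search over sorted transcription start sites (TSS). Here is a minimal Python sketch of that idea, assuming a single chromosome and ignoring strand; it is not ChIPseeker's code, and the gene names and the function `nearest_gene` are hypothetical.

```python
from bisect import bisect_left

def nearest_gene(peak_summit, tss_list):
    """Return (gene, signed distance from its TSS) for the closest TSS.

    tss_list: (position, gene_name) tuples sorted by position, one chromosome.
    """
    if not tss_list:
        raise ValueError("tss_list is empty")
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_summit)
    # Only the TSS just before and just after the summit can be the nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_summit))
    return gene, peak_summit - pos

tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
print(nearest_gene(4200, tss))   # ('geneB', -800)
print(nearest_gene(10000, tss))  # ('geneC', 1000)
```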
RNA-SeQC is a nice Java program from the Broad Institute for quality control of RNA-seq data. RNA-SeQC computes quality metrics such as the yield, alignment and duplication rates, GC bias, rRNA contamination, regions of alignment (exon, intron and intragenic), continuity of coverage, 3′/5′ bias, and count of detectable transcripts. It also provides multi-sample evaluation. RNA-SeQC is built on the GATK as well as the Picard API. It can also be run online; see www.genepattern.org.
RSeQC is another great package for RNA-seq QC, written in Python and C. It evaluates sequencing saturation by resampling (jackknifing) the total mapped reads. This method can be used to assess the precision of the estimated RPKMs at the current sequencing depth, and also whether the depth is sufficient for alternative splicing analyses. RSeQC provides a number of Python scripts to inspect mapped read distributions, coverage uniformity over the gene body, reproducibility, strand specificity and splice junction annotation. It also includes several useful tools to manipulate and normalize BigWig files for data visualization.
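The saturation idea can be sketched in a few lines of Python: subsample the mapped reads at increasing fractions, recompute RPKM each time, and see whether the estimate settles down. This is a simplified illustration, not RSeQC's script; the input format (one gene label per read), the fractions and the function names are assumptions.

```python
import random

def rpkm(count, gene_length, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length * total_reads)

def saturation_curve(read_genes, gene_lengths, gene,
                     fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """RPKM of one gene estimated from nested random subsamples of the reads.

    read_genes: one gene label per mapped read (hypothetical input format)
    gene_lengths: dict mapping gene label to length in bases
    """
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)
    curve = []
    for f in fractions:
        subset = shuffled[: max(1, int(len(shuffled) * f))]
        count = subset.count(gene)
        curve.append((f, rpkm(count, gene_lengths[gene], len(subset))))
    return curve

# Toy data: 1000 mapped reads over two hypothetical genes.
read_genes = ["geneA"] * 300 + ["geneB"] * 700
gene_lengths = {"geneA": 2000, "geneB": 1000}
for fraction, value in saturation_curve(read_genes, gene_lengths, "geneA"):
    print(fraction, round(value, 1))
```

If the RPKM values at the lower fractions are already close to the value at 100%, the gene is probably sequenced deeply enough; large swings between fractions suggest the estimate is still noisy at the current depth.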
FASTQC is a popular tool from the Babraham Institute. It is a fast cross-platform application, written in Java. FASTQC produces a number of raw sequence-related metrics such as sequence quality per base/cycle, nucleotide composition per sequence, sequence duplication levels, adaptor and Kmer content, and GC bias. It accepts data from BAM, SAM or FastQ files, and the output is an HTML report with a number of summary graphs and tables.
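For intuition about the per-base/cycle quality metric, the following Python sketch averages the quality scores at each read position. It assumes Phred+33 encoded quality strings (the fourth line of each FastQ record) and is only an illustration, not how FASTQC is implemented; the function name is made up.

```python
def mean_quality_per_cycle(quality_strings, offset=33):
    """Mean Phred score at each read position, for reads of varying length."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this cycle
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset  # decode ASCII to Phred score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
quals = ["IIII", "II5", "I5"]
print([round(q, 2) for q in mean_quality_per_cycle(quals)])
# [40.0, 33.33, 30.0, 40.0]
```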
SAMStat is another tool focused on raw sequence-related metrics. It is written in C and is extremely fast. It reports nucleotide composition, length distribution, base quality distribution, mapping statistics, mismatch, insertion and deletion error profiles, and di-nucleotide and 10-mer over-representation. It accepts SAM and BAM files, and the output is an HTML report. An important differentiator from FASTQC is that all statistics are reported separately for unmapped, poorly mapped and accurately mapped reads.
seqbias is an interesting R package. I have never tried it myself, but I think the idea is cool. It uses a Bayesian framework to model the per-position sequencing bias in sequencing data. The structure and parameters of the model are trained on a set of aligned reads and a reference genome sequence.
Finally, the IGV genome browser is my ultimate recommendation for data quality assessment. For instance, you can use IGV to zoom into a particular position of interest and see the Phred quality scores (Q scores) assigned to each read at that position just by moving the mouse pointer over the aligned reads. It is also very useful for inspecting single variants. Variants are indicated by vertical lines that intersect reads, and when you zoom in on a variant you can see how many times that variant was found at that position and the percentage of reads carrying it. You can also visually inspect the structure of the ChIP-seq signal in peaks and, for instance, check that the signal is real rather than an artifact (“stacks” = perfectly identical pileups).
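The variant counts and percentages are simple arithmetic over the bases piled up at one position. As a rough Python sketch (the input format, a list of base calls at a single position, and the function name are assumptions, and this is in no way IGV's code):

```python
from collections import Counter

def allele_summary(pileup_bases, ref_base):
    """Count and percentage of reads carrying each non-reference base."""
    counts = Counter(b.upper() for b in pileup_bases)
    depth = sum(counts.values())
    return {
        base: (n, 100.0 * n / depth)
        for base, n in counts.items()
        if base != ref_base.upper()
    }

# Ten reads cover the position: 7 match the reference A, 2 carry G, 1 carries T.
print(allele_summary(list("AAAAAAGGAT"), "A"))
# {'G': (2, 20.0), 'T': (1, 10.0)}
```

A base seen in a single read, like the T here, is more likely a sequencing error than a real variant, which is exactly the sort of thing worth checking by eye against the Q scores.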
Please add a comment below if you would like to mention any other relevant, new, awesome or cool piece of software!