Are you looking to get a quick assessment of the quality of your ChIP-seq data? The R package ChIPQC allows you to do just that. Here is an example of ChIPQC usage with ENCODE data.
First, lets install ChIPQC from Bioconductor in R:
We need to download some ENCODE data from the web to use. As an example lets look at the GATA3 and CTCF data from breast and lung cancer cell lines:
# GATA3 ChIP files from A549 cell line (lung cancer) curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Gata3V0422111AlnRep2.bam.bai > A549_GATA3_Myers_Rep2.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Gata3V0422111AlnRep2.bam > A549_GATA3_Myers_Rep2.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Gata3V0422111AlnRep1.bam.bai > A549_GATA3_Myers_Rep1.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Gata3V0422111AlnRep1.bam > A549_GATA3_Myers_Rep1.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Gata3V0422111PkRep2.broadPeak.gz > A549_GATA3_Myers_Rep2.bed.gz curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Gata3V0422111PkRep1.broadPeak.gz > A549_GATA3_Myers_Rep1.bed.gz # CTCF ChIP files from A549 cell line (lung cancer) curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Ctcfsc5916Pcr1xEtoh02AlnRep2.bam.bai > A549_CTCF_Myers_Rep2.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Ctcfsc5916Pcr1xEtoh02AlnRep2.bam > A549_CTCF_Myers_Rep2.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Ctcfsc5916Pcr1xEtoh02AlnRep1.bam.bai > A549_CTCF_Myers_Rep1.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Ctcfsc5916Pcr1xEtoh02AlnRep1.bam > A549_CTCF_Myers_Rep1.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Ctcfsc5916Pcr1xEtoh02PkRep2.broadPeak.gz > A549_CTCF_Myers_Rep2.bed.gz curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsA549Ctcfsc5916Pcr1xEtoh02PkRep1.broadPeak.gz > A549_CTCF_Myers_Rep1.bed.gz # GATA3 ChIP files from MCF7 cell line (breast cancer) curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7Gata3V0422111AlnRep2.bam.bai > MCF7_GATA3_Myers_Rep2.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7Gata3V0422111AlnRep2.bam > MCF7_GATA3_Myers_Rep2.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7Gata3V0422111AlnRep1.bam.bai > MCF7_GATA3_Myers_Rep1.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7Gata3V0422111AlnRep1.bam > MCF7_GATA3_Myers_Rep1.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7Gata3V0422111PkRep2.broadPeak.gz > MCF7_GATA3_Myers_Rep2.bed.gz curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7Gata3V0422111PkRep1.broadPeak.gz > MCF7_GATA3_Myers_Rep1.bed.gz # CTCF ChIP files from MCF7 cell line (breast cancer) curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7CtcfcV0422111AlnRep2.bam.bai > MCF7_CTCF_Myers_Rep2.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7CtcfcV0422111AlnRep2.bam > MCF7_CTCF_Myers_Rep2.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7CtcfcV0422111AlnRep1.bam.bai > MCF7_CTCF_Myers_Rep1.bam.bai curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7CtcfcV0422111AlnRep1.bam > MCF7_CTCF_Myers_Rep1.bam curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7CtcfcV0422111PkRep2.broadPeak.gz > MCF7_CTCF_Myers_Rep2.bed.gz curl http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsMcf7CtcfcV0422111PkRep1.broadPeak.gz > MCF7_CTCF_Myers_Rep1.bed.gz
Now we can read the files using R and ChIPQC. The easiest way to read in the files is using a comma-separated value (csv) sample sheet with one line for each dataset. I created the following sample sheet downloaded it from here :
samples = read.csv("samples.csv") samples
## SampleID Tissue Factor Replicate bamReads ## 1 MCF7_GATA3_1 MCF7 GATA3 1 MCF7_GATA3_Myers_Rep1.bam ## 2 MCF7_GATA3_2 MCF7 GATA3 2 MCF7_GATA3_Myers_Rep2.bam ## 3 MCF7_CTCF_2 MCF7 CTCF 2 MCF7_CTCF_Myers_Rep2.bam ## 4 MCF7_CTCF_1 MCF7 CTCF 1 MCF7_CTCF_Myers_Rep1.bam ## 5 A549_GATA3_1 A549 GATA3 1 A549_GATA3_Myers_Rep1.bam ## 6 A549_GATA3_2 A549 GATA3 2 A549_GATA3_Myers_Rep2.bam ## 7 A549_CTCF_1 A549 CTCF 1 A549_CTCF_Myers_Rep1.bam ## 8 A549_CTCF_2 A549 CTCF 2 A549_CTCF_Myers_Rep2.bam ## Peaks ## 1 MCF7_GATA3_Myers_Rep1.bed ## 2 MCF7_GATA3_Myers_Rep2.bed ## 3 MCF7_CTCF_Myers_Rep2.bed ## 4 MCF7_CTCF_Myers_Rep1.bed ## 5 A549_GATA3_Myers_Rep1.bed ## 6 A549_GATA3_Myers_Rep2.bed ## 7 A549_CTCF_Myers_Rep1.bed ## 8 A549_CTCF_Myers_Rep2.bed
Now we just need to load the data using ChIPQC:
library(ChIPQC) exampleExp = ChIPQC(samples,annotation="hg19") exampleExp
## Samples: 8 : MCF7_GATA3_1 MCF7_GATA3_2 ... A549_CTCF_1 A549_CTCF_2 ## Tissue Factor Replicate Peaks ## MCF7_GATA3_1 MCF7 GATA3 1 42381 ## MCF7_GATA3_2 MCF7 GATA3 2 23604 ## MCF7_CTCF_2 MCF7 CTCF 2 39765 ## MCF7_CTCF_1 MCF7 CTCF 1 39464 ## A549_GATA3_1 A549 GATA3 1 6075 ## A549_GATA3_2 A549 GATA3 2 19702 ## A549_CTCF_1 A549 CTCF 1 33001 ## A549_CTCF_2 A549 CTCF 2 31725 ## Reads Map% Filt% Dup% ReadL FragL RelCC SSD RiP% RiBL% ## MCF7_GATA3_1 1878415 100 0 0 50 114 1.560 1.81 22.10 0.620 ## MCF7_GATA3_2 826002 100 0 0 50 208 2.540 1.49 22.60 0.957 ## MCF7_CTCF_2 1361078 100 0 0 50 110 1.840 2.30 24.50 0.770 ## MCF7_CTCF_1 1974422 100 0 0 50 119 1.880 3.23 22.10 0.871 ## A549_GATA3_1 1573136 100 0 0 50 104 0.983 1.38 2.78 0.972 ## A549_GATA3_2 1896160 100 0 0 50 174 1.910 1.55 13.80 0.855 ## A549_CTCF_1 2268054 100 0 0 36 151 3.080 3.31 15.20 4.060 ## A549_CTCF_2 1439996 100 0 0 36 124 2.980 2.76 20.90 3.710
The table shows a summary of QC metrics, including the total number of peaks and reads in the correspondent files for each sample, the percentage of these that were successfully mapped (Map%), the percentage of filtered reads based on the mapping quality (15 by default) and the duplication rates (Dup%), the read length (ReadL) and the estimated fragment length (FragL).
Our downloaded BAM files contain only the good quality reads, so ChIPQC shows 100% alignment rate and 0% filtered or duplicated reads (having access to the complete BAM files from ENCODE would be great to fully evaluate the quality of the ChIPs in these samples!).
Next the RelativeCC and SSD scores are two metrics of ChIP-enrichment. For all samples, the RelativeCC score is greater than 1, indicating high quality, except for one of the GATA3 samples in A549 cells. The SSD score is sensitive to regions of significant pile-up or reads, here the CTCF samples in both A549 and MCF7 show higher SSD scores (>2) than GATA3 samples and so greatest enrichment for depth of signal.
The final two metrics report the percentage of reads in different regions of interest. RiP reports the percentage of reads that overlap called peaks (%RiP, also known as FRIP). This is another good indication of how ”enriched” the sample is. RiP% values for ChIPs around 5% or higher generally reflect successful enrichment (from ChIPQC vignette). Here, once again, one of the GATA3 samples in A549 cells jumps out with a low %RiP (2.78%) indicating that a big proportion of this library consists of background “noise”.
Finally, RiBL reports the percentage of reads that overlap blacklisted regions (%RiBL). The signal from blacklisted regions has been shown to influence the performance of peak callers Carroll et al., 2014.
A quick way to visualise the relative number of reads that overlap peaks vs. background reads is using the plotRap function. Here it’s clearly visible the low %RiP for one of the GATA3 samples in A549 cells:
The plotCorHeatmap shows the clustering in correlation heatmap based on correlation values for all the peak scores for each sample. In this example (and has we would expect) the CTCF and GATA3 samples form two distinct clusters:
ChIPQC plotRegi plot allows you to analyse signal enrichment at gene regions in a very useful heatmap display. We can spot an enrichment at the 5′ proximal region of genes (5’UTR and 500kb upstream TSS regions):
Finally, a nice thing about ChIPQC is that it allows you to see all these plots and many more in a nice html report, as follows:
By default, a sub directory will be created named ChIPQCreport (this is configurable using the reportFolder parameter). The main interactive HTML page is names “ChIPQC.html” and can be loaded into any web browser.
Check out the report for these ENCODE datasets (I’ve made it available online here: ChIPQC.html).