Assessing ChIP-seq sample quality with ChIPQC

Are you looking to get a quick assessment of the quality of your ChIP-seq data? The R package ChIPQC allows you to do just that. Here is an example of ChIPQC usage with ENCODE data.

First, lets install ChIPQC from Bioconductor in R:


We need to download some ENCODE data from the web to use. As an example lets look at the GATA3 and CTCF data from breast and lung cancer cell lines:

# GATA3 ChIP files from A549 cell line (lung cancer)
curl > A549_GATA3_Myers_Rep2.bam.bai
curl > A549_GATA3_Myers_Rep2.bam
curl > A549_GATA3_Myers_Rep1.bam.bai
curl > A549_GATA3_Myers_Rep1.bam
curl > A549_GATA3_Myers_Rep2.bed.gz
curl > A549_GATA3_Myers_Rep1.bed.gz

# CTCF ChIP files from A549 cell line (lung cancer)
curl > A549_CTCF_Myers_Rep2.bam.bai
curl > A549_CTCF_Myers_Rep2.bam
curl > A549_CTCF_Myers_Rep1.bam.bai
curl > A549_CTCF_Myers_Rep1.bam
curl > A549_CTCF_Myers_Rep2.bed.gz
curl > A549_CTCF_Myers_Rep1.bed.gz

# GATA3 ChIP files from MCF7 cell line (breast cancer)
curl > MCF7_GATA3_Myers_Rep2.bam.bai
curl > MCF7_GATA3_Myers_Rep2.bam
curl > MCF7_GATA3_Myers_Rep1.bam.bai
curl > MCF7_GATA3_Myers_Rep1.bam
curl > MCF7_GATA3_Myers_Rep2.bed.gz
curl > MCF7_GATA3_Myers_Rep1.bed.gz

# CTCF ChIP files from MCF7 cell line (breast cancer)
curl > MCF7_CTCF_Myers_Rep2.bam.bai
curl > MCF7_CTCF_Myers_Rep2.bam
curl > MCF7_CTCF_Myers_Rep1.bam.bai
curl > MCF7_CTCF_Myers_Rep1.bam
curl > MCF7_CTCF_Myers_Rep2.bed.gz
curl > MCF7_CTCF_Myers_Rep1.bed.gz

Now we can read the files using R and ChIPQC. The easiest way to read in the files is using a comma-separated value (csv) sample sheet with one line for each dataset. I created the following sample sheet downloaded it from here :

samples = read.csv("samples.csv")
## SampleID Tissue Factor Replicate bamReads
## 1 MCF7_GATA3_1 MCF7 GATA3 1 MCF7_GATA3_Myers_Rep1.bam
## 2 MCF7_GATA3_2 MCF7 GATA3 2 MCF7_GATA3_Myers_Rep2.bam
## 3 MCF7_CTCF_2 MCF7 CTCF 2 MCF7_CTCF_Myers_Rep2.bam
## 4 MCF7_CTCF_1 MCF7 CTCF 1 MCF7_CTCF_Myers_Rep1.bam
## 5 A549_GATA3_1 A549 GATA3 1 A549_GATA3_Myers_Rep1.bam
## 6 A549_GATA3_2 A549 GATA3 2 A549_GATA3_Myers_Rep2.bam
## 7 A549_CTCF_1 A549 CTCF 1 A549_CTCF_Myers_Rep1.bam
## 8 A549_CTCF_2 A549 CTCF 2 A549_CTCF_Myers_Rep2.bam
## Peaks
## 1 MCF7_GATA3_Myers_Rep1.bed
## 2 MCF7_GATA3_Myers_Rep2.bed
## 3 MCF7_CTCF_Myers_Rep2.bed
## 4 MCF7_CTCF_Myers_Rep1.bed
## 5 A549_GATA3_Myers_Rep1.bed
## 6 A549_GATA3_Myers_Rep2.bed
## 7 A549_CTCF_Myers_Rep1.bed
## 8 A549_CTCF_Myers_Rep2.bed

Now we just need to load the data using ChIPQC:

exampleExp = ChIPQC(samples,annotation="hg19")
## Samples: 8 : MCF7_GATA3_1 MCF7_GATA3_2 ... A549_CTCF_1 A549_CTCF_2
## Tissue Factor Replicate Peaks
## MCF7_GATA3_1 MCF7 GATA3 1 42381
## MCF7_GATA3_2 MCF7 GATA3 2 23604
## MCF7_CTCF_2 MCF7 CTCF 2 39765
## MCF7_CTCF_1 MCF7 CTCF 1 39464
## A549_GATA3_1 A549 GATA3 1 6075
## A549_GATA3_2 A549 GATA3 2 19702
## A549_CTCF_1 A549 CTCF 1 33001
## A549_CTCF_2 A549 CTCF 2 31725
## Reads Map% Filt% Dup% ReadL FragL RelCC SSD RiP% RiBL%
## MCF7_GATA3_1 1878415 100 0 0 50 114 1.560 1.81 22.10 0.620
## MCF7_GATA3_2 826002 100 0 0 50 208 2.540 1.49 22.60 0.957
## MCF7_CTCF_2 1361078 100 0 0 50 110 1.840 2.30 24.50 0.770
## MCF7_CTCF_1 1974422 100 0 0 50 119 1.880 3.23 22.10 0.871
## A549_GATA3_1 1573136 100 0 0 50 104 0.983 1.38 2.78 0.972
## A549_GATA3_2 1896160 100 0 0 50 174 1.910 1.55 13.80 0.855
## A549_CTCF_1 2268054 100 0 0 36 151 3.080 3.31 15.20 4.060
## A549_CTCF_2 1439996 100 0 0 36 124 2.980 2.76 20.90 3.710

The table shows a summary of QC metrics, including the total number of peaks and reads in the correspondent files for each sample, the percentage of these that were successfully mapped (Map%), the percentage of filtered reads based on the mapping quality (15 by default) and the duplication rates (Dup%), the read length (ReadL) and the estimated fragment length (FragL).
Our downloaded BAM files contain only the good quality reads, so ChIPQC shows 100% alignment rate and 0% filtered or duplicated reads (having access to the complete BAM files from ENCODE would be great to fully evaluate the quality of the ChIPs in these samples!).

Next the RelativeCC and SSD scores are two metrics of ChIP-enrichment. For all samples, the RelativeCC score is greater than 1, indicating high quality, except for one of the GATA3 samples in A549 cells. The SSD score is sensitive to regions of significant pile-up or reads, here the CTCF samples in both A549 and MCF7 show higher SSD scores (>2) than GATA3 samples and so greatest enrichment for depth of signal.

The final two metrics report the percentage of reads in different regions of interest. RiP reports the percentage of reads that overlap called peaks (%RiP, also known as FRIP). This is another good indication of how ”enriched” the sample is. RiP% values for ChIPs around 5% or higher generally reflect successful enrichment (from ChIPQC vignette). Here, once again, one of the GATA3 samples in A549 cells jumps out with a low %RiP (2.78%) indicating that a big proportion of this library consists of background “noise”.
Finally, RiBL reports the percentage of reads that overlap blacklisted regions (%RiBL). The signal from blacklisted regions has been shown to influence the performance of peak callers Carroll et al., 2014.

A quick way to visualise the relative number of reads that overlap peaks vs. background reads is using the plotRap function. Here it’s clearly visible the low %RiP for one of the GATA3 samples in A549 cells:



The plotCorHeatmap shows the clustering in correlation heatmap based on correlation values for all the peak scores for each sample. In this example (and has we would expect) the CTCF and GATA3 samples form two distinct clusters:



ChIPQC plotRegi plot allows you to analyse signal enrichment at gene regions in a very useful heatmap display. We can spot an enrichment at the 5′ proximal region of genes (5’UTR and 500kb upstream TSS regions):



Finally, a nice thing about ChIPQC is that it allows you to see all these plots and many more in a nice html report, as follows:


By default, a sub directory will be created named ChIPQCreport (this is configurable using the reportFolder parameter). The main interactive HTML page is names “ChIPQC.html” and can be loaded into any web browser.
Check out the report for these ENCODE datasets (I’ve made it available online here: ChIPQC.html).


6 responses to “Assessing ChIP-seq sample quality with ChIPQC

  1. Pingback: Tools for quality control of RNA-seq and ChIP-seq data | Sequencing QC and data analysis blog·

  2. Hello Ines,
    I am a newbie for ChIP-seq analysis and I am trying to get quality metrics for one our experiments. I would like to ask a question following this post.
    For the Peaks column in the experiment csv, the peak files were downloaded as archives with .broadpeak extension. Is it necessary to change them to .bed? Should they be extracted?


  3. I want to use ChIPQC to check the quality of my Arabidopsis ChIPseq data. But no annotation files for Arabidopsis can be found in ChIPQC package. The Manual said that “you can construct your own annotation; see the package vignette for more information”. I have read all the related files, unfortunately, until now, I don’t know how to build the Arabidopsis annotation files for ChIPQC. Could you tell me how can I build the annotation file for Arabidopsis? (I am a beginner for R).
    Any help would be appreciated!


  4. I’m trying to run ChIPQC and I’m getting this error. What does this mean and how do I fix it?

    Error in Summary.factor(c(208L, 152L, 104L, 158L, 146L, 42L, 50L, 51L, : ‘max’ not meaningful for factors


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s