Consistency accross ChIP-seq replicates using non-overlapping windows

Here is a quick way of evaluating the consistency across replicated samples by comparing the ChIP-signal (read counts) across non-overlapping windows.

First, lets download some data from the web:

# input data from A549 cells (3 reps)
curl > A549_Input_Myers_Rep1.bam.bai
curl > A549_Input_Myers_Rep1.bam
curl > A549_Input_Myers_Rep2.bam.bai
curl > A549_Input_Myers_Rep2.bam
curl > A549_Input_Myers_Rep3.bam.bai
curl > A549_Input_Myers_Rep3.bam

# input data from MCF7 cells (4 reps)
curl > MCF7_Input_Myers_Rep4.bam.bai
curl > MCF7_Input_Myers_Rep4.bam
curl > MCF7_Input_Myers_Rep3.bam.bai
curl > MCF7_Input_Myers_Rep3.bam
curl > MCF7_Input_Myers_Rep2.bam.bai
curl > MCF7_Input_Myers_Rep2.bam
curl > MCF7_Input_Myers_Rep1.bam.bai
curl > MCF7_Input_Myers_Rep1.bam

These are input datasets for ChIP-seq experiments. There are 3 replicates for A549 and 4 replicates for MCF7.
First, lets read the BAM files into R. Because reading large BAM files can be time-consuming, we can use the optional ScanBamParam instance (from Rsamtools package) to limit scanning only to chr1 (hg19):

chr1_length <- seqlengths(Hsapiens)[["chr1"]] #length of chromosome 1
## [1] 249250621
param <- ScanBamParam(which=GRanges("chr1", IRanges(1, chr1_length)))

Read the BAM files (chr1 only) using readGAlignments function from the GenomicAlignments package:

bamfiles = c("A549_Input_Myers_Rep1.bam","A549_Input_Myers_Rep2.bam","A549_Input_Myers_Rep3.bam",
aln <- lapply(bamfiles, readGAlignments, param=param)

Now lets compute read counts over non-overlapping 10kb windows covering the entire chromosome 1.
For this, I am using the GenomicRanges countOverlaps function as follows:

makewindows <- function(chr, windowSize, organinsm=Hsapiens)
    wstart <- seq(1,seqlengths(organinsm)[[chr]] - windowSize, windowSize)
    windows <- GRanges(chr, IRanges(wstart, wstart+(windowSize-1)))

windows <- makewindows(chr="chr1", windowSize = 10000, organinsm=Hsapiens)

# generate counts
counts <- lapply(aln, GenomicRanges::countOverlaps, query=windows)
names(counts) <- c(paste0("A459_", 1:3), paste0("MCF7_", 1:4))

We now have a vector of 24,925 windows for each BAM file that covers the entire chromosome 1:

sapply(counts, length)

## A459_1 A459_2 A459_3 MCF7_1 MCF7_2 MCF7_3 MCF7_4
## 24925 24925 24925 24925 24925 24925 24925

We can visualise the correlation coefficients between each dataset in a nice heatmap:

cormat <- cor("cbind",counts))

colfunc <- colorRampPalette(c("grey98",brewer.pal(6,"Blues")))
heatmap.2(cormat, col=colfunc(10),margins=c(15,15),symm=F,
  trace="none",distfun=function (x) as.dist(1-x),hclust=function(x) hclust(x,method="ward.D"))


The replicates correlate very well, with correlation coefficients >0.95 between replicates most of the time, except for replicate 3 from the A549 sample set:

## A459_1 A459_2 A459_3 MCF7_1 MCF7_2 MCF7_3
## A459_1 1.0000000 0.9886087 0.8249716 0.8334723 0.8957434 0.9040974
## A459_2 0.9886087 1.0000000 0.8406721 0.8145461 0.8940076 0.9044805
## A459_3 0.8249716 0.8406721 1.0000000 0.6805418 0.7518779 0.7617491
## MCF7_1 0.8334723 0.8145461 0.6805418 1.0000000 0.9533133 0.9493613
## MCF7_2 0.8957434 0.8940076 0.7518779 0.9533133 1.0000000 0.9929258
## MCF7_3 0.9040974 0.9044805 0.7617491 0.9493613 0.9929258 1.0000000
## MCF7_4 0.8939547 0.8909133 0.7511963 0.9698405 0.9850725 0.9857446
## MCF7_4
## A459_1 0.8939547
## A459_2 0.8909133
## A459_3 0.7511963
## MCF7_1 0.9698405
## MCF7_2 0.9850725
## MCF7_3 0.9857446
## MCF7_4 1.0000000

Scatterplot matrices are also a useful way of displaying the pairwise relations between replicates in each cell line:

pairs("cbind",counts[4:7]), col=rgb(0,100,0,50,maxColorValue=255), pch=16, log="xy", main="MCF7")


pairs("cbind",counts[1:3]), col=rgb(0,100,0,50,maxColorValue=255), pch=16, log="xy", main="A549")


Finally, If we’re happy with the correlation between replicated samples, we can merge the BAM files into just one using the mergeBam function:

mergeBam(bamfiles[1:3], "A549_input_merged.bam")
mergeBam(bamfiles[4:7], "MCF7_input_merged.bam")

6 responses to “Consistency accross ChIP-seq replicates using non-overlapping windows

  1. Pingback: Your ChIP in “50 shades of grey”! | Seq QC·

  2. I got a question for correlation type analyses.

    Why using coverage, (are we referring to genome coverage?) to calculate the coefficiency/correlation of chip-seq experiments and not using coordinates+read counts as in macs bed output?
    If I have only the bed files can I use your proposed method?

    Is there a tutorial that will explain the difference between the usage of coverage and read count per coordinates?

    thank you

    P.S. I got this also to your biostar answer her:


  3. Hi Ines,
    I got another question.
    In your script you are using only a part of the the whole sequence file (chr1 in your case)
    that is good for some cases, but If I want to do the correlation on the whole bam file what should I do?

    thank you once more

    Liked by 1 person

    • Hi Theodore. If you are looking for chromosome specific effects, You can do the whole code in a loop, and plot every chromosome in separate.
      That is the best option. Combining all chromosomes into just 1 plot would take very long, but if you want that, I would run the code in a loop then combine all “counts” into just one “counts” object and plot it


      • Thank you Ines for your answer
        The point is that I would like to do correlation analysis in whole bam files derived from chip-seq data. It is important to test the whole genome for similarities


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s