Since the movie ’50 shades of Grey’ is about to be released I thought this is the perfect opportunity to introduce everybody to the concept of “grey lists” and the recent R package developed by Gordon Brown at my institute: The GreyListChIP R package!
ChIP-seq and many other NextGen sequencing experiments (e.g. MNase-seq, DNase-seq, FAIRE-seq) often produce artifact signal in certain regions of the genome. These so called blacklisted regions are often found at repeat elements (such as satellite, centromeric and telomeric repeats), and show unstructured and high signal (excessive pile up of reads) independently of cell line and experiment type. The ENCODE project generated two sets of human blacklists (the DAC and DUKE regions, see here: http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability).
Blacklisted regions are known to present problems for fragment length estimation and signal normalisation between samples. Although often found at repeat elements, reads typically map uniquely to these regions and so simple mappability-based filters do not remove them. Therefore the exclusion of these regions of artifact signal (aka blacklisting) is highly advised (see Hoffman et al., 2013; Kundaje, 2013; and Carroll et al., 2014).
However, even after blacklisting, Gord has noted that many cell lines and tumour samples show anomalous high signal in the input samples that are also replicated in the correspondent ChIP samples, and importantly, these are cell-type specific. He called these lists of high signal grey lists, to distinguish them from ENCODE’s black lists for two reasons:
- Firstly, because they represent regions of high artifact signals that are specific to your cell-type or sample
- Secondly, and quoting Gord directly on the package vignette: “Because they can be tuned depending on the stringency required, and so that we can make jokes about having 50 shades of grey lists”.
Greylisting is done in your input samples. You can use the R package GreyListChIP to identify these regions of artifact signal.
Creating custom greylists with GreyListChIP
Let’s put it to use:
In a previous post, we created the merged input files for MCF7 and A549 cell lines (a breast and a lung cancer cell line, respectively). Here is how one can compute the greylists using GreyListChIP R package:
library(GreyListChIP) library(BSgenome.Hsapiens.UCSC.hg19) library(ChIPQC)
#generate greylist gl.A549 <- greyListBS(BSgenome.Hsapiens.UCSC.hg19,"A549_input_merged.bam") gl.MCF7 <- greyListBS(BSgenome.Hsapiens.UCSC.hg19,"MCF7_input_merged.bam")
What does GreyListChIP do and how does it work?
Essentially, what GreyListChIP is doing is estimating a threshold on the read counts by which one should label a particular region of the genome has having “artifact” signal. Firstly, it fits a negative binomial distribution to the input sample and estimates the mu and size parameters. Then, given a p-value cutoff it computes the read count threshold.
In this example, I just used the package defaults to create my greylists. I used the function
greyListBS, which is a wrapper function for several steps of the implemented process for greylist construction. But, there are several parameters that you may want to use differently depending on the stringency criteria that you which to apply, so just have a look at the package vignette and the reference manual to explore it further.
Greylists vs. Blacklists – genome coverage
How cell-type specific are these generated greylists? And how different are they from the blacklists?
First, lets export our generated greylists into BED files:
We can also get the hg19 blacklist set from ChIPQC:
library(ChIPQC) data(blacklist_hg19) write.table(as.data.frame(blacklist.hg19), "blacklist.ChIPQC.bed", sep="t", quote=F, col.names=F, row.names=F)
To compare between multiple BED files I’ll use the
multiIntersectBed tool in BEDtools.
#This has to be run using BEDtools (outside R) multiIntersectBed -i blacklist.ChIPQC.bed MCF7_input_grey.bed A549_input_grey.bed > intersect.bed
The output (
intersect.bed) reports the presence/absence of each file at each interval that is covered by any one of the three files. Now we can use this file to do a venn diagram in R. I have made the following code below that works for this file in particular:
#Calculate the coverage of regions that are overlapped by the bed intervals a <- read.delim("intersect.bed", head=F) a$width <- a$V3 - a$V2 A <- sum(subset(a, V6 == 1 & V7 == 0 & V8 == 0, select="width")) B <- sum(subset(a, V6 == 0 & V7 == 1 & V8 == 0, select="width")) C <- sum(subset(a, V6 == 0 & V7 == 0 & V8 == 1, select="width")) AB <- sum(subset(a, V6 == 1 & V7 == 1 & V8 == 0, select="width")) BC <- sum(subset(a, V6 == 0 & V7 == 1 & V8 == 1, select="width")) AC <- sum(subset(a, V6 == 1 & V7 == 0 & V8 == 1, select="width")) ABC <- sum(subset(a, V6 == 1 & V7 == 1 & V8 == 1, select="width")) #total Kb covered by each set venn_vector <- c("blacklist"=A, "MCF7grey"=B, "blacklist&MCF7grey"=AB, "A549grey"=C, "MCF7grey&A549grey"=BC, "blacklist&A549grey"=AC, "blacklist&MCF7grey&A549grey"=ABC) / 1000
venn_vector gives us the total size of the genome covered (in Kb) by blacklists and greylists:
## blacklist MCF7grey ## 10594.715 40743.123 ## blacklist&MCF7grey A549grey ## 0.254 37579.979 ## MCF7grey&A549grey blacklist&A549grey ## 11353.757 733.920 ## blacklist&MCF7grey&A549grey ## 253.971
The venn diagram shows the portion of the genome covered by each set of regions:
With this venn diagram, we can see very clearly how the different sets of lists cover different parts of the genome, with little intersect between the genomic intervals identified in each set.
On a side note, I just want to mention that both greylists and blacklists are applicable to short-read sequencing data and not applicable to RNA-seq or other transcriptome data types.