How much sequencing coverage do I need?

How many times have you asked yourself this question?
Possibly far too many! I found a really nice coverage guide put together by Genohub. Much of the data was collected from published coverage saturation experiments, and the table also lists the paper references behind each guideline. I think it really helps for quick assessments.

In short, it all depends on what you are sequencing (ChIP-seq, RNA-seq, whole genome sequencing, etc.) and what for (DEG analysis, peak calling, alternative splicing, SNP calling, INDEL calling, etc.).

ChIP-seq guidelines for mammalian species

For instance, the ENCODE project recommends, for ChIP-seq data, about 10 million mapped reads for datasets with sharp peaks (e.g. transcription factors), while about 20 million are considered adequate for broader peaks such as those obtained from histone modifications.

RNA-seq guidelines for mammalian species

Similarly, if you are dealing with RNA-seq data, you might consider doing a differential gene expression analysis, for which you would need about 10M-25M reads. Many more are needed for alternative splicing and allele-specific expression analysis: around 50M to 100M reads (Liu Y. et al., 2013; 2014).

Coverage guidelines for genotyping, SNP, SNV and INDEL calling

When it comes to genotyping, SNP and INDEL calling, the guidelines are often given in terms of coverage. Here coverage means the following:

coverage = (number of reads x read length) / target size

For genotyping using whole genome sequencing data you should aim for at least 35X (uniform) coverage (Ajay et al., 2011). For more on SNV and INDEL calling, see Genohub’s guidelines.
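To make the formula concrete, here is a minimal Python sketch that computes coverage from a read count and inverts the formula to estimate how many reads a target coverage requires. The function names and the 100 bp read length are assumptions chosen for illustration, not values from the guidelines; each read is counted individually, so a paired-end fragment contributes two reads.

```python
import math


def coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Average coverage: (number of reads x read length) / target size."""
    return num_reads * read_length / target_size


def reads_needed(target_coverage: float, read_length: int, target_size: int) -> int:
    """Invert the coverage formula to get the reads required for a target coverage."""
    return math.ceil(target_coverage * target_size / read_length)


GENOME_SIZE = 3_000_000_000  # ~3 Gb mammalian genome
READ_LENGTH = 100            # assumed read length in bp (illustrative only)

n = reads_needed(35, READ_LENGTH, GENOME_SIZE)
print(n)                                      # 1050000000
print(coverage(n, READ_LENGTH, GENOME_SIZE))  # 35.0
```

Note that this gives the average coverage only. It says nothing about how uniformly the reads are spread over the target.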

A few things to notice:

  1. Genome size: Firstly, while the coverage recommendations should apply across species, the read guidelines mentioned above only apply to mammalian species with genome sizes of ~3Gb. You can scale the number of reads proportionately to the genome size you are working with.
  2. Genome complexity: Secondly, it is also important to notice that even when you meet all coverage guidelines, your ability to call variants will be affected by the breadth and uniformity of the read coverage. For instance, with exome sequencing not all targeted regions are captured with the same efficiency, and so some variability is introduced in the uniformity of coverage. The efficiency of the capture can be affected by genome complexity, repetitive elements, GC content, etc. There is a good blog post by EdgeBio discussing issues with exome coverage and providing some advice.
  3. Number of replicates: Thirdly, for differential gene expression (DEG) analysis, sequencing more biological replicates will be more valuable than sequencing deeper. Liu Y. et al., 2014 show that really clearly in their paper (see figure 1B below). At 10M reads, an increase in biological replicates from 2 to 7 increases the power from ~45% to 100%, whereas increasing sequencing reads from 10M to 30M has much smaller returns. Adding replicates works for DEG analysis because we are essentially capturing biological variability between conditions, and thus having more replicates will help us distinguish true biological signal from noise, even at lower depths.

    [Figure 1B from Liu et al., 2014]

    Having said this, sequencing deeper is valuable, and cannot be replaced by more replicates, in particular cases such as detecting rare transcripts (transcriptomes) or rare genomes (metagenomes), or for SNP/INDEL calling or genotyping.

  4. Read length: Fourthly, when considering those guidelines, how important is read length? Read length is already taken into account in the X coverage calculation (see formula above). This makes sense: the longer the read, the higher the percentage of the genome covered. However, some guidelines are given in terms of millions of reads sequenced. For instance, Liu et al., 2014 refer to 10M-25M reads for DEG analysis and performed their study with 50bp single-end RNA-seq reads.
    We know that longer reads are particularly helpful for more accurate read mapping, due to reduced ambiguity in assigning reads, and for spanning gene fusions or splicing events. I think that the benefits of sequencing longer reads (50bp or more?) will ultimately lead to improved power of any analysis based on read counts, but I haven’t yet seen an analysis showing how exactly read length and read type (single-end, paired-end) affect those guidelines.

  5. Other technical variations: The same argument is true for other technical variations. For instance, coverage and read requirements can be different for other sequencing instruments or methodologies that introduce different error rates.
  6. Saturation analysis: Finally, the values described are meant to serve as guidelines only. To determine the necessary number of sequenced reads in a more accurate way, it is advised to perform a saturation analysis. For ChIP-seq experiments the method described in this paper by Zuo, C. and Keles, S. (2012) is a good starting point.
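As a rough illustration of the genome-size point (item 1), a read guideline quoted for a ~3Gb mammalian genome can be scaled linearly to another genome size. This is only a sketch of that proportional rule: the function name and the 120 Mb genome are hypothetical, chosen for illustration, and the result is a starting estimate rather than a substitute for a saturation analysis.

```python
REFERENCE_GENOME_SIZE = 3_000_000_000  # ~3 Gb, the mammalian size the read guidelines assume


def scale_reads(guideline_reads: int, genome_size: int,
                reference_size: int = REFERENCE_GENOME_SIZE) -> int:
    """Scale a mammalian read guideline proportionally to another genome size."""
    return round(guideline_reads * genome_size / reference_size)


# Hypothetical example: the 10M-read sharp-peak ChIP-seq guideline
# applied to an organism with a 120 Mb genome (illustrative size).
print(scale_reads(10_000_000, 120_000_000))  # 400000
```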


2 responses to “How much sequencing coverage do I need?”

  1. Thanks for the very useful information. I have a very quick question. I carried out ChIP-seq on TFs in complete media (full media), and I think the cells end up in different cell cycle stages, so the ChIP-seq doesn’t work so well. I got around 40 million reads/sample but the peaks are still very weak. Do you think re-sequencing the samples at a deeper depth (100 million reads) will help to observe stronger peaks?


    • Hi Le, short answer: No, don’t sequence deeper.
      There are two possible reasons why you don’t observe ‘good/strong’ peaks. One could be coverage, but you already sequenced 40 million reads and this should be enough to detect ChIP-seq peaks. The other could be that your ChIP experiment didn’t give you a good enrichment (the antibody didn’t pull down the fragments with enough efficiency, or there is a lot of noise in the data when compared to the input). I wouldn’t sequence any deeper before checking the quality of your enrichment in the ChIP experiment. If you have a low-complexity library (i.e. not many unique DNA fragments, typical of poorly enriched libraries), then sequencing deeper at this stage might produce many duplicate reads. This won’t improve your peak detection in any way, mainly because peak finders discard duplicate reads, so sequencing a low-complexity library deeper wouldn’t add any more ‘information’. You can check the quality of your ChIP enrichment by running the ChIPQC R package (http://bioconductor.org/packages/release/bioc/html/ChIPQC.html) and checking a number of ChIP-seq enrichment metrics. These metrics were first proposed by the ENCODE project (https://genome.ucsc.edu/ENCODE/qualityMetrics.html) and a number of them were implemented in the ChIPQC package, in particular the SSD, the RelativeCC and the FRiP (Fraction of Reads in Peaks, i.e. the % of reads in peaks). I wrote a small blog post about ChIPQC (https://seqqc.wordpress.com/2015/02/02/assessing-chip-seq-sample-quality-with-chipqc-4/), but I would also advise you to go through the ChIPQC vignette (http://bioconductor.org/packages/release/bioc/vignettes/ChIPQC/inst/doc/ChIPQC.pdf) and run the examples there to learn how to interpret the plots. Then you will have a more comprehensive overview of what went wrong with your ChIP!

      Best,
      Ines
