How many times have you asked yourself this question?
Possibly, far too many! I found a really nice coverage guide put together by Genohub. Much of the data was collected from published coverage saturation experiments, and the table also lists the paper references behind each guideline. I think it really helps for quick assessments.
In short, it all depends on what you are sequencing (ChIP-seq, RNA-seq, whole genome sequencing, etc.) and what for (DEG, peaks, alternative splicing, SNP calling, INDEL calling, etc.).
ChIP-seq guidelines for mammalian species
For instance, the ENCODE project recommends, for ChIP-seq data, about 10 million mapped reads for datasets with sharp peaks (e.g. transcription factors), while about 20 million are considered adequate for broader peaks such as those obtained from histone modifications.
RNA-seq guidelines for mammalian species
Similarly, if you are dealing with RNA-seq data, you might consider doing a differential gene expression analysis, for which you would need about 10M-25M reads. Many more are needed for alternative splicing and allele-specific expression analysis: around 50M to 100M reads (Liu Y. et al., 2013; 2014).
Coverage guidelines for genotyping, SNP, SNV and INDEL calling
When it comes to genotype, SNP and INDEL calling, the guidelines are often in terms of coverage. Here the coverage means the following:
coverage = (number of reads x read length) / target size
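As a quick illustration, the formula can be written as a small Python sketch, together with its inverse (how many reads are needed for a given coverage). The function names and the example numbers (150bp reads, a 3Gb genome, a 100Mb genome, 30X) are my own assumptions for illustration, not values taken from any guideline.

```python
import math

def coverage(n_reads, read_length, target_size):
    """Average coverage: total sequenced bases divided by target size.

    n_reads counts individual reads, so a paired-end fragment
    contributes two reads.
    """
    return n_reads * read_length / target_size

def reads_needed(target_coverage, read_length, target_size):
    """Invert the formula: reads required to reach a given average coverage."""
    return math.ceil(target_coverage * target_size / read_length)

# Hypothetical example: 600M reads of 150bp on a ~3Gb genome
print(coverage(600_000_000, 150, 3_000_000_000))   # 30.0

# Reads needed for 30X on a ~3Gb genome versus a 100Mb genome
print(reads_needed(30, 150, 3_000_000_000))        # 600000000
print(reads_needed(30, 150, 100_000_000))          # 20000000
```

Note that this is an average only: it says nothing about how evenly the reads are spread over the target.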
A few things to notice:
- Genome size: Firstly, while the coverage recommendations should apply across species, the read guidelines mentioned above only apply to mammalian species with genome sizes of ~3Gb. You can scale the number of reads proportionately to the genome size you are working with.
- Genome complexity: Secondly, it is also important to notice that even when all coverage guidelines are met, your ability to call variants will be affected by the breadth and uniformity of the read coverage. For instance, with exome sequencing not all targeted regions are captured with the same efficiency, which introduces some variability in the uniformity of coverage. The efficiency of the capture can be affected by genome complexity, repetitive elements, GC content, etc. There is a good blog post by EdgeBio discussing issues with exome coverage and providing some advice.
- Number of replicates: Thirdly, for differential gene expression (DEG) analysis, sequencing more biological replicates will be more valuable than sequencing deeper. Liu Y. et al., 2014 show this very clearly in Figure 1B of their paper. At 10M reads, an increase in biological replicates from 2 to 7 increases the power from ~45% to 100%, whereas increasing sequencing reads from 10M to 30M has much smaller returns. Adding replicates works for DEG analysis because we are essentially capturing biological variability between conditions, so having more replicates helps us distinguish true biological signal from noise, even at lower depths.
Having said this, sequencing deeper is valuable and not replaceable by more replicates for particular cases such as for detecting rare transcripts (transcriptomes) or rare genomes (metagenomes), or for SNP/INDEL calling or genotyping.
- Read length: Fourthly, how important is read length when considering those guidelines? Read length is already taken into account in the X coverage calculation (see formula above). That makes sense: the longer the reads, the more bases are sequenced, and so the higher the coverage for the same number of reads. However, some guidelines are given in terms of millions of reads sequenced. For instance, Liu et al., 2014 refer to 10M-25M reads for DEG analysis and performed their study with 50bp single-end RNA-seq reads.
We know that longer reads are particularly helpful for more accurate read mapping, because they reduce ambiguity in assigning reads, and for spanning gene fusions or splicing events. I think that the benefits of sequencing longer reads (50bp or more?) will ultimately improve the power of any analysis based on read counts, but I haven’t yet seen an analysis showing exactly how read length and read type (single-end, paired-end) affect those guidelines.
- Other technical variations: The same argument holds for other technical variations. For instance, coverage and read requirements can differ for sequencing instruments or methodologies that introduce different error rates.
- Saturation analysis: Finally, the values described are meant to serve as guidelines only. To determine the necessary number of sequenced reads in a more accurate way, it is advised to perform a saturation analysis. For ChIP-seq experiments the method described in this paper by Zuo, C. and Keles, S. (2012) is a good starting point.
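To give a feel for what a saturation analysis does, here is a minimal, generic sketch in Python. It is not the Zuo and Keles method: it simply subsamples reads at increasing fractions and counts how many features (here, genes) are detected with at least a minimum number of reads. The toy data, the `min_reads` threshold and the function name are assumptions made up for illustration. In a real analysis the reads would come from your alignments, and the detection step would be your actual peak or gene caller.

```python
import random
from collections import Counter

def saturation_curve(read_genes, fractions, min_reads=5, seed=0):
    """Return (n_reads, n_genes_detected) for nested subsamples of the reads.

    read_genes: one gene identifier per mapped read.
    A gene counts as detected when it has at least min_reads reads.
    """
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)  # shuffle once so each subsample contains the previous one
    curve = []
    for frac in sorted(fractions):
        n = int(len(shuffled) * frac)
        counts = Counter(shuffled[:n])
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((n, detected))
    return curve

# Toy data (hypothetical): 200 genes with strongly skewed abundances
rng = random.Random(42)
genes = [f"gene{i}" for i in range(200)]
weights = [1.0 / (i + 1) for i in range(200)]
reads = rng.choices(genes, weights=weights, k=20000)

curve = saturation_curve(reads, [0.1, 0.25, 0.5, 0.75, 1.0])
for n_reads, n_genes in curve:
    print(n_reads, n_genes)
```

If the number of detected features is still climbing at the full read count, the library has probably not reached saturation and deeper sequencing may still add information. If the curve has flattened, additional reads are likely to bring little.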