Too few (or none) differential expressed genes? A way to fix the null hypothesis in DESeq2

I’ve been using DESeq2 for testing for differential expression between samples. It’s always worked great, but for some datasets it is not working correctly.

I have a DESeqResults object called mydata, and as you can see below, it shows zero differentially expressed genes:


out of 36185 with nonzero total read count
adjusted p-value < 0.1 LFC > 0 (up) : 0, 0%
LFC < 0 (down) : 0, 0%
outliers [1] : 256, 0.71%
low counts [2] : 3, 0.0083%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Lets look at the distribution of p-values obtained from my data and compare it to the DESeq2 example dataset

#DEseq2 Example dataset
dds = DESeqDataSet(airway, design = ~ cell + dex)
dds = dds[ rowSums(counts(dds)) > 1, ]
dds = DESeq(dds)
DESeqdata = results(dds)

#plot pvalues
hist(mydata$pvalue, main="my data", xlab="p-values",cex.main=1.5)
hist(DESeqdata$pvalue, main="DESeq example data", xlab="p-values",cex.main=1.5)
title(main="Do you spot the difference?", outer=T, line=-1.4, cex.main=1.5)

The p–value histogram of “correctly” computed p–values will have a rectangular shape with an enrichment of p–values near zero in the histogram, just like the one on the right (obtained from the DESeq example dataset).
For my dataset on the left, the histogram of p-value distribution looks quite different. It has what is called a “hill-shape”

The p-values were obtained with the Wald test. The Wald test is a test for coefficients in a regression model. It is based on a z–score, i.e. a N(0,1) distributed test statistic. The p-values are then adjusted for multiple testing according to the Benjamini–Hochberg rule to control the FDR.

The problem here is that the N(0,1) null distribution of the Wald test was not appropriate for my dataset. In particular, the assumed variance of the null distribution was too high, and therefore we see a hill-shaped histogram of p-values distribution (if it was too low we would see a U-shaped histogram). This issue becomes apparent especially when you have large variations within groups. In this case, estimating only one single dispersion estimate per gene (as DESeq does) is not appropriate

Solving the issue with “empirical null modelling”

We can use the fdrtool package to estimate the correct variance of the null–model from the test statistics. This package returns the estimated null variance, as well as estimates of various other FDR–related quantities and the p–values computed using the estimated null model parameters:

#firstly, remove genes filtered out by independent filtering and the dispersion outliers
mydata <- mydata[ !$padj), ]
mydata <- mydata[ !$pvalue), ]

#secondly, use z–scores returned by DESeq2 as input to fdrtool
corrected <- fdrtool(mydata$stat, statistic= "normal", plot = T)

Step 1... determine cutoff point
Step 2... estimate parameters of null distribution and eta0
Step 3... compute p-values and estimate empirical PDF/CDF
Step 4... compute q-values and local fdr
Step 5... prepare for plotting

lets look at the p-value distribution after fdrtool:

#plot p-values
hist(mydata$pvalue, main="my data before (using DESeq2)", xlab="p-values",cex.main=1.5)
hist(corrected$pval, main="my data now (using fdrtool)", xlab="p-values",cex.main=1.5)
title(main="Fixing overestimation of the variance", outer=T, line=-1.4, cex.main=1.5)

We now identify 110 differentially expressed genes at an FDR of 10%:

table(corrected$qval < 0.1)
35819 110

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s