I’ve been using DESeq2 for testing for differential expression between samples. It’s always worked great, but for some datasets it is not working correctly.

I have a DESeqResults object called mydata, and as you can see below, it shows zero differentially expressed genes:

summary(mydata) out of 36185 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 0, 0% LFC < 0 (down) : 0, 0% outliers [1] : 256, 0.71% low counts [2] : 3, 0.0083% (mean count < 0) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

Lets look at the distribution of p-values obtained from my data and compare it to the DESeq2 example dataset

#DEseq2 Example dataset library("DESeq2") library("airway") data("airway") dds = DESeqDataSet(airway, design = ~ cell + dex) dds = dds[ rowSums(counts(dds)) > 1, ] dds = DESeq(dds) DESeqdata = results(dds) #plot pvalues par(mfrow=c(1,2),mar=c(4,2,8,2)) hist(mydata$pvalue, main="my data", xlab="p-values",cex.main=1.5) hist(DESeqdata$pvalue, main="DESeq example data", xlab="p-values",cex.main=1.5) title(main="Do you spot the difference?", outer=T, line=-1.4, cex.main=1.5)

The p–value histogram of “correctly” computed p–values will have a rectangular shape with an enrichment of p–values near zero in the histogram, just like the one on the right (obtained from the DESeq example dataset).

For my dataset on the left, the histogram of p-value distribution looks quite different. It has what is called a “hill-shape”

The p-values were obtained with the Wald test. The Wald test is a test for coefficients in a regression model. It is based on a z–score, i.e. a N(0,1) distributed test statistic. The p-values are then adjusted for multiple testing according to the Benjamini–Hochberg rule to control the FDR.

The problem here is that the N(0,1) null distribution of the Wald test was not appropriate for my dataset. In particular, the assumed variance of the null distribution was too high, and therefore we see a hill-shaped histogram of p-values distribution (if it was too low we would see a U-shaped histogram). This issue becomes apparent especially when you have large variations within groups. In this case, estimating only one single dispersion estimate per gene (as DESeq does) is not appropriate

### Solving the issue with “empirical null modelling”

We can use the `fdrtool`

package to estimate the correct variance of the null–model from the test statistics. This package returns the estimated null variance, as well as estimates of various other FDR–related quantities and the p–values computed using the estimated null model parameters:

#firstly, remove genes filtered out by independent filtering and the dispersion outliers mydata <- mydata[ !is.na(mydata$padj), ] mydata <- mydata[ !is.na(mydata$pvalue), ] #secondly, use z–scores returned by DESeq2 as input to fdrtool library(fdrtool) corrected <- fdrtool(mydata$stat, statistic= "normal", plot = T) Step 1... determine cutoff point Step 2... estimate parameters of null distribution and eta0 Step 3... compute p-values and estimate empirical PDF/CDF Step 4... compute q-values and local fdr Step 5... prepare for plotting

lets look at the p-value distribution after fdrtool:

#plot p-values par(mfrow=c(1,2),mar=c(4,2,8,2)) hist(mydata$pvalue, main="my data before (using DESeq2)", xlab="p-values",cex.main=1.5) hist(corrected$pval, main="my data now (using fdrtool)", xlab="p-values",cex.main=1.5) title(main="Fixing overestimation of the variance", outer=T, line=-1.4, cex.main=1.5)

We now identify 110 differentially expressed genes at an FDR of 10%:

table(corrected$qval < 0.1) FALSE TRUE 35819 110