Should you transform RNA-seq data: Log, VST, voom?

When I am analysing RNA-seq data, there are two RNA-seq specific properties that I always keep in mind:

  1. The presence of extreme values
  2. The mean-variance dependency (also known as heteroscedasticity)

These two properties matter to different degrees depending on the analysis. For instance, if you are doing differential gene expression analysis with DESeq or edgeR, transformation is not an issue: you will use the raw, integer read counts without any transformation, because DESeq and edgeR were developed specifically for RNA-seq data and take care of these properties internally. However, many other applications of RNA-seq were historically developed with microarray data in mind, and are therefore more likely to be influenced by the highly skewed nature of RNA-seq data. For instance, you may want to consider data transformation when building prognostic gene signatures (read Zwiener et al., 2014 for more details) or when building gene co-regulatory networks, especially if using Pearson’s correlation, because its results are heavily affected by outliers (see the sketch below).
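To make that last point concrete, here is a minimal sketch, with simulated counts rather than real data, of how a single shared extreme value can inflate Pearson’s correlation between two otherwise unrelated genes, and how the shifted log tames it:

set.seed(1)
# two unrelated "genes" across 50 samples, simulated as negative binomial counts
g1 <- rnbinom(50, mu = 100, size = 1)
g2 <- rnbinom(50, mu = 100, size = 1)
# one sample carries an extreme count in both genes
g1[50] <- 50000
g2[50] <- 60000

cor(g1, g2)                      # Pearson on raw counts: dominated by the outlier
cor(log2(g1 + 1), log2(g2 + 1))  # after the shifted log: much closer to zero
cor(g1, g2, method = "spearman") # a rank-based alternative is also robust here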

Extreme values and skewed distribution

In the literature it is not unusual to see RNA-seq values represented on the log scale. In fact, the logarithm transformation compresses some of the extreme values. Another frequently used method is the variance-stabilizing transformation (VST). VST for RNA-seq data was proposed by Anders and Huber, 2010 and is implemented in the DESeq package.

I applied both transformations to an RNA-seq dataset obtained from 200 TCGA lung cancer samples (you can download this example dataset from here).

vst <- function(countdata){
  library(DESeq)
  # all samples get a single condition; dispersions are then estimated "blind"
  condition <- factor(rep("Tumour", ncol(countdata)))
  countdata <- newCountDataSet(countdata, condition)
  countdata <- estimateSizeFactors(countdata)
  cdsBlind <- DESeq::estimateDispersions(countdata, method = "blind")
  vstdata <- varianceStabilizingTransformation(cdsBlind)
  return(exprs(vstdata))
}

load("rnaseq_lusc_example_SeqQC.Rda")
data.log2 <- log2(data + 1)  # shifted log: the +1 avoids log2(0)
data.vst <- vst(data)
dim(data)
[1] 15988   200
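
As an aside, the DESeq package used above has since been superseded by DESeq2. A roughly equivalent VST with DESeq2 might look like the sketch below; I have not run it on this dataset, so treat the details as an assumption and check the DESeq2 vignette.

library(DESeq2)
coldata <- data.frame(condition = factor(rep("Tumour", ncol(data))))
# DESeq2 expects integer counts, hence the round()
dds <- DESeqDataSetFromMatrix(countData = round(as.matrix(data)),
                              colData = coldata, design = ~ 1)
# blind = TRUE estimates dispersions without using the design,
# analogous to method = "blind" in the old DESeq
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
data.vst2 <- assay(vsd)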

Taking an arbitrary gene, we can see that the count data follow a skewed distribution with a large dynamic range and a long tail of “extreme values” to the right of the distribution. Both the log and VST tame some of these extreme values.

par(mfrow=c(1,3))
plot(density(as.numeric(data[1,])), main="counts", cex.main=2)
plot(density(as.numeric(data.log2[1,])), main="log2", cex.main=2)
plot(density(as.numeric(data.vst[1,])), main="VST", cex.main=2)

[Figure skewPlot-1: density of one gene’s values on the counts, log2 and VST scales]

Heteroscedasticity

In RNA-seq data, genes with larger average expression have, on average, larger observed variances across samples; that is, they vary in expression from sample to sample more than genes with lower average expression. This phenomenon is known as heteroscedasticity. Heteroscedasticity is the opposite of homoscedasticity and literally means “having different scatter”: the data values are scattered, or spread out, to different extents for different genes. A quick way to check this on the count matrix loaded above is sketched below.
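
For a quick check that needs no special package, you can compute the per-gene mean and standard deviation directly on the count matrix:

# per-gene mean and standard deviation across the 200 samples, on raw counts
gene.mean <- rowMeans(data)
gene.sd <- apply(data, 1, sd)
plot(rank(gene.mean), gene.sd, pch = ".",
     xlab = "rank of per-gene mean", ylab = "per-gene sd")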

This effect is clearly visible when we plot the per-gene standard deviation (taken across the 200 samples) against the rank of the average expression. The plot below was made with the meanSdPlot function from the vsn package and demonstrates the mean-variance dependency clearly. After the log transformation the variances are still unequal across genes, especially for the lower counts (which show an elevated sd). After VST the sd becomes more constant along the whole dynamic range, although the variances are still not equal for all genes.

par(mfrow=c(1,3))
library(vsn)
meanSdPlot(as.matrix(data),ylim=c(0,8e3),main="counts")
meanSdPlot(as.matrix(data.log2), ylim = c(0,2.5),main="log2")
meanSdPlot(as.matrix(data.vst), ylim = c(0,2.5),main="VST")

Log vs. VST: Which one would you choose?

It depends! One thing I usually do is use the log transformation (I mean the shifted logarithm log2(n + 1)) when I want a quick visual inspection of my data. I also do a lot of network inference from RNA-seq data, and for that I use VST.
When deciding between the log and the VST there are a few things to consider:

  1. After the log transformation there are fewer extreme values than in the untransformed data, but the variances are still unequal
  2. After VST the per-gene standard deviation becomes more constant along the whole dynamic range, but note that the variances are still not equal for all genes
  3. An additional problem of the log2 transformation is that the log2 of zero is undefined (minus infinity)! To avoid taking the logarithm of zero, it is common to add a pseudo count of 1 prior to taking the log. And, of course, we have to assume that adding 1 does not bias the low non-zero counts much
  4. For lower counts, the log transformation is very dramatic, whereas VST gives less variable values across the whole dynamic range (see the plot below).
condition <- factor(rep("Tumour", ncol(data)))
countdata <- newCountDataSet(data, condition)
countdata <- estimateSizeFactors(countdata)
px <- counts(countdata)[, 1]
ord <- order(px)
# this code is based on http://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

# plot the VST and f(n)=log2 transformations (x-axis shows the RSEM counts)
matplot(px[ord], cbind(data.vst[, 1], log2(px))[ord, ],
        type = "l", lty = 1, col = c("blue", "black"),
        xlab = "n", ylab = "f(n)", xlim = c(0, 1000), ylim = c(0, 10))
legend("bottomright", legend = c(expression("VST"), expression(log[2])),
       fill = c("blue", "black"))

[Figure distplot-1: VST (blue) and log2 (black) as a function of the counts n]

In summary, both are popular transformation techniques, and the decision depends on your application. The log is simple and easy to interpret, but it cannot take the log of zero and may not correct for heteroscedasticity completely. VST may be the preferred solution for your application. If you want to read more on this issue, there is a paper by Isabella Zwiener et al., PLOS ONE, 2014 that discusses the impact of the log, VST and many other transformation techniques in multivariable modeling.

Voom

This week I was discussing this issue with a colleague of mine, Oscar Rueda, who pointed me to the voom transformation (from the limma package), published by Charity Law et al., Genome Biology, 2014.

From the limma package vignette:

The voom transformation is applied to the read counts. This converts the counts to log-counts per million with associated precision weights. After this, the RNA-seq data can be analyzed as if it was microarray data. This means for example that any of the linear modelling or gene set testing methods in the limma package can be applied to RNA-seq data.

— You might want to try out voom too! I will! 🙂
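
I have not tried voom on this dataset yet, but based on the limma and edgeR documentation a basic call would look roughly like the sketch below. It reuses the count matrix data from above; with no design matrix given, voom treats the samples as replicates.

library(limma)
library(edgeR)
dge <- DGEList(counts = data)  # edgeR container for the raw counts
dge <- calcNormFactors(dge)    # TMM normalisation factors
v <- voom(dge, plot = TRUE)    # log2-counts-per-million plus precision weights
# v$E holds the transformed values; v$weights the per-observation weights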


12 responses to “Should you transform RNA-seq data: Log, VST, voom?”

  1. Hi, thank you for this post, it was really helpful for understanding the transformations. I have a question about your data: is this the level 3 data from TCGA? Are these the raw_counts from the genes.results files, or the normalized counts from genes.normalized_results?

    Thanks

    Oscar


  2. Hello, thank you for this post! 🙂 I have been looking at the same paper you were referring to, and it was hard to decide which transformation method to follow; this cleared up some of the concepts for me. Can you kindly explain why I should take the raw counts and not the normalized ones? The normalisation is more to do with the length of the gene, is it not?

    thank you,
    Srividya


  3. I want to apply quartile normalization to my RNA-seq count data; should I add the +1 before or after the log2 transformation? I am thinking that if I add 1 after normalization it wouldn’t make sense, as some normalized read counts can be really small (e.g. 0.0001), so log2(0.0001) versus log2(1.0001) would be a huge difference.


    • The +1 makes sense when you have raw counts because it does not affect the distribution of the counts: raw counts can go from 0 to thousands, and the difference between 0 and 1 read is really small, so adding 1 barely changes the distribution. In your case, you need to choose an appropriate pseudo count. I have seen papers where +0.01 is applied after normalising with RPKM. To find a better pseudo count for your normalised data, plot the density of the values before and after adding the pseudo count, e.g. plot(density(x)), plot(density(x + 0.01)) and plot(density(x + 1)), and see how the density distribution changes for the lower counts. I am guessing 0.01 would be more appropriate for you, but check before you do it. For instance, a minimal sketch, with x standing in for your vector of normalised values (simulated here just for illustration):
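
      set.seed(1)
      x <- rexp(10000, rate = 0.01)  # skewed positive values, standing in for normalised counts
      par(mfrow = c(1, 3))
      plot(density(x), main = "no pseudo count")
      plot(density(log2(x + 0.01)), main = "log2(x + 0.01)")
      plot(density(log2(x + 1)), main = "log2(x + 1)")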


  4. Are any of these transformations corrected for gene length? Say my raw counts come from featureCounts or HTSeq etc. Can I look at my transformed data (rlog, vst, voom etc) and say that one gene is more expressed than another?


  5. Thanks for your post, I found it really helpful! I was wondering if you could help me 🙂 I currently have my RNA-seq output in the form of FPKM values; however, they’re not normally distributed. I want to transform them with the natural logarithm so that I can run parametric statistics on candidate transcripts. Do you have any comments on doing it this way? Thanks!


  6. Thank you for sharing a detailed comparative insight into RNA-seq data transformation using these commonly used methods.

