Removing low count genes for RNA-seq downstream analysis

Filtering low-expression genes is common practice in the analysis of RNA-seq data, for several reasons. From a biological point of view, genes that are not expressed at a biologically meaningful level in any condition are not of interest for the detection of differentially expressed genes (DEGs) and are therefore best ignored. Filtering can also increase the total number of DEGs that survive correction for multiple testing, improving both the sensitivity and the precision of DEG detection (Bourgon et al., 2010).
In addition, genes/transcripts with read counts in the 0-10 range are generally considered artifacts or ‘noise’. From a statistical point of view, removing low-count genes allows the mean-variance relationship in the data to be estimated with greater reliability (Law et al., 2016).

Methods for Filtering Low-Expression Genes in R

Let's load an example dataset first:

library(TCGAbiolinks)
data <- TCGAquery_recount2(project="GTEX", tissue="uterus")
x <- data[[1]]
x
class: RangedSummarizedExperiment 
dim: 58037 90 
metadata(0):
assays(1): counts
rownames(58037): ENSG00000000003.14 ENSG00000000005.5 ...
  ENSG00000283698.1 ENSG00000283699.1
rowData names(3): gene_id bp_length symbol
colnames(90): SRR808704 SRR1499584 ... SRR1383237 SRR1486215
colData names(82): project sample ... title characteristics

All datasets will include a mix of genes that are expressed and genes that are not. In fact, 1413 genes in this dataset have zero counts across all ninety samples.

table(rowSums(assay(data[[1]])==0)==90)
FALSE  TRUE 
56624  1413

filterByExpr function by edgeR

The filterByExpr function in the edgeR package provides an automatic way to filter genes, while keeping as many genes as possible with worthwhile counts.

Roughly speaking, the strategy keeps genes that have at least min.count reads in a worthwhile number of samples. More precisely, the filtering keeps
genes that have count-per-million (CPM) above _k_ in _n_ samples, where _k_ is determined by min.count and by the sample library
sizes and _n_ is chosen according to the minimum group sample size. The use of CPM values rather than raw read counts avoids giving preference to samples with large library sizes.
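The rule can be sketched in base R on a toy matrix (illustrative only — the counts and cutoffs below are made up, and filterByExpr itself derives _k_ and _n_ from the data):

```r
## toy counts: 4 genes x 3 samples (made-up values)
counts <- matrix(c( 0,  0,  1,
                   50, 60, 55,
                    5,200,  3,
                   12, 15, 20),
                 nrow = 4, byrow = TRUE)

lib.size <- colSums(counts)
cpm <- t(t(counts) / lib.size * 1e6)   # counts per million, per sample

k <- 1e5   # toy CPM cutoff (plays the role of k)
n <- 2     # keep genes with CPM above k in at least n samples
keep <- rowSums(cpm > k) >= n
keep
## FALSE  TRUE FALSE  TRUE
```

Gene 1 is essentially unexpressed, gene 3 only spikes in one sample, so both are dropped; genes 2 and 4 clear the cutoff in enough samples to be kept.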

In this case, I chose min.count=10, and since there are no groups, filterByExpr automatically comes up with a value for _n_ that is “always greater than 70% of the smallest group size.” The documentation is not very clear here, so I had a look at the code and found that _n_ = 66:

## part of the filterByExpr source code
## MinSampleSize starts as the smallest group size;
## with no groups here, that is all 90 samples
large.n = 10    # default
min.prop = 0.7  # default
MinSampleSize <- large.n + (MinSampleSize - large.n) * min.prop
MinSampleSize
## 66

Now, let's look at the complete code to filter low-expressed genes:

keep.exprs <- filterByExpr(assay(x), min.count=10)
filt1 <- x[keep.exprs,]
dim(filt1)

Out of 58037 genes, we kept 33937 genes (58%):

## [1] 33937    90

My own function

To make things more complicated, I will introduce my own function! The function shown below keeps rows (transcripts, genes) that have at least min.count reads in a minimum proportion of samples, given by N. By default, a row evaluates to ‘TRUE’ if >=90% of the samples (N=0.9) have a count-per-million (CPM) above _k_, where _k_ is determined by the default value of min.count=10 and by the sample library sizes.

selectGenes <- function(counts, min.count=10, N=0.90){

  lib.size <- colSums(counts)
  MedianLibSize <- median(lib.size)
  CPM.Cutoff <- min.count / MedianLibSize*1e6
  CPM <- edgeR::cpm(counts,lib.size=lib.size)

  min.samples <- round(N * ncol(counts))

  f1 <- genefilter::kOverA(min.samples, CPM.Cutoff)
  flist <- genefilter::filterfun(f1)
  keep <- genefilter::genefilter(CPM, flist)

  ## the same as:
  #keep <- apply(CPM, 1, function(x, n = min.samples){
  #  t = sum(x >= CPM.Cutoff) >= n
  #  t
  #})

  return(keep)
}
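As a quick sanity check, the kOverA rule used above boils down to the commented-out apply: count how many samples clear the CPM cutoff and compare against min.samples. A self-contained toy example (the CPM values below are made up):

```r
## toy CPM matrix: 3 genes x 5 samples (made-up values)
CPM <- matrix(c(12, 15, 11, 13, 14,
                 2,  3,  1,  2,  2,
                12,  1, 13, 14, 15),
              nrow = 3, byrow = TRUE)

CPM.Cutoff  <- 10
min.samples <- round(0.9 * ncol(CPM))  # N = 0.9 of 5 samples

## same logic as genefilter::kOverA(min.samples, CPM.Cutoff)
keep <- apply(CPM, 1, function(x) sum(x >= CPM.Cutoff) >= min.samples)
keep
## TRUE FALSE  TRUE
```

Gene 1 passes the cutoff in 5/5 samples and gene 3 in 4/5, so both are kept; gene 2 never passes and is dropped.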

keep.exprs <- selectGenes(assay(x), min.count=10, N=0.90)
myFilt <- x[keep.exprs,]
dim(myFilt)

Out of 58037 genes, we kept 30342 genes (52%):

## [1] 30342    90

We can reproduce the results given by filterByExpr if we consider that _n_ = 66:

keep.exprs <- selectGenes(assay(x), min.count=10, N=66/ncol(x))
sum(keep.exprs)
## [1] 33937

TCGAanalyze_Filtering function by TCGAbiolinks

TCGAanalyze_Filtering allows the user to filter genes/transcripts using two different methods:

  • method == “quantile”: filters out those genes whose mean across all samples is smaller than the threshold. The threshold is defined as the quantile of the rowMeans, qnt.cut = 0.25 (by default the 25% quantile), across all samples.

datNorm <- recount::scale_counts(x, by = "auc", round = FALSE)
filt2 <- TCGAanalyze_Filtering(assay(datNorm), method = "quantile")
dim(filt2)

Out of 58037 genes, we kept 43527 genes (75%):

## [1] 43527    90
    
  • method == “varFilter”: filters out those genes whose IQR across all samples is smaller than the threshold. The threshold is defined as the quantile of the rowIQRs, var.cutoff = 0.75 (by default the 75% quantile), across all samples. The function IQR is used by default, but any other function can be used by changing the parameter var.func = IQR.

datNorm <- recount::scale_counts(x, by = "auc", round = FALSE)
filt3 <- TCGAanalyze_Filtering(assay(datNorm), method = "varFilter")
dim(filt3)

This method is way more restrictive, filtering out the majority of the genes:

## [1] 14509    90
    
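Both rules are simple enough to sketch in base R (an illustrative toy example — TCGAanalyze_Filtering applies them to the normalized assay, and the Poisson toy matrix below is made up):

```r
set.seed(1)
counts <- matrix(rpois(100 * 10, lambda = 5), nrow = 100)  # toy 100-gene matrix

## method == "quantile": keep genes whose mean exceeds the
## qnt.cut quantile of all row means
means <- rowMeans(counts)
keep.quantile <- means > quantile(means, 0.25)

## method == "varFilter": keep genes whose IQR exceeds the
## var.cutoff quantile of all row IQRs
iqrs <- apply(counts, 1, IQR)
keep.var <- iqrs > quantile(iqrs, 0.75)

sum(keep.quantile)  # roughly 75% of the genes survive
sum(keep.var)       # far fewer survive the variance filter
```

This makes the difference in stringency obvious: by construction the variance filter keeps at most 25% of the genes, while the quantile filter keeps roughly 75%.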

Using genefilter

Actually, the TCGAanalyze_Filtering varFilter method is a wrapper around genefilter. We can recapitulate the previous example by calling genefilter::varFilter directly:

filt4 <- genefilter::varFilter(assay(datNorm), var.func = IQR, 
            var.cutoff = 0.75, filterByQuantile = TRUE)
dim(filt4)

## [1] 14509    90
    

A more relaxed threshold can be used by doing:

filt5 <- genefilter::varFilter(assay(datNorm), var.func = IQR, 
            var.cutoff = 0.25, filterByQuantile = TRUE)
dim(filt5)

## [1] 43462    90
    

Which way is best?

Here, I showed four different ways of filtering out low-expressed genes (filterByExpr, selectGenes, TCGAanalyze_Filtering/quantile, varFilter).
There is no universally optimal filtering threshold for all pipelines, as shown in previous comparative studies (Sha et al., 2016; Bourgon et al., 2010). Both argued that the chosen method should be the one that maximizes the number of DEGs.

In many cases I am not looking for DEGs, so I tend to always use the same approach, the selectGenes function shown above.

We can compare, for a given sample, the distribution of log-CPM values. The plot below shows that a sizeable proportion of genes in the “SRR808704” sample have very low log-CPM values.

library(edgeR)

keep.exprs <- selectGenes(assay(x), min.count=10, N=0.90)
myFilt <- x[keep.exprs,]

par(mfrow=c(1,2))
plot(density(cpm(assay(x), log=TRUE)[,"SRR808704"]), main="before filtering", xlab="log CPM")
plot(density(cpm(assay(myFilt), log=TRUE)[,"SRR808704"]), main="after filtering", xlab="log CPM")
    

9 responses to “Removing low count genes for RNA-seq downstream analysis”

1. Good article, but just out of curiosity: why should I remove genes in the first place, and why not let the independent filtering of DESeq2 do the job?

   Independent filtering: DESeq2 uses the average expression strength of each gene, across all samples, as its filter criterion, and it omits all genes with mean normalized counts below a filtering threshold from multiple testing adjustment.

   “1.3.5 Pre-filtering
   While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are no reads or nearly no reads, we reduce the memory size of the dds data object and we increase the speed of the transformation and testing functions within DESeq2.”

   Let's say I do the filtering to improve the speed and memory footprint of the job; then it can be achieved by only removing genes with a row sum of zero.
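   In base R, that minimal pre-filter is just a one-liner (a toy sketch; the counts matrix below is made up):

```r
## toy counts: 4 genes x 3 samples; genes 1 and 3 have zero counts everywhere
counts <- matrix(c(0, 0, 0,
                   5, 0, 2,
                   0, 0, 0,
                   9, 8, 7), nrow = 4, byrow = TRUE)

keep <- rowSums(counts) > 0          # drop all-zero rows only
filtered <- counts[keep, , drop = FALSE]
nrow(filtered)
## 2
```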


    2. Hi Ines,

      Thanks for your great post and for providing your function “selectGenes”. However, I still have a concern regarding that. I was always wondering why genes are filtered out regardless of which sample and which group they belong to! Consider a study with two conditions (diseased vs normal) with 4 samples in each group and the following read counts for a single gene.
      Normal: 30 6 32 28
      Diseased: 4 6 3 5
In such a case, selectGenes and many other functions would remove this gene. However, the gene is apparently underexpressed in diseased samples and should not be removed! Accordingly, I modified your function a little bit so that it accounts for the expression of each gene in each group of samples rather than across the whole study (all samples).

      Please see below, where group.indices accounts for a list of indices (column numbers) for each group with the same length as the number of groups/conditions. For example, if
      colnames(counts) <- normal normal normal diseased diseased diseased
      then
      group.indices <- list(c(1:3), c(4:6))

selectGenes <- function(counts, group.indices, min.count=10, N=0.70){

  if(sum(sapply(group.indices, length)) != ncol(counts)) {
    stop("The cumulative number of samples in group.indices argument should be
         the same as the number of samples (columns) in counts argument",
         call. = FALSE)
  }

  lib.size <- colSums(counts)
  MedianLibSize <- median(lib.size)
  CPM.Cutoff <- min.count / MedianLibSize*1e6
  CPM <- edgeR::cpm(counts, lib.size=lib.size)

  # detect genes that have CPM above the cutoff in at least
  # samples.length samples of any single group
  keep <- vector(mode = "integer")
  for (i in 1:length(group.indices)) {
    samples.length <- round(N * length(group.indices[[i]]))
    temp.keep <- apply(CPM[, group.indices[[i]], drop = FALSE], 1,
                       function(x, n = samples.length){
                         sum(x >= CPM.Cutoff) >= n
                       })
    temp.keep <- which(temp.keep)
    keep <- append(keep, temp.keep)
  }

  keep <- unique(keep)
  return(keep)
}


3. Hi Adrian,
   Thanks for this suggestion, this is a very good idea!
   I will definitely use it myself. You are right that removing genes without considering the groups or contrasts may lead to losing some important genes.
   We should consider this scenario especially if the total sample size (N) is small: with a smaller N, it is more likely that a gene is absent in a large proportion of samples simply because those samples come from the condition where we expect the gene to be down-regulated.
