Calculate nucleotide frequency with Rsamtools pileup

Rsamtools is great! it makes it easy to do sequencing analysis in R giving a very nice interface between R and BAM files produced by tools like samtools, bcftools, and tabix.

Rsamtools pileup

Rsamtools pileup is one of my favorite functions. It is great because it can reduce large amounts of data contained in BAM files to the relevant data for my analysis.

I have created some example ChIP-seq (bam and bam.bai) and SNPs (vcf file) data for demonstration in this post.

First, we need to load the BAM file into R. When loading large BAM files into R we can specify the portion of the genome of interest with the which argument of ScanBamParam. Here I used the specify only one single position of the genome (chr17:685640):

## load bam file using the which argument to ScanBamParam
library(Rsamtools)
bamfile <- "Pol2ChIP_chr17.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges("chr17",
IRanges(start=685640, end=685640)))

Now, we can use the pileup commad to get the pile-up the specified single position at chr17. With default pileup parameters, as follows:

## pileup for on single position at chr17 with default pileup parameters
pileup(bf, scanBamParam=param)

Pileup returns a data frame of the number of alignments overlapping each genomic position. The ‘count’ column shows the number of reads overlapping that position. I specifically chose an heterozygous SNP, and you can see that reads carry either the ‘G’ (93+111 counts) or the ‘T’ (29+24 counts) allele. The ‘A’ allele (1 count) probably represents a mismatch:

  seqnames    pos strand nucleotide count         which_label
1    chr17 685640      -          A     1 chr17:685640-685640
2    chr17 685640      +          G    93 chr17:685640-685640
3    chr17 685640      -          G   111 chr17:685640-685640
4    chr17 685640      +          T    29 chr17:685640-685640
5    chr17 685640      -          T    24 chr17:685640-685640

Use PileupParam to specify filtering criteria

There are a number of arguments available through the PileupParam constructor that enable you to filter the pileup results in many different ways.
Have a look at all available options by doing ?pileup or by checking Rsamtools reference manual.

I like to use a bigger max_depth because I feel that the default option is a little bit to stringent. max_depth refers to the maximum number of overlapping reads considered for each position in the pileup and it is capped at 250 reads by default. This sort of limit is a good idea for memory allocation reasons and, if samples are low coverage, it won’t make a difference.

But, for the example I chose here in this post, it does actually make a difference. This is how you can adjust the max_depth argument:

## pileup for on single position at chr17 with max_depth
p_param <- PileupParam(max_depth=1000)
pileup(bf, scanBamParam=param, pileupParam=p_param)

It gives different counts than before at nucleotides with high coverage, and the “C” nucleotide suddenly appears (go figure!). I checked the read counts manually, and this is the correct number of reads:
(I always set a very large limit to max_depth because of this issue. I appreciate comments below if you have any! thanks!)

  seqnames    pos strand nucleotide count         which_label
1    chr17 685640      -          A     1 chr17:685640-685640
2    chr17 685640      -          C     1 chr17:685640-685640
3    chr17 685640      +          G   153 chr17:685640-685640
4    chr17 685640      -          G   159 chr17:685640-685640
5    chr17 685640      +          T    40 chr17:685640-685640
6    chr17 685640      -          T    31 chr17:685640-685640

Turning off the distinguish_strand may also be useful, depending on your application:

## nucleotide at a genomic position to be included in results.
p_param <- PileupParam(max_depth=1000, distinguish_strand=FALSE)
pileup(bf, scanBamParam=param, pileupParam=p_param)
  seqnames    pos nucleotide count         which_label
1    chr17 685640          A     1 chr17:685640-685640
2    chr17 685640          C     1 chr17:685640-685640
3    chr17 685640          G   312 chr17:685640-685640
4    chr17 685640          T    71 chr17:685640-685640

Any combination of the filtering criteria is possible. Lets say we want a pileup that only counts reads with mapping quality >= 15, and bases with quality >= 10 that appear at least 5 times at each genomic position:

p_param <- PileupParam(max_depth=1000, 
min_nucleotide_depth=5, 
distinguish_strand=FALSE,
min_mapq=15,
min_base_quality=10)
pileup(bf, scanBamParam=param, pileupParam=p_param)

This gives:

  seqnames    pos nucleotide count         which_label
1    chr17 685640          G   231 chr17:685640-685640
2    chr17 685640          T    37 chr17:685640-685640

You can also compute the pileup for several SNPs.
To demonstrate, lets load the example vcf file into R and convert it into a GenomicRanges object:

#Read a VCF file
library(VariantAnnotation)
vcf <- readVcf("vcfExample_chr17.vcf", "hg19")
vcf.ranges <- rowData(vcf)

Now, using the which argument of scanBamParam, you can specify the vcf file as your region of interest:

bamfile <- "Pol2ChIP_chr17.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=vcf.ranges)

the pileup result is a data.frame object of 186 rows:

p_param <- PileupParam(max_depth=1000, ignore_query_Ns=FALSE)
res <- pileup(bf, scanBamParam=param, pileupParam=p_param)
dim(res)

[1] 186   6

head(res)
  seqnames    pos strand nucleotide count         which_label
1    chr17 685640      -          A     1 chr17:685640-685640
2    chr17 685640      -          C     1 chr17:685640-685640
3    chr17 685640      +          G   153 chr17:685640-685640
4    chr17 685640      -          G   159 chr17:685640-685640
5    chr17 685640      +          N     1 chr17:685640-685640
6    chr17 685640      +          T    40 chr17:685640-685640

the plot below, show the counts at each nucleotide position:

plot(res$nucleotide, res$count ,pch=".", log="y", ylab="count (log scale)")

allelecounts

Create a table of A,C,G,T nucleotide frequencies at each SNP position

I have done a simple R function, that converts the pileup output into a table of nucleotide frequencies. Check it out:

pileupFreq <- function(pileupres) {
    nucleotides <- levels(pileupres$nucleotide)
    res <- split(pileupres, pileupres$seqnames)
    res <- lapply(res, function (x) {split(x, x$pos)})
    res <- lapply(res, function (positionsplit) {
        nuctab <- lapply(positionsplit, function(each) {
                        chr = as.character(unique(each$seqnames))
                        pos = as.character(unique(each$pos))
                        tablecounts <- sapply(nucleotides, function (n) {sum(each$count[each$nucleotide == n])})
                        c(chr,pos, tablecounts)
                    })
        nuctab <- data.frame(do.call("rbind", nuctab),stringsAsFactors=F)
        rownames(nuctab) <- NULL
        nuctab
    })
    res <- data.frame(do.call("rbind", res),stringsAsFactors=F)
    rownames(res) <- NULL
    colnames(res) <- c("seqnames","start",levels(pileupres$nucleotide))
    res[3:ncol(res)] <- apply(res[3:ncol(res)], 2, as.numeric)
    res
}

The function creates a table of nucleotide frequencies at each SNP position:

pileupFreq(res)

   seqnames    start  A   C   G   T N = -
1     chr17   685640  1   1 312  71 1 0 0
2     chr17   990890 20   1   9   0 0 0 0
3     chr17  4458790 13   0  65   0 0 0 0
4     chr17  4839149  0   1   0   3 0 0 0
5     chr17  4852463  0  44   0   4 0 0 0
6     chr17  5323269  0   4   0   3 0 0 0
7     chr17  5323631  8   0  11   0 0 0 0
8     chr17  7482779  2  42   1   7 0 0 0
9     chr17  7620907  0   4  21   0 0 0 0
10    chr17  8057572  0  28   1   8 0 0 0
11    chr17 16341867  2   0  10   0 0 0 0
12    chr17 16346757  9   5   0   0 0 0 0
13    chr17 18267039  0   0  88  12 0 0 0
14    chr17 25660189  0   4   0   6 0 0 0
15    chr17 26554775  1  26   0   6 0 0 0
16    chr17 29876421  3   0  13   1 0 0 0
17    chr17 34842521 10   1  51   2 0 0 0
18    chr17 38255594  1   0   6   0 0 0 0
19    chr17 38618104  0  37   0  19 0 0 0
20    chr17 38698666  0   2   0   5 0 0 0
21    chr17 39890876 11 543   7 168 0 0 0
22    chr17 39992584  1  10   0  11 0 0 0
23    chr17 40125263  0   8   0   2 0 0 0
24    chr17 40169960  6   0   5   0 0 0 0
25    chr17 41393295  0  10   2  69 0 0 0
26    chr17 41445145  1   0  17   0 0 0 0
27    chr17 41603882  0   3   0   6 0 0 0
28    chr17 43212963  0   0   2   1 0 0 0
29    chr17 43247930  1   1   0   0 0 0 0
30    chr17 43247957  1   3   0   0 0 0 0
31    chr17 46970259  0  97   1  19 0 0 0
32    chr17 47039132  1  14   0   4 0 0 0
33    chr17 47073949  1   7   0  56 0 0 0
34    chr17 53046209  0   9   0   6 0 0 0
35    chr17 56083934  1   0   4   0 0 0 0
36    chr17 56708979  1   3   0   7 0 0 0
37    chr17 72767435  0   3   0   1 0 0 0
38    chr17 73042682  1   0   7   0 0 0 0
39    chr17 73873394  0   0   5   0 0 0 0
40    chr17 75279309  0   0   2   0 0 0 0
41    chr17 76170735  1   8   1   0 0 0 0
42    chr17 76172691  0   6   0   1 0 0 0
43    chr17 76183043 29   1   3   0 0 0 0
44    chr17 76313609  0   5   1   0 0 0 0
45    chr17 78235460  2   0  16   0 0 0 0
46    chr17 78398243  0   4   1   0 0 0 0
47    chr17 79976425  1   0   9   0 0 0 0
48    chr17 79977970  5   0   9   1 0 0 0
Advertisements

7 responses to “Calculate nucleotide frequency with Rsamtools pileup

    • You mean, computing the allelic counts for several SNPs? Yes.

      library(VariantAnnotation)
      library(Rsamtools)

      #read bamfile
      bamfile <- "Pol2ChIP_chr17.bam"
      bf <- BamFile(bamfile)

      #read list of SNPs (several)
      vcf <- readVcf("vcfExample_chr17.vcf", "hg19")
      vcf.ranges <- rowData(vcf)

      #compute pileup
      p_param <- PileupParam(max_depth=1000, ignore_query_Ns=FALSE)
      res <- pileup(bf, scanBamParam=param, pileupParam=p_param)
      dim(res)

      #create table of nut frequencies
      pileupFreq(res)

      Like

  1. Hi Ines, i can’t load the BAM file into Rstudio. Can you help me with that? Thank you in advance for your time. Very good post!
    Regards

    Like

  2. Hey !! Your code is interesting.
    I have a question about the head of the output :
    seqnames start A C G T N = –
    What does “=” or “-” mean ? Insertion and Deletion ?
    Thanks

    Like

  3. Thanks for the post.
    You can generate a table of nucleotide frequencies in 1 line using the dcast function:
    dcast(res, seqnames+pos ~ nucleotide, value.var=”count”, fun.aggregate=sum)

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s