Somatic tumor-only variant calling can be improved by incorporating a UMI per-strand on the source molecule.  You can also perform Duplex Sequencing (see Kennedy et al 2014) but I will go into that in a later post.   Below I will detail the process as well as give you a command line tool that automates all the steps (for the impatient, here it is).

If you are not familiar with adding in a unique molecular identifier (UMI), the basic idea is to have a unique synthetic DNA sequence to identify unique library molecules.  There have been a few posters (and probably more than I am not aware of) that can help you understand the setup better:

There are two possible ways to use this information:

  1. Improve PCR duplicate detection.
  2. Correcting sequencing errors via consensus calling reads that have the same unique library (source) molecule.

There’s really no reason not to do #2 if you already have UMIs.  We can then feed in the consensus reads into a somatic tumor-only variant caller to see the results.

The first requirement is to have a BAM file with the single UMI for each read stored in a SAM tag (i.e. RX).  You’ll need to do this on your own depending on the data you have been given, but below I list a few options:

  1. Illumina Sequencing Run Folder: you demultiplex AND extract your UMI tags all in one step, as described previously in my post: Sample Demultiplexing an Illumina Sequencing Run.
  2. fgbio’s ExtractUmisFromBam: you have an unmapped BAM file for a single sample with the UMIs within the reads themselves.
  3. fgbio’s FastqToBam: you have one or more FASTQs (ex. read one, read two, index one, index two) for a single sample with the UMIs within one of the FASTQs (ex. index read one).
  4. fgbio’s DemuxFastqs: you have one or more FASTQs (ex. read one, read two, index one, index two) for multiple samples that you need to demultiplex AND extract UMI information.
  5. fgbio’s AnnotateBamWithUmis: you have a FASTQ with the UMIs and a BAM file for a single sample with the sample/template bases.

Now we should have a BAM for a single sample with the UMI per-read stored in the RX tag.  There are now three major steps:

  1. Map the raw reads (with bwa-mem), duplicate mark them (optional, with Picard’s MarkDuplicates) , and group them by UMI (fgbio’s GroupReadsByUmi).
  2. Call consensus reads (fgbio’s CallMolecularConsensusReads), map them (with bwa-mem), then filter them (fgbio’s FilterConsensusReads).
  3. Clip overlapping read pairs (fgbio’s ClipBam), collect metrics (see the various Picard Collect* tools), and make variant calls (with VarDictJava).

The secret sauce here is the grouping of similar UMIs (see Smith et al 2016) and then calling consensus reads (see fgbio’s approach on its wiki).  The idea behind the former is to leverage both mapping location of a read pair (template) along with the UMI to identify reads originating from the same library/source molecule.  The idea behind the latter is to examine reads from the same library/source molecule base-by-base to assess the likelihood of each base in the source molecule and make a maximum likelihood call and generate a corresponding base quality.  When calling consensus reads, there is some logic incorporated to handle indel errors or differences (for example when genotyping simple-tandem-repeats (STRs)).  We can subsequently filter reads by requiring a minimum # of raw reads required to make a call, among other options (see fgbio’s FilterConsensusReads).

This approach can be adapted to call consensus reads on long-read technology (think 10x Genomics or iGenomX).

To run a single command line tool that automates all these steps, check out the SingleStrandUmiSomaticVariantCalling pipeline (source code is here) from my bfx-examples project.







  • Dear Nils Homer,

    I tried to follow the steps you outlined to generate consensus reads, but encountered some problems.

    1. If I extract UMIs using the fgbio’s ExtractUmisFromBam or FastqToBam, how can I preserve the UMIs in the next step (map the raw reads with bwa-mem)? Is it necessary to convert the UMI tagged bam file to fastq files? If it is, how can we preserve the UMIs on the fastq reads? If not, what other way can we use to map the raw reads with the UMI tags?

    2. I also tried to use AnnotateBamWithUmis to add UMIs to the reads in an aligned bam file. RX tags were successfully added to the last field of the bam file. However, when I tried to group the reads with GroupReadsByUmi, an empty bam file was generated. My bam file size should be around 100 MB. There are 200,000 – 400,000 families of UMIs. Do you have any suggestions?

    Thank you very much for your kind advice!


    • 1. Always recommend using FASTQs as intermediates, so to map the reads and preserve the RX tags, I would save the unmapped BAM, then run Picard’s SamToFastq, pipe that to bwa mem, then pipe that back into Picard’s MergeBamAlignment. The latter takes the unmapped BAM as input so that all the header and per-record tags are preserved. This is implemented in the pipeline that I linked to:

      2. I would try approach #1.

    • Hi,

      While doing step 2 ‘Call consensus reads (fgbio’s CallMolecularConsensusReads), map them (with bwa-mem), then filter them (fgbio’s FilterConsensusReads).’, I run into trouble.
      After calling the consesnus reads I am getting bam output like this:

      lib1:AAACACATTT 141 * 0 0 * * 0 0 AATACTGGGTAAATCTGCATATGGTGACTCGGAAGAATCTTCCTGGTGGATTGGTGAGAGTGGCTTGCAGAGATGTTGGTCCCACGTGACTCCCTTTGTG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNMNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN cD:i:3 cE:f:0 RG:Z:cons MI:Z:AAACACATTT cM:i:3 RX:Z:AAACACATTT cd:B:s,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3 ce:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

      SamToFastq fails on this with ‘Illegal mate state: lib1:AAAAAAAAAA
      I guess as multiple reads are named after the same barcodes, it makes SamToFastq think the bam is unsorted. I can extract the reads in a different way for realignment with bwa but It would be nice to have unique read names. Maybe a randomly picked name from reads that were taken for the consensus sequence?
      Not sure whether I am screwing this up or if it is supposed to be this way..


Leave a Reply

Your email address will not be published. Required fields are marked *