Somatic tumor-only variant calling can be improved by incorporating a UMI per-strand on the source molecule.  You can also perform Duplex Sequencing (see Kennedy et al 2014) but I will go into that in a later post.   Below I will detail the process as well as give you a command line tool that automates all the steps (for the impatient, here it is).

If you are not familiar with adding in a unique molecular identifier (UMI), the basic idea is to have a unique synthetic DNA sequence to identify unique library molecules.  There have been a few posters (and probably more than I am not aware of) that can help you understand the setup better:

There are two possible ways to use this information:

  1. Improve PCR duplicate detection.
  2. Correcting sequencing errors via consensus calling reads that have the same unique library (source) molecule.

There’s really no reason not to do #2 if you already have UMIs.  We can then feed in the consensus reads into a somatic tumor-only variant caller to see the results.

The first requirement is to have a BAM file with the single UMI for each read stored in a SAM tag (i.e. RX).  You’ll need to do this on your own depending on the data you have been given, but below I list a few options:

  1. Illumina Sequencing Run Folder: you demultiplex AND extract your UMI tags all in one step, as described previously in my post: Sample Demultiplexing an Illumina Sequencing Run.
  2. fgbio’s ExtractUmisFromBam: you have an unmapped BAM file for a single sample with the UMIs within the reads themselves.
  3. fgbio’s FastqToBam: you have one or more FASTQs (ex. read one, read two, index one, index two) for a single sample with the UMIs within one of the FASTQs (ex. index read one).
  4. fgbio’s DemuxFastqs: you have one or more FASTQs (ex. read one, read two, index one, index two) for multiple samples that you need to demultiplex AND extract UMI information.
  5. fgbio’s AnnotateBamWithUmis: you have a FASTQ with the UMIs and a BAM file for a single sample with the sample/template bases.

Now we should have a BAM for a single sample with the UMI per-read stored in the RX tag.  There are now three major steps:

  1. Map the raw reads (with bwa-mem), duplicate mark them (optional, with Picard’s MarkDuplicates) , and group them by UMI (fgbio’s GroupReadsByUmi).
  2. Call consensus reads (fgbio’s CallMolecularConsensusReads), map them (with bwa-mem), then filter them (fgbio’s FilterConsensusReads).
  3. Clip overlapping read pairs (fgbio’s ClipBam), collect metrics (see the various Picard Collect* tools), and make variant calls (with VarDictJava).

The secret sauce here is the grouping of similar UMIs (see Smith et al 2016) and then calling consensus reads (see fgbio’s approach on its wiki).  The idea behind the former is to leverage both mapping location of a read pair (template) along with the UMI to identify reads originating from the same library/source molecule.  The idea behind the latter is to examine reads from the same library/source molecule base-by-base to assess the likelihood of each base in the source molecule and make a maximum likelihood call and generate a corresponding base quality.  When calling consensus reads, there is some logic incorporated to handle indel errors or differences (for example when genotyping simple-tandem-repeats (STRs)).  We can subsequently filter reads by requiring a minimum # of raw reads required to make a call, among other options (see fgbio’s FilterConsensusReads).

This approach can be adapted to call consensus reads on long-read technology (think 10x Genomics or iGenomX).

To run a single command line tool that automates all these steps, check out the SingleStrandUmiSomaticVariantCalling pipeline (source code is here) from my bfx-examples project.

 

 

 

 

 

11 Comments

  • Dear Nils Homer,

    I tried to follow the steps you outlined to generate consensus reads, but encountered some problems.

    1. If I extract UMIs using the fgbio’s ExtractUmisFromBam or FastqToBam, how can I preserve the UMIs in the next step (map the raw reads with bwa-mem)? Is it necessary to convert the UMI tagged bam file to fastq files? If it is, how can we preserve the UMIs on the fastq reads? If not, what other way can we use to map the raw reads with the UMI tags?

    2. I also tried to use AnnotateBamWithUmis to add UMIs to the reads in an aligned bam file. RX tags were successfully added to the last field of the bam file. However, when I tried to group the reads with GroupReadsByUmi, an empty bam file was generated. My bam file size should be around 100 MB. There are 200,000 – 400,000 families of UMIs. Do you have any suggestions?

    Thank you very much for your kind advice!

    Shisheng

    • 1. Always recommend using FASTQs as intermediates, so to map the reads and preserve the RX tags, I would save the unmapped BAM, then run Picard’s SamToFastq, pipe that to bwa mem, then pipe that back into Picard’s MergeBamAlignment. The latter takes the unmapped BAM as input so that all the header and per-record tags are preserved. This is implemented in the pipeline that I linked to: https://github.com/nh13/bfx-examples/blob/master/pipelines/src/main/scala/com/nilshomer/bfx/examples/pipelines/SingleStrandUmiSomaticVariantCalling.scala#L118.

      2. I would try approach #1.

    • Hi,

      While doing step 2 ‘Call consensus reads (fgbio’s CallMolecularConsensusReads), map them (with bwa-mem), then filter them (fgbio’s FilterConsensusReads).’, I run into trouble.
      After calling the consesnus reads I am getting bam output like this:

      lib1:AAACACATTT 141 * 0 0 * * 0 0 AATACTGGGTAAATCTGCATATGGTGACTCGGAAGAATCTTCCTGGTGGATTGGTGAGAGTGGCTTGCAGAGATGTTGGTCCCACGTGACTCCCTTTGTG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNMNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN cD:i:3 cE:f:0 RG:Z:cons MI:Z:AAACACATTT cM:i:3 RX:Z:AAACACATTT cd:B:s,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3 ce:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

      SamToFastq fails on this with ‘Illegal mate state: lib1:AAAAAAAAAA
      I guess as multiple reads are named after the same barcodes, it makes SamToFastq think the bam is unsorted. I can extract the reads in a different way for realignment with bwa but It would be nice to have unique read names. Maybe a randomly picked name from reads that were taken for the consensus sequence?
      Not sure whether I am screwing this up or if it is supposed to be this way..

      Thanks,
      Jan

  • Hi Nils,

    We are currently working with the swift MIDs, and they recommend using your program, fgbio for analytics: (Attached, top of page 6)
    https://swiftbiosci.com/protected-content/mid-tech-note/
    This is how they list what should be done:
    1. Align FASTQ reads (BWA)
    2. Connect MID (I2 FASTQ) to bam alignments: fgbio, AnnotateBamWithUmis
    3. Optional: Determine gain in data by using Picard MarkDuplicate with option BARCODE_TAG=RX. Output should show lower rate of “% Duplication” than without using MIDs.
    4. Continue with following steps to get MID base consensus sequence using fgbio & Picard Tools:
    a. RevertSam: java -jar $picard RevertSam I=$OUT.midbased.markdup.bam O=$OUT.sanitised.bam SANITIZE=true
    REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false
    b. Fgbio SetMateInformation
    c. Fgbio GroupReadsByUmi
    d. Fgbio CallMolecularConsensusReads
    5. Consensus Bam does not contain any mapping information, hence use BamToFASTQ to generate consensus Fastq reads which can then be aligned to reference followed by variant calling.

    Currently we want the final file to be a BAM/SAM file that has been fully deduplicated, umi’s are grouped/clustered, and the samples are demultiplexed. Once done, we are going to use samtools and qualmaps for analysis on % coverage with respect to the host for the sample data. Our inputs include: Read1, Read2, Index1, and Index2. Index1 is the sample barcode (6B) and index 2 is a i7 adapter with the final 9bp as the umi (8S9M).

    This led to a couple questions:
    1. I noticed that fgbio has a program name “DemuxFastqs”, which generates seperate sample bam files from the input files. Will these BAM files be considered mapped or unmapped? (Do they need to be reverted to unmapped using RevertSam?)

    2. If it does create an unmapped BAM file, here is my current idea on how to create the final BAM file:
    a. Run DemuxFastqs to get uBAM files for each sample with UMI tag attached to read.
    b. Use SetMateInformation, then GroupReadsByUmi, and into CallMolecularConsensusReads.
    c. Then FilterConsensusReads, and then convert that file to fastq.
    d. Align one more time to create BAM file for each sample. Each sample then should have umis properly grouped, with reads corrected and filtered. This file will be used in QualiMaps and samtools to check for sample coverage with respect to host.

    If I was to run this, do you think this would work?

    Sorry for the long message, I am relatively new to this field, so there’s been a lot to learn.

    Thanks!
    Max

    • Hey Max,

      I’ll answer briefly here, but please post further questions on the fgbio github page for others to benefit from the discussion: https://github.com/fulcrumgenomics/fgbio/issues

      1. DemuxFastqs demultiplexes FASTQs as input and outputs either/both FASTQs and BAMs (per-sample). Since the input does not contain any mapping information, neither will the output.
      2. This looks like sane from a high-level. I would just comment that we use uBAMs with Picard’s SamToFastq piped into BWA piped back into Picard’s MergeBamAlignment to do alignment and retain the tags/metadata in uBAMs.

      Thanks for checking this blog out!

  • Hi Nils,

    I am trying to use the CallMolecularConsensusReads tool and am running into a problem. I used the same pre-processing as described above (save the unmapped BAM, then run Picard’s SamToFastq, pipe that to bwa mem, then pipe that back into Picard’s MergeBamAlignment) and everything looks good up until I actually run CallMolecularConsensusReads. Once I do that, I go from .bam files with 162,000+ reads to 48. I should have 4-5,000 unique UMIs, so 48 is much lower than expected. Do you have an idea of where I might be losing so many of my sequences?

    Thanks in advance!

Leave a Reply

Your email address will not be published. Required fields are marked *