Somatic tumor-only variant calling can be improved by incorporating a UMI per-strand on the source molecule. You can also perform Duplex Sequencing (see Kennedy et al 2014) but I will go into that in a later post. Below I will detail the process as well as give you a command line tool that automates all the steps (for the impatient, here it is).
If you are not familiar with adding in a unique molecular identifier (UMI), the basic idea is to have a unique synthetic DNA sequence to identify unique library molecules. There have been a few posters (and probably more than I am not aware of) that can help you understand the setup better:
- Swift Biosciences at ACMG 2017
- Integrated DNA Technologies -Subclonal mutation profiling – characterization of targeted deep sequencing with molecular tagging
- Multiplexed linear amplification for rapid targeted library construction
There are two possible ways to use this information:
- Improve PCR duplicate detection.
- Correcting sequencing errors via consensus calling reads that have the same unique library (source) molecule.
There’s really no reason not to do #2 if you already have UMIs. We can then feed in the consensus reads into a somatic tumor-only variant caller to see the results.
The first requirement is to have a BAM file with the single UMI for each read stored in a SAM tag (i.e. RX). You’ll need to do this on your own depending on the data you have been given, but below I list a few options:
- Illumina Sequencing Run Folder: you demultiplex AND extract your UMI tags all in one step, as described previously in my post: Sample Demultiplexing an Illumina Sequencing Run.
ExtractUmisFromBam: you have an unmapped BAM file for a single sample with the UMIs within the reads themselves.
FastqToBam: you have one or more FASTQs (ex. read one, read two, index one, index two) for a single sample with the UMIs within one of the FASTQs (ex. index read one).
DemuxFastqs: you have one or more FASTQs (ex. read one, read two, index one, index two) for multiple samples that you need to demultiplex AND extract UMI information.
AnnotateBamWithUmis: you have a FASTQ with the UMIs and a BAM file for a single sample with the sample/template bases.
Now we should have a BAM for a single sample with the UMI per-read stored in the RX tag. There are now three major steps:
- Map the raw reads (with
bwa-mem), duplicate mark them (optional, with Picard’s
MarkDuplicates) , and group them by UMI (fgbio’s
- Call consensus reads (fgbio’s
CallMolecularConsensusReads), map them (with
bwa-mem), then filter them (fgbio’s
- Clip overlapping read pairs (fgbio’s
ClipBam), collect metrics (see the various Picard
Collect*tools), and make variant calls (with
The secret sauce here is the grouping of similar UMIs (see Smith et al 2016) and then calling consensus reads (see fgbio’s approach on its wiki). The idea behind the former is to leverage both mapping location of a read pair (template) along with the UMI to identify reads originating from the same library/source molecule. The idea behind the latter is to examine reads from the same library/source molecule base-by-base to assess the likelihood of each base in the source molecule and make a maximum likelihood call and generate a corresponding base quality. When calling consensus reads, there is some logic incorporated to handle indel errors or differences (for example when genotyping simple-tandem-repeats (STRs)). We can subsequently filter reads by requiring a minimum # of raw reads required to make a call, among other options (see fgbio’s