Dramatic sequencing error-reduction can be achieved through calling consensus reads that observe the same source molecule. This is extremely important for low-frequency variant detection in somatic samples, as well as genotyping STRs (short-tandem repeats) where high rates of “stutter” can be overcome. I will review the method behind calling consensus reads.
There are two methods I would like to consider:
- Reads observing the same source molecule (ex. see some of IDT’s posters:  )
- In addition to (1), reads that also retain from which strand of a double strand source molecule it was generated (ex. Kennedy et al 2014)
Method (1) Can vastly reduce the raw sequencing errors and further distinguish PCR duplicates from true independent measurements, but are still susceptible to errors introduced in sample preparation, library preparation, target capture. These artifacts include oxidative damage (Costello et al 2013), deamination (Chen et al 2014), and others that are specific to the preparation and technology. I recommend leveraging quality control metrics for these artifacts, such as Picard’s
CollectSequencingArtifactMetrics and fgbio’s
ErrorRateByReadPosition among others. Method (2) can be used to overcome some of these artifacts since errors can be strand-specific, for example oxidative damage and deamination.
I have already covered the chain of bioinformatic tools used for Single Strand UMI Somatic Variant Calling. In a latter set of posts, I will also detail the process and tools behind calling consensus reads from Duplex Sequencing (here’s a preview in fgbio), as well as end-to-end STR genotyping using Duplex Sequencing (with fgstr). For posterity, I have already written up a detailed explanation on fgbio’s wiki, and will leave the details as an exercise to the reader (one of my favorite quotes from Kenneth Lange‘s books that kept me up all night trying to finish a proof).
First, the set of reads that observe the same source molecule are determined using the UMI(s) and genomic coordinates of both ends of a read. The latter gives us the genomic span and strand of the read pair (i.e. template). This can be accomplished using fgbio’s
GroupReadsByUmi. Next, we observe that for each source molecules, all reads map to the same genomic strand and start/stop*, so we can proceed base-by-base to call a consensus read. The asterisk (*) here is to denote that we infer the genomic coordinates of the reads using their “unclipped” position, meaning we factor in any soft-clipping during mapping, which is standard for PCR duplicate detection.
Aside: Some of you may say “Wait! Can’t an indel (insertion/deletion) error cause a read to be ‘out-of-phase’ with the rest of the reads?”. That’s a big problem with sequencing technologies whose errors are predominantly indels, or error modes such as stutter in STR-genotyping. Instead of performing the expensive re-alignment of reads, we can choose the most common alignment and filter out reads that do not agree (not optimal, but works in practice). This “cigar” filtering can cause significant bias in practice for STR-genotyping if we do not have enough coverage of individual source molecules, so we purse a whole other strategy altogether for STR-genotyping (hint: joint consensus calling AND genotyping). Nonetheless, if we expect indel errors to be negligible, this is a reasonable optimization. For more details about this process, see this explanation on fgbio’s wiki.
So lets consider the set of observations for a given base of the source molecule. We have a number of possible events that can introduce an error:
- a sequencing error can occur on an individual read; we use the sequencer’s quality score to estimate this probability.
- an error could be introduce prior to incorporating the UMIs; this would cause all the reads to have the same error (ex. oxidative damage or deamination). We cannot correct these without Duplex Sequencing, where we can leverage the fact that such errors would only occur one one strand and not the other.
- an error could be introduce after to incorporating the UMIs but prior to sequencing; this would cause all the reads after this event to have the same error (ex. during PCR amplification or target capture)*.
The consensus calling model takes as input priors for the last two events. The priors should be set based on empirical analysis, due to these being specific to a sequencing protocol. The asterisk (*) here is to state that we do not model sub-populations of reads that arise from the same event, such as an error introduced into one of the first few cycles of PCR. We make the simplifying assumption that these are independent events. Further examination of your data is warranted to see if this causes any harm.
Next, if we assume a flat prior of observing any of the four DNA bases, we use a simple likelihood model to test each of the four hypotheses (one for each possible DNA bases), and choose the most likely. We can utilize these likelihoods to generate a posterior probability of the given hypotheses given the data. Finally, the posterior probability can be converted into a phred-scaled probability used as the output base-quality.
If we have unique UMI per-source molecule, that’s it. However, if we have a unique UMI per-strand per-source molecule (i.e. Duplex Sequencing), we can go a little farther. As explained here, we generate two consensus reads per-source molecule, one for the “top” strand and one for the “bottom” strand (which UMI denotes the top strand is not important, just that we can partition reads into one strand or the other). At this point, hopefully sequencing error have been almost entirely removed. We can then identify errors such as oxidative damage and deamination by looking at disagreeing base-calls for the same base across the two consensus reads (referred to as single-strand consensus reads). How to make a final call is a matter of preference, but fgbio’s
CallDuplexConsensusReads when the two single-strand consensus reads disagree, it will call use the base-call with higher quality score and set the final quality score as the difference between the two.
The result are very-high quality (single-strand) consensus reads and ultra-high quality (duplex) consensus reads, that can be used in downstream analysis to provide very high accuracy, in some cases needing only a single duplex consensus read to make a low-frequency variant call. In a subsequent post, I will detail some of the issues with obtaining high-quality consensus calls, namely ensuring both a broad sampling of source molecules as well as ensuring high enough coverage of each source molecule, and in the case of Duplex Sequencing, each strand of the source molecule.