Sample demultiplexing an Illumina sequencing run can be a real pain if you’re the one having to do it.
Typically you would just let the instrument do it for you, or perhaps even use Illumina’s
bcl2fastq, but sometimes you don’t have that choice. You may only have access to the run folder, loathe making FASTQ files (more to come on that in a another blog post), or have a complex read structure (with sample and/or molecular barcodes, template bases, and bases to skip). Lets talk about a few other options than
bcl2fastq, and at the end, I am going to give you a command line tool that automates all the steps (for the impatient, here it is).
The first option applies if the sequencer gives you FASTQ filess that contain reads from all the samples. You may have turned off the on-instrument sample demultiplexing, for example when the sample barcodes are inline in the read itself. You may go back to the
bcl2fastq user guide, and stare intently at the
--use-bases-mask option, but the other option is to use fgbio‘s DemuxFastq tool on the FASTQs that include sample barcodes. If you are familiar with “read structures”, this may make more sense to you, and I’ll explain “read structures” down below.
The other option I would like to discuss is using both the fgbio and Picard suite of tools to perform sample demultiplexing and output BAM files (no FASTQs please). The Picard suite of tools was developed at the Broad Institute, where I previously led the software team responsible for this suite of tools as well as all the software for their genome sequencing center, among other things. These tools have been hardened through processing millions of samples and support any sequencer type, even the new NovaSeq!
The idea is to do the following:
CheckIlluminaDirectoryto ensure that the sequencing run folder has all the necessary files present.
ExtractBasecallingParamsForPicardto create the necessary configuration files for steps 3-4 below.
ExtractIlluminaBarcodesto assign each read to a sample barcode, if possible.
IlluminaBasecallsToSamto use the output of step 3 to create a BAM file per sample.
Steps 2-4 are typically run per-lane, so you’ll need to merge the per-lane-per-sample BAM files.
A number of metrics are useful to collect on the sequencing run and output BAM files and can be collected by doing the following:
CollectIlluminaLaneMetricswhich outputs both
IlluminaPhasingMetrics, and is mainly used to get the cluster density (in units of # of clusters per mm^2).
IlluminaBasecallingMetricsand gives per-sample-per-lane yield (# of reads and bases) and density (# of clusters per tile).
QualityYieldMetricsand gives per-sample yield (# of reads and bases) and quality (how many bases had base quality Q20 ( or Q30) or higher?).
Step 3 is typically run per-sample, so on the sample BAM merged across lanes.
The two critical inputs are a Sample Sheet and a Read Structure. The former should be familiar to any sequencing laboratory running an Illumina instrument, so go be social and talk to your lab compatriots. The latter is something used mainly by a few key fgbio and Picard tools that I detail below.
A Read Structure is a sequence of
<number><operator> pairs or segments where, optionally, the last segment in the string is allowed to use
+ instead of a number for its length. The
+ means translates to whatever bases are left after the other segments are processed and can be thought of as meaning
[0..infinity]. This latter feature is not supported by Picard, only fgbio.
Four kinds of operator are supported:
Tor Template: the bases in the segment are reads of template (e.g. genomic dna, rna, etc.)
Bor Sample Barcode: the bases in the segment are an index sequence used to identify the sample being sequenced
Mor Molecular Barcode: the bases in the segment are an index sequence used to identify the unique source molecule being sequence (i.e. a UMI)
Sor Skip: the bases in the segment should be skipped or ignored, for example if they are monotemplate sequence generated by the library preparation
I personally do a lot of work with unique molecular identifiers (UMIs) or molecular barcodes, as well as technology that puts the sample barcode inline in the read (i.e. not in index one or two). Examples I can publicly link to are duplex sequencing (for STR calling in fgstr), or long read single-molecule assembly. Using tools that support an explicit Read Structure on the command line is my preferred mode for handling these types of data. Also check out the ExtractUmisFromBam tool to extract UMI information from an unmapped BAM, but if we include the molecular barcode segments in the Read Structure passed to the tools above, this step is not necessary.