DemuxFastqs in fgbio is a flexible tool for sample demultiplexing. It takes FASTQs as input, one per sub-read (e.g. index read, read one, read two, fragment). You tell it which bases are part of the sample barcode, molecular barcode, and template, and which should be ignored (see Read Structures).

Some sequencers (e.g. MiSeq) output one FASTQ per sub-read. Other sequencers (e.g. NextSeq) or systems (e.g. BaseSpace) produce more than one FASTQ per sub-read. This may be because the data is spread across multiple lanes, or because the data is large and has been split across multiple FASTQs. Either way, since DemuxFastqs expects a single FASTQ per sub-read, some simple pre-processing is needed.

The first option is to concatenate the FASTQs for each sub-read. Remember, you also need to specify a read structure for each sub-read.
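A minimal sketch of the concatenation approach, assuming bcl2fastq-style file names (`<prefix>_L00<lane>_<read>_001.fastq.gz`). The `combine_lanes` helper, the sample prefix, the file names, and the read structures in the usage comment are all hypothetical, and the DemuxFastqs options shown are worth checking against `fgbio DemuxFastqs --help` for your version. Gzipped files can be joined with plain `cat`, since a series of gzip members is itself a valid gzip file.

```shell
# combine_lanes PREFIX READ OUT
# Concatenate every lane of one sub-read into a single FASTQ.
# Assumes bcl2fastq-style names: PREFIX_L00<lane>_READ_001.fastq.gz
combine_lanes() {
  # The glob expands in sorted order, so lanes are joined as L001, L002, ...
  cat "$1"_L00?_"$2"_001.fastq.gz > "$3"
}

# Example usage (hypothetical sample prefix and read structures):
#   for read in R1 R2 I1; do
#     combine_lanes sample_S1 "$read" "combined_${read}.fastq.gz"
#   done
#   fgbio DemuxFastqs \
#     --inputs combined_R1.fastq.gz combined_R2.fastq.gz combined_I1.fastq.gz \
#     --read-structures 150T 150T 8B \
#     --metadata SampleSheet.csv \
#     --metrics demux_metrics.txt \
#     --output demultiplexed/
```

Whatever naming scheme you have, the lanes must be concatenated in the same order for every sub-read, otherwise the records in the combined FASTQs will no longer line up.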

The second option removes the need to create combined FASTQs ahead of time and thus reduces the disk space used. Here, we use some shell magic to concatenate them on the fly. But I’d avoid this one for now, until Java is fixed (see issue htsjdk#1084).
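One way the on-the-fly approach might look, sketched with process substitution (bash or zsh, not POSIX sh) and keeping the caveat above in mind. The helper names, file-name pattern, read structures, and output names are hypothetical. The sketch decompresses while streaming, on the assumption that a `/dev/fd/...` path without a `.gz` suffix is read as plain text; the DemuxFastqs options should again be checked against `fgbio DemuxFastqs --help`.

```shell
# stream_lanes PREFIX READ
# Write every lane of one sub-read to stdout as one decompressed stream.
# Assumes bcl2fastq-style names: PREFIX_L00<lane>_READ_001.fastq.gz
stream_lanes() {
  gzip -dc "$1"_L00?_"$2"_001.fastq.gz
}

# demux_streamed PREFIX
# Hand each sub-read to DemuxFastqs through process substitution,
# so no combined FASTQ is written to disk.
# Read structures and output names below are hypothetical placeholders.
demux_streamed() {
  fgbio DemuxFastqs \
    --inputs <(stream_lanes "$1" R1) <(stream_lanes "$1" R2) <(stream_lanes "$1" I1) \
    --read-structures 150T 150T 8B \
    --metadata SampleSheet.csv \
    --metrics demux_metrics.txt \
    --output demultiplexed/
}
```

Calling `demux_streamed sample_S1` would then run the whole thing in one step. Each `<(...)` is presented to the tool as a file path backed by a pipe, which is what saves the disk space.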

You can also use named pipes, one per sub-read, but that’s beyond the scope of this post. As a mathematician would say, “an exercise left to the reader”.
