This continues the series of posts meant to help you write concise and well-tested bioinformatic tools.
Lets write a tool that reads a BAM file, does some modification to the records, then writes a BAM file. Lets ensure that the NM/UQ/MD SAM tags on each read are accurate, which is important if aligners aren’t well-behaved (hint: they aren’t, I should know, I’ve written a few):
Lets examine it in a little more depth:
- lines 3-11: even more imports than last time!
- lines 18-20: we have required command line arguments for the paths to the input SAM or BAM file, an indexed reference FASTA file, and the output SAM or BAM file. These are all required. The need for the reference is to so we can re-compute the number of mismatches of the read relative to the reference.
- line 22-24: we validate we can read the input BAM and the reference FASTA, and validate that we can write to the output BAM path.
- line 26: we open the input BAM for reading using a
SamSource, which we previously explained that it implements an
- line 27: we use htsjdk’s
ReferenceSequenceFileWalkerto retrieve the reference sequence; we require the input BAM to be in coordinate sorted order so we can load each reference sequence (contig) sequentially and only once.
- line 28: we open the output BAM for writing using a
SamWriter. Records can be added one-by-one or as a collection using the
- line 30: we do a quick check that the input
SAMFileHeaderspecifies that the file is in coordinate order. We use fgbio’s
SamOrderto validate the order; we will see later in-depth the scala API for specifying sort orders. I leave as an exercise to the reader to add a check that the records are actually in coordinate order.
- line 32: using the usual way to iterate over a collection, we examine each record one-by-one
- line 33: we use the
regenerateNmUqMdTagsmethod to do the hard work, which is not covered here.
- line 34: the modified record is added to the output writer
- line 37: as explained previously, we close the input reader using the
safelyClosemethod to catch any exceptions
- line 38: the output reader is closed; here we want to know if there was a problem closing the file, since then our output may be malformed.
The help message for the tool is as follows:
So in 40 lines of code we wrote a tool to update the NM, UQ, and MD tags, and only ~10 lines in the
execute() method itself. I am omitting tests for this tool, which we absolutely should have, but to suffice it to say the
regenerateNmUqMdTags should not be tested here (it is tested elsewhere), just that the tool can be run end-to-end. Questions?