This continues the series of posts meant to help you write concise and well-tested bioinformatic tools.

Lets write a tool that reads a SAM or a BAM file.  Since just reading it is boring, lets count the number of fragment and paired end reads respectively.  Here’s the tool:

Lets examine it in a little more depth:

  • line 3-8: you’ll see a bunch more imports than the previous post; I’ll get to their usage in a second.
  • line 16: here’s where you specify a required command line option, namely the input file.   Any input argument annotated with the @arg annotation is exposed on the command line.  In this case, there’s no default value, so it is required.  The type here is PathToBam, which is a type alias to java.nio.file.Path.  I’ll do a deep dive into command line arguments later (ex. collections, options, required vs. optional, grouping).
  • line 18: the class body is the constructor in scala, so any validation of the command line arguments can occur here. In this case, we validate that the input SAM or BAM is readable.  A follow-on post will show some more examples of command line validation.
  • line 20: the execute() method contains the logic of our tool.  On this line we open a reader for the SAM or BAM file.  A SamSource implements an Iterator over SamRecords.  This gives us all the usual scala collection methods, such as map, filter, takeWhile (be careful about with this one), and so on.  The SamRecord trait is the scala version (a wrapper) of htsjdk’s SAMRecord.
  • line 21-22: while scala purists generally frown upon using mutable variables, lets do so for simplicity
  • line 23: as mentioned above, SamSource implements Iterator[SamRecord], and so we can use the usual scala collection methods.  Here we filter out reads that are secondary, supplementary, or paired-end and second-of-pair.  This means that the remaining records should either be fragment reads (not paired-end), or paired-end and first-of-pair.
  • line 24: for each remaining record, we increment the appropriate variable based on if the read is paired or not.
  • line 25: we print out the results after iterating over all the records.  Typically we use the Metric class to store, document, and write metrics, but I’ll get to that in a later post.
  • line 26: the implicit class com.fulcrumgenomics.commons.CommonsDef.SafelyClosable provides a method safelyClose that calls the close() method on a Closeable object and catches any exception.  This is useful in cases where we don’t care if the files was successfully closed, like when we are reading in data.

If we run the tool without any arguments, we get the usage as well as a message stating we need to supply an input SAM or BAM (it’s colorized in your terminal):

Next, when we run the tool on dummy, we the following output:

As you can see, the tool’s source code is only 28 lines long, and only 7 lines in the execute() method itself.   To my eye, it is extremely readable and concise.  You know exactly what the tool is doing.  Neat huh?

Leave a Reply

Your email address will not be published. Required fields are marked *