This continues the series of posts meant to help you write concise and well-tested bioinformatic tools.
Today I’ll briefly describe a tool named
SamToFastq that converts a SAM (or BAM) to FASTQ, when the SAM has bases and qualities stored in auxiliary tags that need to be included in the FASTQs bases and qualities. This is a real issue, as tools to convert SAM to FASTQ (see Picard’s
SamToFastq) may not reconstruct the full raw read, due to bases and qualities stored in auxiliary tags. This can occur if we have removed the sample barcode, unique molecular identifiers, or even omitted (garbage/spacer) sequence altogether. A discussion is being had on hts-specs, the repository storing the SAM specification, about how to encode in a SAM enough information to restore the FASTQ. One suggestion was to encode a grammar, to describe from which tags and how bases and qualities should be reconstructed. Another practical alternative was suggested by Heng Li, where we store bases removed from the 5′ (five-prime) and 3′ (three-prime) ends of the reads in four auxiliary SAM tags:
- S5: the bases removed from the 5′ end of the read
- Q5: the qualities removed from the 5′ end of the read
- S3: the bases removed from the 3′ end of the read
- Q3: the qualities removed from the 3′ end of the read
Lets review a tool I wrote in under an hour (with some basic unit tests) to reconstruct the original FASTQ from a SAM using these tags. The pull request can be see in fgbio#340 and I have copied its current state into a gist in case the pull request is modified.
The real meat is in the
- lines 66-67: The tool takes in an input SAM and outputs three FASTQs, one for fragments, read ones, and read twos.
- lines 68-71: The four SAM tags used to store the removed bases and qualities can be overridden.
- lines 83-86: We create a FASTQ writer for each type of read: fragment, read one, and read two. Obviously some of the files may be empty if we only have fragment read, or only have paired reads.
- liens 91-99: We iterate over the input SAM records, convert them, and write them to the appropriate file. The meat is in the
- lines 104-122: The tool will sort the input BAM by
querynameif it is not already sorted to ensure that we have reads in the same order (by name) across the FASTQs. We use the Bams class/API to easily sort the BAM file and return an iterator over the records. The
SelfClosingIteratorwill close the iterator and therefore the input file when iteration is complete so we don’t have to explicitly later.
- lines 127-149: The
toFastqRecordmethod converts the SAM record to a FASTQ record. We have to be careful to reverse complement and reverse the bases and qualities respectively if the read maps to the reverse strand. This is also where we retrieve the extra bases and qualities to prepend and append to the SAM records bases and qualities.
SamToFastqTest show a few unit tests for this tool, and I’ll explain a little later how to leverage some of the classes and APIs we’ve built to make testing easier, for example concisely simulating SAM records with
Now we’ll see if the discussion on hts-specs moves to this solution, and then we can submit the changes for a pull request and have a working implementation!