The Checkout script provides easy and flexible barcode-based separation of samples sequenced in the same run. It separates single- and paired-read data using one (master) or two (master and slave) barcodes.
In its basic setting, the script takes a configuration file and fastq file(s) as input:
$ groovy /path/to/Checkout.groovy barcodes.txt R1.fastq.gz,R2.fastq.gz /path/to/out/
The barcode file should contain the following:
sample_id / master_barcode / slave_barcode (optional)
Barcodes have a seed-based structure: uppercase characters are treated as the seed and lowercase characters as the fuzzy-matching part of the barcode. E.g. for “atgcAAAAatgcatgc” the script first searches for the “AAAA” seed with no mismatches allowed and then checks every seed occurrence against the full barcode, allowing for some mismatches (see the -e parameter).
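The seed-then-fuzzy search described above can be sketched roughly as follows. This is only an illustration in Python, not the script's actual code: the function name find_barcode and the max_mismatches argument (a stand-in for whatever -e controls) are hypothetical, and the sketch assumes the seed is a single contiguous run of plain A/C/G/T characters.

```python
def find_barcode(read, barcode, max_mismatches=2):
    """Return the offset of the first acceptable barcode match in `read`, or -1.

    Hypothetical helper: uppercase characters of `barcode` are the seed
    (exact match required), lowercase ones tolerate mismatches.
    """
    seed_idx = [i for i, c in enumerate(barcode) if c.isupper()]
    if not seed_idx:
        raise ValueError("barcode has no uppercase seed")
    # Assumption: the seed is one contiguous uppercase run.
    seed_start, seed_end = seed_idx[0], seed_idx[-1] + 1
    seed = barcode[seed_start:seed_end]
    pattern = barcode.upper()
    read = read.upper()

    pos = read.find(seed)                 # exact seed search, no mismatches
    while pos != -1:
        start = pos - seed_start          # where the full barcode would begin
        end = start + len(pattern)
        if start >= 0 and end <= len(read):   # truncated barcodes are rejected
            window = read[start:end]
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                return start
        pos = read.find(seed, pos + 1)    # try the next seed occurrence
    return -1
```

For instance, find_barcode("GGGATGGAAAAATGCATGCTTT", "atgcAAAAatgcatgc") accepts the barcode at offset 3 despite one mismatch in the lowercase part, while the same call with max_mismatches=0 finds nothing.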
Note that the script does not allow for truncated barcodes or indels (which should not be a problem for Illumina data and most experimental settings). The master barcode is searched in all possible read variants (R1, R1 reverse-complement, R2 and R2 reverse-complement) unless specified otherwise with the -o option.
The script handles redundant (ambiguous) DNA characters, e.g. “atnnMAAAatyc”.
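To show what matching against redundant characters means, here is a small Python sketch that uses the standard IUPAC nucleotide codes. The helper names are hypothetical, and how the script itself treats an 'N' that occurs in a read is not stated here; in this sketch it simply counts as a mismatch against A, C, G or T.

```python
# Standard IUPAC nucleotide codes: each pattern character maps to the
# set of bases it is allowed to match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def base_matches(pattern_char, read_char):
    """True if the read base is one of the bases the pattern code allows."""
    return read_char.upper() in IUPAC[pattern_char.upper()]

def count_mismatches(pattern, window):
    """Count positions where `window` is incompatible with `pattern`."""
    return sum(not base_matches(p, w) for p, w in zip(pattern, window))
```

With these definitions, count_mismatches("atnnMAAAatyc", "ATGGCAAAATCC") is 0, because M accepts C and y accepts C, while "ATGGGAAAATAC" gives 2 mismatches (G against M, A against y).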
The same matching is performed for the slave barcode, if specified. In contrast to the master, the slave barcode is searched in the reverse complement of the mate of the read containing the master barcode, e.g. in the R1 reverse-complement if the master barcode is in R2. This corresponds to standard Illumina output and can be changed by setting the -r parameter.
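The master/slave orientation rule can be illustrated with a simplified Python sketch. It uses exact substring matching instead of the seed-based fuzzy search, covers only the two cases where the master barcode is found in a read as sequenced (R1 or R2), and does not model the -o or -r options; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def locate_barcodes(r1, r2, master, slave):
    """Return (read_with_master, master_pos, slave_pos) or None.

    Simplification: exact matching, master searched only in R1 and R2
    as sequenced; the slave is looked up in the reverse complement of
    the mate of the read that holds the master.
    """
    for name, read, mate in (("R1", r1, r2), ("R2", r2, r1)):
        master_pos = read.find(master)
        if master_pos == -1:
            continue
        slave_pos = reverse_complement(mate).find(slave)
        if slave_pos != -1:
            return name, master_pos, slave_pos
    return None
```

If the master is found in R2, the slave position is reported relative to the R1 reverse-complement, which matches the example in the text.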
Barcode-containing reads are split so that the master-containing read is always placed in the sample_id_R1 fastq file. Unassigned reads are stored unchanged in undef_R1 and undef_R2, and summary statistics are written to out/checkout_log.txt. Use the -c option for compressed output.
Although I routinely use this script to parse fresh NGS data, its main purpose is to handle reads tagged with unique molecular barcodes (which makes the terminology a bit confusing). This is a rather novel technology aimed mainly at efficient error elimination and data normalization (have a look at these NAR and PNAS papers). To use it, simply mark the non-redundant region of the barcode with capital ‘N’ (e.g. “ATGCatgcNNNNNNNatgc”) and use the -s option. This transfers the MC region sequence and its quality as an additional “BC $sequence $quality” string to the header of the resulting fastq files.
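As a rough picture of what the header tagging amounts to, the Python sketch below pulls the bases and qualities under the capital 'N' positions and appends them to a read header. The function name, the offset argument (where the barcode match starts in the read) and the single space before "BC" are assumptions made for illustration; the script's exact header layout may differ.

```python
def tag_header_with_barcode_region(header, seq, qual, barcode, offset):
    """Append 'BC <sequence> <quality>' for the N-marked region to a header.

    Hypothetical helper: `offset` is the position in `seq` where the
    barcode match begins; only capital 'N' positions are extracted.
    """
    positions = [offset + i for i, c in enumerate(barcode) if c == "N"]
    region_seq = "".join(seq[p] for p in positions)
    region_qual = "".join(qual[p] for p in positions)
    # Assumption: the tag is separated from the original header by a space.
    return f"{header} BC {region_seq} {region_qual}"
```

For a read that starts with two extra bases before the barcode “ATGCatgcNNNNNNNatgc”, calling the function with offset=2 extracts the seven bases, and their quality characters, that lie under the N positions.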
PS: This script is rather fast and performs the barcode search in parallel (the number of threads can be set with -p).