Barcode splitting of NGS data

The Checkout script provides easy and flexible barcode-based separation of samples sequenced in the same run. It allows separation of single- and paired-read data, based on one (master) or two (master and slave) barcodes.
In its basic setting, the script takes a configuration file and fastq file(s) as an input:

$ groovy /path/to/Checkout.groovy barcodes.txt R1.fastq.gz,R2.fastq.gz /path/to/out/

Barcode file should contain the following:
sample_id / master_barcode / slave_barcode(optional)

Barcode has a seed-based structure, treating capitalized characters as seed and non-capitalized as a fuzzy-matching part of barcode. E.g. “atgcAAAAatgcatgc” will first search for “AAAA” seed with no mismatches allowed and then inspect all occurrences with the full barcode allowing for some mismatches (see -e parameter).
Note that the script doesn’t allow for truncated barcodes and indels (which should not be a problem for Illumina data and most experimental settings). Master barcode is searched in all possible read variants: R1, R1 reverse-complement, R2 and R2 reverse-complement; unless specified by -o option.

The script handles redundant DNA characters, e.g. “atnnMAAAatyc”.

The same matching is performed for slave barcode if specified. In contrast to master, slave barcode is searched in a reverse complement of mate containing master barcode, e.g. R1 reverse-complement in case master barcode is in R2. This corresponds to standard Illumina output and could be changed by setting -r parameter.

Barcode-containing reads are split so that master-containing read is always placed into sample_id_R1 fastq file. Unassigned reads are stored in undef_R1 and undef_R2 unchanged, and the summary statistic is provided in out/checkout_log.txt. Use -c option for compressed output.

Though routinely used by me in parsing fresh NGS data, the main purpose of this script was to handle reads encoded with unique molecular barcodes (which is a bit confusing). This is a rather novel technology, with main aims on efficient error elimination and data normalization (have a look at these NAR and PNAS papers). To utilize it, simply mark the non-redundant region of barcode with capital ‘N’ (e.g. “ATGCatgcNNNNNNNatgc”) and use -s option. This will transfer MC region sequence and its quality as an additional “BC $sequence $quality” string in the header of resulting fastq files.

PS This script is rather fast and performs barcode search in parallel (number of threads could be set with -p).

VCF file annotation

Genomic alterations, such as point mutations, can be extracted from next-generation sequencing data via the combination of bowtie and SAMTools. The output of variant calling procedure by SAMtools is stored in *.vcf file, which is a gold standard format in variant calling. Given a high depth of coverage (typically around 100x) for the mutation site, one could be quite confident in its presence. After filtering the low-confidence variants, the next step is always to determine the functional properties of remaining mutations (i.e. to tell driver mutations from passengers and common SNPs).

This little script and a highly-compressed (137Mb) annotation bundle performs a comprehensive annotation of VCF entries:

  • Searches for matching SNP and COMSIC entries
  • Searches for parent transcript and segment (exon, intron or UTR)
  • Translates mutation-related codons and finds missense mutations and frameshift indels

Also it is far more simple in use and installation than Annovar and GATK tools

Simply download the script itself, then download vcf_annot_bundle.zip and extract in the same folder VcfAnnot.groovy is. The bundle contains up-to-date dbSNP and COSMIC entries as well as RefGene exon coordinates and fasta sequences.

Run the script as follows:

$ groovy VcfAnnot.groovy path/to/vcf

The “path/to/vcf” is either a path to single file, multiple files (comma-separated), or a directory with *.vcf files. Data could be pre-filtered by coverage depth (-d parameter) or maximum allowed allele frequency (-f parameter). If -u option is set, annotations are appended with URLs for fast navigation to Genome Browser, dbSNP and COSMIC.
IMPORTANT: unless -e option is set, the output will be performed in short form, selecting a single representative isoform (transcript). The hierarchy of representative isoforms is that mutation in Exon is preferred over UTR and Intronic mutations, and then the isoform with longest CDS (‘canonical’ according to UCSC GB group) is selected. With extended output, the position of mutation in all gene isoforms is reported.

A new file (in *.vcfa format) will be produced for each VCF input. The VCFa file has a header which is quite self-explanatory. In case of multiple transcripts, the positions and mutated codons are provided as a single comma-separated record. The script also handle Indels, but only translates them and informs of possible frameshift, as it is rarely possible to find the matching dbSNP and COSMIC entries. Note that the relative coordinate for intronic breakpoints are provided (to see if they are near splice sites): e.g. for intron of length 500bp ‘+10’, ‘-15’ and ‘250’ mean that variant is 10bp from the start of intron, 15bp from the end of intron and in the midst of intron respectively.