VCF file annotation

Genomic alterations, such as point mutations, can be extracted from next-generation sequencing data via the combination of bowtie and SAMTools. The output of variant calling procedure by SAMtools is stored in *.vcf file, which is a gold standard format in variant calling. Given a high depth of coverage (typically around 100x) for the mutation site, one could be quite confident in its presence. After filtering the low-confidence variants, the next step is always to determine the functional properties of remaining mutations (i.e. to tell driver mutations from passengers and common SNPs).

This little script and a highly-compressed (137Mb) annotation bundle performs a comprehensive annotation of VCF entries:

  • Searches for matching SNP and COMSIC entries
  • Searches for parent transcript and segment (exon, intron or UTR)
  • Translates mutation-related codons and finds missense mutations and frameshift indels

Also it is far more simple in use and installation than Annovar and GATK tools

Simply download the script itself, then download vcf_annot_bundle.zip and extract in the same folder VcfAnnot.groovy is. The bundle contains up-to-date dbSNP and COSMIC entries as well as RefGene exon coordinates and fasta sequences.

Run the script as follows:

$ groovy VcfAnnot.groovy path/to/vcf

The “path/to/vcf” is either a path to single file, multiple files (comma-separated), or a directory with *.vcf files. Data could be pre-filtered by coverage depth (-d parameter) or maximum allowed allele frequency (-f parameter). If -u option is set, annotations are appended with URLs for fast navigation to Genome Browser, dbSNP and COSMIC.
IMPORTANT: unless -e option is set, the output will be performed in short form, selecting a single representative isoform (transcript). The hierarchy of representative isoforms is that mutation in Exon is preferred over UTR and Intronic mutations, and then the isoform with longest CDS (‘canonical’ according to UCSC GB group) is selected. With extended output, the position of mutation in all gene isoforms is reported.

A new file (in *.vcfa format) will be produced for each VCF input. The VCFa file has a header which is quite self-explanatory. In case of multiple transcripts, the positions and mutated codons are provided as a single comma-separated record. The script also handle Indels, but only translates them and informs of possible frameshift, as it is rarely possible to find the matching dbSNP and COSMIC entries. Note that the relative coordinate for intronic breakpoints are provided (to see if they are near splice sites): e.g. for intron of length 500bp ‘+10’, ‘-15’ and ‘250’ mean that variant is 10bp from the start of intron, 15bp from the end of intron and in the midst of intron respectively.

3 comments

    • xue

      Hi Mikessh,
      I noticed that using extended report option, for different isoforms related with a SNV, the output fields REF_CODON and VAR_CODON never change; and the AA_SEQ field contains the same type of amino acid before and after the codon change, for example, Met222Thr;Met256Thr;…

      I understand that in most of the case they shouldn’t change. But I never observed counter-case.

      Would you please tell me why is that?

      • mikessh

        Correct me if I’m wrong, but for the counter-case to happen there should be a gene that has two alternative isoforms with a pair of exons which overlap in genome and are in different frames. I don’t know any such case

Leave a comment