VEP

Official page: https://www.ensembl.org/info/docs/tools/vep/index.html (web interface).

Detailed instructions for installation are available from here,

http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer.

There are several possible ways to install under csd3: GitHub, R and docker.

— GitHub —

GitHub Page: https://github.com/Ensembl/ensembl-vep.

The ease with this option lies with GitHub in that updates can simply be made with git pull to an exisiting release.

cd $HPC_WORK
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
# ---
# release/98
# module load htslib/1.4
# perl INSTALL.pl
# release/104
# long form:
# perl INSTALL.pl --DESTDIR Bio --ASSEMBLY GRCh38 --AUTO acfp --PLUGINS all --SPECIES homo_sapiens,homo_sapiens_merged --NO_TEST --CACHEDIR .vep
# ---
module load htslib-1.9-gcc-5.4.0-p2taavl
# short form:
perl INSTALL.pl -l Bio -y GRCh38 -a acfp -g all -s homo_sapiens,homo_sapiens_merged --NO_TEST -c .vep
ln -sf $HPC_WORK/ensembl-vep/.vep $HOME/.vep
# set up symbolic links to the executables
for f in convert_cache.pl filter_vep haplo variant_recoder vep;
    do ln -sf $HPC_WORK/ensembl-vep/$f $HPC_WORK/bin/$f; done
vep -i examples/homo_sapiens_GRCh37.vcf -o examples/homo_sapiens_GRCh37.txt \
    --force_overwrite --offline --pick --symbol

A (slightly edited due to two species at -s were installed separately) log file can be found here, VEP.log.

Note in particular that by default, the cache files will be installed at $HOME which would exceed the quota (<40GB) of an ordinary user, and as before the destination was redirected. The setup above facilitates storage of cache files, FASTA files and plugins.

The FASTA file should be automatically detected by the VEP when using –cache or –offline.

If it is not, use "–fasta $HOME/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"

Remember to use –merged or –refseq when running the VEP with _merged or _refseq cache!

Without the htslib module, the --NO_HTSLIB option is needed but "Cannot use format gff without Bio::DB::HTS::Tabix module installed". Bio::DB:HTS is in https://github.com/Ensembl/Bio-DB-HTS and change can be made to the Makefile of htslibs for a desired location, to be used by Build.PL via its command line parameters.

It is notable that VEP accepts compress (.gz) input. It is worthwhile to check for the VEP plugins: https://github.com/Ensembl/VEP_plugins. For instance, to enable the PolyPhen_SIFT plugin we first generate the database and then annotate locally.

cd ${HPC_WORK}
git clone https://github.com/Ensembl/VEP_plugins
cd VEP_plugins
# download of database
vep --database --port 3337 --force_overwrite --dir_plugin . --plugin PolyPhen_SIFT,create_db=1
# offline annotation
vep -i homo_sapiens_GRCh37.vcf -o homo_sapiens_GRCh37 --offline --force_overwrite --format vcf --dir_plugin . --plugin PolyPhen_SIFT

By default, the database port number is 5306. The create_db=1 option downloads homo_sapiens.PolyPhen_SIFT.db at ${HOME}/.vep.

One may wish to skip the comments (lines started with ##) in processing of the output, e.g., in R,

export skips=$(grep '##' examples/homo_sapiens_GRCh37.txt | wc -l)
R --no-save -q <<END
  vo <- read.delim("examples/homo_sapiens_GRCh37.txt",skip=as.numeric(Sys.getenv("skips")))
  head(vo)
END

allowing for variable number of lines given various command-line options to be skipped to have

  X.Uploaded_variation    Location Allele            Gene         Feature
1          rs116645811 21:26960070      A ENSG00000154719 ENST00000307301
2            rs1135638 21:26965148      A ENSG00000154719 ENST00000307301
3              rs10576 21:26965172      C ENSG00000154719 ENST00000307301
4            rs1057885 21:26965205      C ENSG00000154719 ENST00000307301
5          rs116331755 21:26976144      G ENSG00000154719 ENST00000307301
6            rs7278168 21:26976222      T ENSG00000154719 ENST00000307301
  Feature_type        Consequence cDNA_position CDS_position Protein_position
1   Transcript   missense_variant          1043         1001              334
2   Transcript synonymous_variant           939          897              299
3   Transcript synonymous_variant           915          873              291
4   Transcript synonymous_variant           882          840              280
5   Transcript synonymous_variant           426          384              128
6   Transcript synonymous_variant           348          306              102
  Amino_acids  Codons Existing_variation
1         T/M aCg/aTg                  -
2           G ggC/ggT                  -
3           P ccA/ccG                  -
4           V gtA/gtG                  -
5           L ctT/ctC                  -
6           K aaG/aaA                  -
                                                                     Extra
1 IMPACT=MODERATE;STRAND=-1;SYMBOL=MRPL39;SYMBOL_SOURCE=HGNC;HGNC_ID=14027
2      IMPACT=LOW;STRAND=-1;SYMBOL=MRPL39;SYMBOL_SOURCE=HGNC;HGNC_ID=14027
3      IMPACT=LOW;STRAND=-1;SYMBOL=MRPL39;SYMBOL_SOURCE=HGNC;HGNC_ID=14027
4      IMPACT=LOW;STRAND=-1;SYMBOL=MRPL39;SYMBOL_SOURCE=HGNC;HGNC_ID=14027
5      IMPACT=LOW;STRAND=-1;SYMBOL=MRPL39;SYMBOL_SOURCE=HGNC;HGNC_ID=14027
6      IMPACT=LOW;STRAND=-1;SYMBOL=MRPL39;SYMBOL_SOURCE=HGNC;HGNC_ID=14027
>

When a particular release really works well on the system, it is possible to install it, e.g.,

git checkout release/98
perl INSTALL.pl

for release 98 instead of the latest from git pull.

It could be useful to filter VEP output, see https://www.ensembl.org/info/docs/tools/vep/script/vep_filter.html.

Nearest gene

This can produces error message

-------------------- EXCEPTION --------------------
MSG: ERROR: --nearest requires Set::IntervalTree perl module to be installed

and we get around with

perl -MCPAN -e shell
install Set::IntervalTree

which will enable --nearest gene.

Annotation in chunks

A toy example, following http://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#faster, is given as follows,

cd examples
bgzip -f homo_sapiens_GRCh37.vcf
tabix -Cf homo_sapiens_GRCh37.vcf.gz
tabix -h  homo_sapiens_GRCh37.vcf.gz 22:50616005-50616006 | \
vep --cache --fork 4 --port 3337 --format vcf -o - --tab --no_stats | \
grep -v '##'
cd -

from this we could propagate the idea for autosomes in chunks as follows,

export ref=~/rds/post_qc_data/interval/reference_files/genetic/reference_files_genotyped_imputed/
export chunk_size=10000
seq 22 | \
parallel -j1 --env ref -C' ' '
  export n=$(awk "END{print NR-1}" ${ref}/impute_{}_interval.snpstats)
  export g=$(expr ${n} / ${chunk_size})
  export s=$(expr $n - $(( $g * $chunk_size)))
  (
    for i in $(seq ${g}); do
      (
        awk "BEGIN{print \"##fileformat=VCFv4.0\"}"
        awk -vOFS="\t" "BEGIN{print \"#CHROM\",\"POS\",\"ID\",\"REF\",\"ALT\",\"QUAL\",\"FILTER\",\"INFO\"}"
        (
          sed "1d" ${ref}/impute_{}_interval.snpstats | \
          awk -v i=${i} -v chunk_size=${chunk_size} -v OFS="\t" "NR==(i-1)*chunk_size+1,NR==i*chunk_size {
              if(\$1==\".\") \$1=\$3+0 \":\" \$4 \"_\" \$5 \"/\" \$6; print \$3+0,\$4,\$1,\$5,\$6,\".\",\".\",\$19}"
          if [ ${s} -gt 0 ] &&  [ ${i} -eq ${g} ]; then
             sed "1d" ${ref}/impute_{}_interval.snpstats | \
             awk -v i=${i} -v chunk_size=${chunk_size} -v OFS="\t" -v n=${n} "NR==i*chunk_size+1,NR==n {
                 if(\$1==\".\") \$1=\$3+0 \":\" \$4 \"_\" \$5 \"/\" \$6; print \$3+0,\$4,\$1,\$5,\$6,\".\",\".\",\$19}"
          fi
        )
      ) | \
      vep  --cache --offline --format vcf -o - --tab --pick --no_stats  \
           --species homo_sapiens --assembly GRCh37 --port 3337 | \
      (
        if [ ${i} -eq 1 ]; then cat; else grep -v "#"; fi
      )
    done
  ) | \
  gzip -f > work/INTERVAL-{}.vep.gz
'

Note we use information from .snpstats files at location ref to build input in vcf format on the fly and feed into VEP. For instance impute_22_interval.snpstats contains the following lines,

SNPID	RSID	chromosome	position	A_allele	B_allele	minor_allele	major_allele	AA	AB	BB	AA_calls	AB_calls	BB_calls	MAF	HWE	missing	missing_calls	information
rs587697622	rs587697622	22	16050075	A	G	G	A	43058	1.2499	0	43058	1	0	1.4514e-05	-0	0	0	0.81003
rs587755077	rs587755077	22	16050115	G	A	A	G	42485	572.42	1.7625	42105	306	1	0.0066878	0.14341	3.5437e-09	0.015026	0.68672
rs587654921	rs587654921	22	16050213	C	T	T	C	42848	211.12	0.16248	42746	124	0	0.0024553	-0	1.4175e-09	0.0043893	0.67168
rs587712275	rs587712275	22	16050319	C	T	T	C	43022	36.998	0	43022	23	0	0.00042962	-0	0	0.00032514	0.69071

ENSEMBL-synonym translation

The ENSEMBL-synonym translation is useful to check for the feature types – in the case of ENSG00000160712 (IL6R) we found ENST00000368485 and ENST00000515190, we do

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ensemblToGeneName.txt.gz
zgrep -e ENST00000368485 -e ENST00000515190 ensemblToGeneName.txt.gz

giving

ENST00000368485 IL6R
ENST00000515190 IL6R

though this could also be furnished with R/biomaRt as follows,

library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host="grch37.ensembl.org", path="/biomart/martservice")
attr <- listAttributes(ensembl)
g <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'description', 'hgnc_symbol')
t <- c('ensembl_transcript_id', 'transcription_start_site', 'transcript_start', 'transcript_end')
u <- "uniprotswissprot"
gtu <- getBM(attributes = c(g,t,u), mart = ensembl)

For ENSEMBL genes, R/grex is likely to work though it was developed for other purpose, e.g.,

R -q -e "grex::grex(\"ENSG00000160712\")"

giving

       ensembl_id entrez_id hgnc_symbol              hgnc_name cyto_loc
1 ENSG00000160712      3570        IL6R interleukin 6 receptor   1q21.3
  uniprot_id   gene_biotype
1     A0N0L5 protein_coding

The aligment of ENSG, ENST, ENSP is ensGtp.txt.gz at the same UCSC directory above.

Other possibilities include UCSC/ensembl MySQL server, e.g.,

mysql  --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 <<END
select distinct G.gene,N.value from ensGtp as G, ensemblToGeneName as N where
   G.transcript=N.name and
   G.gene in ("ENSG00000160712");
END

and

mysql -h ensembldb.ensembl.org --port 5306  -u anonymous -D homo_sapiens_core_64_37 -A <<END
select distinct
   G.stable_id,
   S.synonym
from
  gene_stable_id as G,
  object_xref as OX,
  external_synonym as S,
  xref as X ,
  external_db as D
where
  D.external_db_id=X.external_db_id and
  X.xref_id=S.xref_id and
  OX.xref_id=X.xref_id and
  OX.ensembl_object_type="Gene" and
  G.gene_id=OX.ensembl_id and
  G.stable_id in ("ENSG00000160712");
END

according to https://www.biostars.org/p/14367/.

Perhaps it is the easiest to use gprofiler2, i.e.,

gprofiler2::gconvert("ENSG00000164761")

giving

  input_number input target_number          target name                                               description                           namespace
1            1  IL6R           1.1 ENSG00000160712 IL6R interleukin 6 receptor [Source:HGNC Symbol;Acc:HGNC:6019] ENTREZGENE,HGNC,UNIPROT_GN,WIKIGENE

clinvar

Web: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/.

The local installation enables considerable flexibilty, based on https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html#custom_options.

# Compressed VCF file/Index file
# GCRh37
curl ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz -o clinvar_GRCh37.vcf.gz
curl ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz.tbi -o clinvar_GRCh37.vcf.gz.tbi
# GCRh38
curl ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz -o clinvar_GRCh38.vcf.gz
curl ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi -o clinvar_GRCh38.vcf.gz.tbi

Information is gathered from the header of the VCF file,

ClinVar Variation ID Description  
AF_ESP allele frequencies from GO-ESP  
AF_EXAC allele frequencies from ExAC  
AF_TGP allele frequencies from TGP  
ALLELEID the ClinVar Allele ID  
CLNDN ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB  
CLNDNINCL For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB  
CLNDISDB Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN  
CLNDISDBINCL For included Variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN  
CLNHGVS Top-level (primary assembly, alt, or patch) HGVS expression.  
CLNREVSTAT ClinVar review status for the Variation ID  
CLNSIG Clinical significance for this single variant  
CLNSIGCONF Conflicting clinical significance for this single variant  
CLNSIGINCL Clinical significance for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:clinical significance.  
CLNVC Variant type  
CLNVCSO Sequence Ontology id for variant type  
CLNVI the variant's clinical sources reported as tag-value pairs of database and variant identifier  
DBVARID nsv accessions from dbVar for the variant  
GENEINFO Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar ( )
MC comma separated list of molecular consequence in the form of Sequence Ontology ID molecular_consequence
ORIGIN Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic ; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other  
RS dbSNP ID (i.e. rs number)  
SSR Variant Suspect Reason Codes. One or more of the following values may be added: 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other  

We now query rs2228145,

vep --id "1 154426970 154426970 A/C 1" --species homo_sapiens -o rs2228145 --cache --offline --force_overwrite \
    --assembly GRCh37 --custom clinvar_GRCh37.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN

which gives

#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	Extra
1_154426970_A/C	1:154426970	C	ENSG00000160712	ENST00000344086	Transcript	intron_variant	-	-	-	-	-	-	IMPACT=MODIFIER;STRAND=1;ClinVar=14660;ClinVar_CLNDN=Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=association;ClinVar_FILTER=.
1_154426970_A/C	1:154426970	C	ENSG00000160712	ENST00000368485	Transcript	missense_variant	1510	1073	358	D/A	gAt/gCt	-	IMPACT=MODERATE;STRAND=1;ClinVar=14660;ClinVar_CLNDN=Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=association;ClinVar_FILTER=.
1_154426970_A/C	1:154426970	C	ENSG00000160712	ENST00000476006	Transcript	downstream_gene_variant	-	-	-	-	-	-	IMPACT=MODIFIER;DISTANCE=4515;STRAND=1;FLAGS=cds_start_NF,cds_end_NF;ClinVar=14660;ClinVar_CLNDN=Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=association;ClinVar_FILTER=.
1_154426970_A/C	1:154426970	C	ENSG00000160712	ENST00000502679	Transcript	non_coding_transcript_exon_variant	386	-	-	-	-	-	IMPACT=MODIFIER;STRAND=1;ClinVar=14660;ClinVar_CLNDN=Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=association;ClinVar_FILTER=.
1_154426970_A/C	1:154426970	C	ENSG00000160712	ENST00000507256	Transcript	non_coding_transcript_exon_variant	271	-	-	-	-	-	IMPACT=MODIFIER;STRAND=1;ClinVar=14660;ClinVar_CLNDN=Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=association;ClinVar_FILTER=.
1_154426970_A/C	1:154426970	C	ENSG00000160712	ENST00000515190	Transcript	missense_variant	481	482	161	D/A	gAt/gCt	-	IMPACT=MODERATE;STRAND=1;FLAGS=cds_start_NF,cds_end_NF;ClinVar=14660;ClinVar_CLNDN=Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=association;ClinVar_FILTER=.

A HTML summary (somehow the web browser may not display the embedded figures) is also available. The Extra column looks clumsy and we could add the --tab option to generate a tab-delimited output.

vep --id "1 154426970 154426970 A/C 1" --species homo_sapiens -o rs2228145 --cache --offline --force_overwrite \
    --custom clinvar_GRCh37.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN --tab \
    --fields Uploaded_variation,Gene,Consequence,ClinVar_CLNSIG,ClinVar_CLNREVSTAT,ClinVar_CLNDN

to give neatly

#Uploaded_variation	Gene	Consequence	ClinVar_CLNSIG	ClinVar_CLNREVSTAT	ClinVar_CLNDN
1_154426970_A/C	ENSG00000160712	intron_variant	association	no_assertion_criteria_provided	Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus
1_154426970_A/C	ENSG00000160712	missense_variant	association	no_assertion_criteria_provided	Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus
1_154426970_A/C	ENSG00000160712	downstream_gene_variant	association	no_assertion_criteria_provided	Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus
1_154426970_A/C	ENSG00000160712	non_coding_transcript_exon_variant	association	no_assertion_criteria_provided	Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus
1_154426970_A/C	ENSG00000160712	non_coding_transcript_exon_variant	association	no_assertion_criteria_provided	Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus
1_154426970_A/C	ENSG00000160712	missense_variant	association	no_assertion_criteria_provided	Interleukin_6,_serum_level_of,_quantitative_trait_locus|Soluble_interleukin-6_receptor,_serum_level_of,_quantitative_trait_locus

dbNSFP

Web page: https://sites.google.com/site/jpopgen/dbNSFP.

This is set up as follows,

wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.1a.zip
unzip dbNSFP4.1a.zip -d dbNSFP4.1a
cd dbNSFP4.1a
# code for 4.0a
# zcat dbNSFP4.0a_variant.chr1.gz | head -n1 > h
# zgrep -h -v ^#chr dbNSFP4.0a_variant.chr* | sort -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP4.0a.gz
# efficient version
(
  zcat dbNSFP4.1a_variant.chr1.gz | head -n1
  export prefix=dbNSFP4.1a_variant.chr
  echo ${prefix}{1..22}.gz ${prefix}M.gz ${prefix}X.gz ${prefix}Y.gz | \
  xargs -I {} bash -c "zcat {} | sed '1d'"
) | bgzip -c > dbNSFP4.1a.gz
tabix -s 1 -b 2 -e 2 dbNSFP4.1a.gz
cd -
vep -i examples/homo_sapiens_GRCh37.vcf -o test --cache --force_overwrite --offline \
    --plugin dbNSFP,dbNSFP4.1a/dbNSFP4.1a.gz,LRT_score,FATHMM_score,MutationTaster_score

Since this release is frozen on Ensembl 94's transcript set, one may prefer to use it independently via its Java programs, e.g.,

java -jar search_dbNSFP41a.jar -i tryhg19.in -o tryhg19.out -v hg19
java -jar search_dbNSFP41a.jar -i tryhg38.in -o tryhg38.out

GeneSplicer

Web: https://ccb.jhu.edu/software/genesplicer/.

Setup

wget -qO- ftp://ftp.ccb.jhu.edu/pub/software/genesplicer/GeneSplicer.tar.gz | \
tar xvfz -
mv GeneSplicer.pm ~/.vep/Plugins
./vep -i variants.vcf --plugin GeneSplicer,[path_to_genesplicer_bin],[path_to_training_dir],[option1=value],[option2=value]

Reference

M. Pertea , X. Lin , S. L. Salzberg. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001 Mar 1;29(5):1185-90.

loftee

GitHub page: https://github.com/konradjk/loftee.

Reference: MacArthur DG et al (2012). A systematic survey of loss-of-function variants in human protein-coding genes. Science 335:823–828

It is actually part of the standard VEP plugins.

cd loftee
# human_ancestor_fa
## samtools --version gives 0.1.19
wget http://www.broadinstitute.org/~konradk/loftee/human_ancestor.fa.rz
wget http://www.broadinstitute.org/~konradk/loftee/human_ancestor.fa.rz.fai
## samtools --version gives 1.x
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi
# conservation_file -- note the data, schema and GERP files are required only to build the sql file
wget -qO- https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/phylocsf_gerp.sql.gz | \
gunzip -c > phylocsf_gerp.sql
# wget https://www.broadinstitute.org/~konradk/loftee/phylocsf_data.tsv.gz
# wget https://www.broadinstitute.org/~konradk/loftee/phylocsf_data_schema.sql
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/GERP_scores.final.sorted.txt.gz
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/GERP_scores.exons.txt.gz
# annotation
vep --id "1 154426970 154426970 A/C 1" --species homo_sapiens -o rs2228145 --cache --offline --force_overwrite \
    --assembly GRCh37 --plugin LoF,loftee_path:.,human_ancestor_fa:human_ancestor.fa.gz,conservation_file:phylocsf_gerp.sql.gz
# VEP documentation example
vep --input_file homo_sapiens_GRCh37.vcf --output_file test --cache --dir_cache ${HPC_WORK}/ensembl-vep/.vep --dir_plugins ${HPC_WORK}/loftee --offline \
    --pick --force_overwrite --species homo_sapiens --assembly GRCh37 \
    --plugin LoF,loftee_path:.,human_ancestor_fa:human_ancestor.fa.gz,conservation_file:phylocsf_gerp.sql

If offers implementation for parsing CSQ field but is also possible with R as described below. Note that if loftee_path uses an absolute path, that path should also be within PERL5LIB, e.g.,

export PERL5LIB=$PERL5LIB:$HPC_WORK/loftee

is put in .bashrc. We see now all files are provided, and there are complaints over rz was not generated by bgzip, so we do

# error
samtools faidx human_ancestor.fa.rz
# rework
gunzip -c human_ancestor.fa.rz | bgzip -c > output.fa.gz
# check and put it back
ll human_ancestor.fa.rz output.fa.gz
mv output.fa.gz human_ancestor.fa.rz
# now possible and check
samtools faidx human_ancestor.fa.rz
ll human_ancestor.fa.*

BigWig file

One can have additional features installed such as JSON, Set::IntervalTree, Bio::DB::BigFile, PerlIO::gzip and ensembl-xs. We exemplify JSON and Bio::DB::BigFile here,

Also see https://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#bigfile.

# 1. Download and extract source code
cd $HOME
wget -qO- https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz | \
tar xzf -
# 2. Set up compiling flags
export KENT_SRC=$HOME/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=`mysql_config --include | sed -e 's/^-I//g'`
export MYSQLLIBS=`mysql_config --libs`
# 3. Amend parameters
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
# 4. Build library
make clean && make
cd ../jkOwnLib
make clean && make
# 5. On Mac OSX
ln -s $KENT_SRC/lib/x86_64/* $KENT_SRC/lib/
# 6. Install Perl modules
cd ${HPC_WORK}/ensembl-vep
cpan JSON
cpan Bio::DB::BigFile
# 7. Test
perl -Imodules t/AnnotationSource_File_BigWig.t

We have from step 7 above

ok 1 - use Bio::EnsEMBL::VEP::AnnotationSource::File;
ok 2 - use Bio::EnsEMBL::VEP::AnnotationSource::File::BigWig;
ok 3 - use Bio::EnsEMBL::VEP::Config;
ok 4 - get new config object
ok 5 - new is defined
Couldn't open foo
ok 6 - new with invalid file throws
ok 7 - use Bio::EnsEMBL::VEP::Parser::VCF;
ok 8 - get parser object
ok 9 - use Bio::EnsEMBL::VEP::InputBuffer;
ok 10 - check class
ok 11 - check buffer next
ok 12 - annotate_InputBuffer - overlap
ok 13 - annotate_InputBuffer - exact, additive
ok 14 - annotate_InputBuffer - out by 1 (5')
ok 15 - annotate_InputBuffer - out by 1 (3')
ok 16 - overlap fixedStep
1..16

We have the GRCh38 branch, git clone --depth 1 --branch grch38 https://github.com/konradjk/loftee.git.

# GRCh38 files
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.fai
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.gzi
# wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz

# Annotation
export ENSEMBL=~/rds/rds-jmmh2-public_databases/ensembl-vep
export LOFTEE38=${ENSEMBL}/loftee/loftee_data/GRCh38
export LOFTEE38GERP=${LOFTEE38}/gerp_conservation_scores.homo_sapiens.GRCh38.bw
export LOFTEE38HA=${LOFTEE38}/human_ancestor.fa.gz
export LOFTEE38SQL=${LOFTEE38}/loftee.sql
export rds=..  # ~/rds/rds-jmmh2-public_databases/ensembl-vep will be user-specific
vep --input_file for_VEP.txt --format ensembl --output_file ${rds}/for_VEP_output2.txt --force_overwrite --offline --symbol --merged \
    --fasta Homo_sapiens.GRCh38.dna.toplevel.fa --dir_cache ${rds}/.vep --dir_plugins . \
    --protein --symbol --tsl --canonical --mane_select --biotype --check_existing --sift b --polyphen b \
    --plugin LoF,loftee_path:.,gerp_bigwig:${LOFTEE38GERP},human_ancestor_fa:${LOFTEE38HA},conservation_file:${LOFTEE38SQL}

See also https://docs.databricks.com/applications/genomics/secondary/vep-pipeline.html.

REVEL

REVEL: Rare Exome Variant Ensemble Learning

Plugin: https://m.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#revel

Data: https://sites.google.com/site/revelgenomics/downloads.

Setup

mkdir REVEL
cd REVEL
wget https://rothsj06.u.hpc.mssm.edu/revel-v1.3_all_chromosomes.zip
unzip revel-v1.3_all_chromosomes.zip
cat revel_with_transcript_ids | tr "," "\t" > tabbed_revel.tsv
sed '1s/.*/#&/' tabbed_revel.tsv > new_tabbed_revel.tsv
bgzip new_tabbed_revel.tsv
# GRCh37:
tabix -f -s 1 -b 2 -e 2 new_tabbed_revel.tsv.gz
# GRCh38:
zcat new_tabbed_revel.tsv.gz | head -n1 > h
zgrep -h -v ^#chr new_tabbed_revel.tsv.gz | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat h - | bgzip -c > new_tabbed_revel_grch38.tsv.gz
tabix -f -s 1 -b 3 -e 3 new_tabbed_revel_grch38.tsv.gz

Example

vep --input_file <input> --plugin REVEL,${ENSEMBL}/.vep/Plugins/new_tabbed_revel.tsv.gz

Reference

Ioannidis NM, et al. REVEL: An ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 2016; 99(4):877-885. http://dx.doi.org/10.1016/j.ajhg.2016.08.016

— R —

This is a wrapper and the Ensembl VEP perl script must be installed in your path. Expected to be slower than the --offline mode above, it is relatively easy to set up,

BiocManager::install("ensemblVEP")
vignette("ensemblVEP")
library(ensemblVEP)
file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
param <- VEPFlags(flags=list(vcf=TRUE,check_existing=TRUE,symbol=TRUE,terms="SO",sift="b",polyphen="p"))
vep <- ensemblVEP(file, param)
info(vep)$CSQ
csq <- parseCSQToGRanges(vep)
head(csq, 2)
# clinvar
param <- VEPFlags(flags=list(vcf=TRUE,output_file="clinvar.vcf",force_overwrite=TRUE,assembly="GRCh37",port=3337,
                  custom="clinvar_GRCh37.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN"))
ensemblVEP(file, param)

The second ensemblVEP obtains clinvar.vcf and clinvar.vcf_summary.html. Annotation is made to a VCF file, and returns data with unparsed ‘CSQ'.

The facility to parse the CSQ column of a VCF object is done for the documentation example.

# VCF output from the VEP web interface or the call above
vep <- "INF1.merge.trans.vcf"
# Parse into a GRanges and include the 'VCFRowID' column.
vcf <- readVcf(vep, "hg19")
csq <- parseCSQToGRanges(vep, VCFRowID=rownames(vcf))
write.table(mcols(csq),file="INF1.merge.trans.txt", quote=FALSE, sep="\t")

The dbNSFP counterpart is also possible

BiocManager::install("myvariant")
library(VariantAnnotation)
file <- system.file("extdata", "dbsnp_mini.vcf", package="myvariant")
vcf <- readVcf(file, genome="hg19")
rowRanges(vcf)
library(myvariant)
hgvs <- formatHgvs(vcf, variant_type="snp")
head(hgvs)
getVariants(hgvs)
rsids <- paste("rs", info(vcf)$RS, sep="")
head(rsids)
res <- queryVariants(q=rsids, scopes="dbsnp.rsid", fields="all")
fields <- names(res)
cadd <- grep('cadd',fields)
res[fields[cadd]]

Note that the CSQ field could also be handled by bcftools split-vep plugin, see http://samtools.github.io/bcftools/howtos/plugin.split-vep.html.

— docker —

See docker/Dockerfile from the GitHub directory above, or https://github.com/Ensembl/ensembl-vep.

— Virtual machine —

See http://www.ensembl.org/info/data/virtual_machine.html which is possibly best for MicroSoft Windows and is not pursued here.

ENSEMBL-synonym translation (hg19) file