polyphen-2
Official page: http://genetics.bwh.harvard.edu/pph2/dokuwiki/start.
The software can be set up as follows,
cd $HPC_WORK
wget -qO- http://genetics.bwh.harvard.edu/pph2/dokuwiki/_media/polyphen-2.2.2r405c.tar.gz | tar xzf -
wget -qO- ftp://genetics.bwh.harvard.edu/pph2/bundled/polyphen-2.2.2-databases-2011_12.tar.bz2 | tar xjf -
wget -qO- ftp://genetics.bwh.harvard.edu/pph2/bundled/polyphen-2.2.2-alignments-mlc-2011_12.tar.bz2 | tar xjf -
wget -qO- ftp://genetics.bwh.harvard.edu/pph2/bundled/polyphen-2.2.2-alignments-multiz-2009_10.tar.bz2 | tar xjf -
ls | sed 's/\*//g' | parallel -C' ' 'ln -sf $HPC_WORK/polyphen-2.2.2/bin/{} $HPC_WORK/bin/{}'
cd polyphen-2.2.2
# set up BLAST/nrdb/PDB as described below
cd src
make
make install
cd -
./configure
cd bin
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./
cd -
The MLC/MULTIZ databases need to be extracted to $HOME and symbolically linked if the number of files exceeds 1 million (the limit on RDS). The following links are then necessary,
cd $HPC_WORK/polyphen-2.2.2
ln -s $HOME/polyphen-2.2.2/precomputed
cd ucsc/hg19/multiz
ln -s $HOME/polyphen-2.2.2/ucsc/hg19/multiz/precomputed
The availability of the MLC/MULTIZ databases makes the annotation considerably faster.
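Whether an extracted tree runs into the file-count limit can be checked before deciding where to keep it. The following is a minimal sketch; the function names count_files and exceeds_limit are hypothetical helpers, and the default of 1000000 simply mirrors the RDS limit quoted above.

```shell
# Count regular files under a directory tree.
# (File names containing newlines would be over-counted; good enough for a rough check.)
count_files() {
  find "$1" -type f -print | wc -l | tr -d ' '
}

# usage: exceeds_limit DIR [LIMIT]
# Returns 0 (true) when DIR holds more than LIMIT files; LIMIT defaults to 1000000.
exceeds_limit() {
  [ "$(count_files "$1")" -gt "${2:-1000000}" ]
}
```

For example, `exceeds_limit $HOME/polyphen-2.2.2/precomputed && echo "keep under \$HOME and symlink"` prints the reminder only when the limit is exceeded.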
The configure command creates files in config/ which can be changed manually. There is also a user's guide. The rsync line obtains programs such as twoBitToFa, as required by the example below.
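Before running the examples it may help to confirm that the required programs are actually reachable. This is a small sketch, assuming the relevant bin directories have been added to PATH; missing_tools is a hypothetical helper name, not part of PolyPhen-2.

```shell
# Print the name of every listed program that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

A call such as `missing_tools twoBitToFa run_pph.pl run_weka.pl mapsnps.pl` lists whichever of these still need to be installed or linked.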
BLAST and nrdb can be set up as follows,
rmdir blast
ln -sf /usr/local/Cluster-Apps/blast/2.4.0 blast
cd nrdb
wget -qO- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref100/uniref100.fasta.gz | \
gunzip -c > uniref100.fasta
../update/format_defline.pl uniref100.fasta >uniref100-formatted.fasta
../blast/bin/makeblastdb -in uniref100-formatted.fasta -dbtype prot -out uniref100 -parse_seqids
rm -f uniref100.fasta uniref100-formatted.fasta
and PDB as follows,
rsync -rltv --delete-after --port=33444 \
rsync.wwpdb.org::ftp/data/structures/divided/pdb/ wwpdb/divided/pdb/
rsync -rltv --delete-after --port=33444 \
rsync.wwpdb.org::ftp/data/structures/all/pdb/ wwpdb/all/pdb/
The installation can then be tested as follows,
cd $HPC_WORK/polyphen-2.2.2
run_pph.pl sets/test.input 1>test.pph.output 2>test.pph.log
run_weka.pl test.pph.output >test.humdiv.output
run_weka.pl -l models/HumVar.UniRef100.NBd.f11.model test.pph.output >test.humvar.output
sdiff test.humdiv.output sets/test.humdiv.output
sdiff test.humvar.output sets/test.humvar.output
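For a quick pass/fail verdict rather than a side-by-side listing, the comparison can be wrapped as below. This is only a sketch: compare_output is a hypothetical helper, and byte-for-byte identity is an assumption that may be stricter than needed if the local databases differ from those used to produce the reference files.

```shell
# Compare a generated file with its reference and print a one-line verdict.
# Returns 0 when the files are identical and 1 otherwise.
compare_output() {
  if cmp -s "$1" "$2"; then
    echo "OK: $1 matches $2"
  else
    echo "DIFF: $1 differs from $2"
    return 1
  fi
}
```

For instance, `compare_output test.humdiv.output sets/test.humdiv.output` and `compare_output test.humvar.output sets/test.humvar.output` cover the two checks above.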
Now we turn to a genomic SNP query example, with snps.pph.list containing the following line,
chr1:154426970 A/C
which is passed to mapsnps.pl and the subsequent scripts,
mapsnps.pl -g hg19 -m -U -y snps.pph.input snps.pph.list 1>snps.pph.features 2>snps.log
run_pph.pl snps.pph.input 1>snps.pph.output 2>snps.pph.log
run_weka.pl snps.pph.output >snps.humdiv.output
run_weka.pl -l models/HumVar.UniRef100.NBd.f11.model snps.pph.output >snps.humvar.output
These commands produce snps.pph.input, snps.pph.features, snps.pph.output, snps.humdiv.output and snps.humvar.output.
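A malformed query line is easier to catch before the pipeline starts. The sketch below checks a list against the simple "chrN:position REF/ALT" shape of the example line; the pattern is an assumption inferred from that single example and does not cover everything mapsnps.pl may accept, and snp_list_ok is a hypothetical helper name.

```shell
# Return 0 when every line of the file looks like "chr1:154426970 A/C",
# and 1 as soon as any line does not match.
# The pattern is inferred from one example line, not from the mapsnps.pl documentation.
snp_list_ok() {
  ! grep -q -E -v '^chr([0-9]+|[XYM]):[0-9]+[[:space:]]+[ACGT]/[ACGT]$' "$1"
}
```

It can be used as `snp_list_ok snps.pph.list || echo "check snps.pph.list"` ahead of the mapsnps.pl call.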