bcftools

Web: http://www.htslib.org/download/

1.20

Several plugins are now available, see https://github.com/freeseek/score.

if [ ! -d $CEUADMIN/bcftools/1.20 ]; then mkdir -p $CEUADMIN/bcftools/1.20; fi
cd $CEUADMIN/bcftools/1.20
export PERL5LIB=
wget -qO- https://github.com/samtools/bcftools/releases/download/1.20/bcftools-1.20.tar.bz2 | \
tar xjf -
cd bcftools-1.20/
configure --prefix=$CEUADMIN/bcftools/1.20
make
make install
cd -
wget -P bin https://raw.githubusercontent.com/freeseek/score/master/assoc_plot.R
chmod a+x bin/assoc_plot.R
mkdir score && cd score
wget https://software.broadinstitute.org/software/score/score_1.20-20240505.zip
cd -

The setup of bcftools +liftover is detailed here,

module load bwa
module load ceuadmin/samtools
export public_databases=/rds/project/rds-4o5vpvAowP0
export TMPDIR=/rds/user/$USER/work

# GRCh37 human genome reference, cytoband and chain file
cd $public_databases/GRCh37_reference_fasta
wget https://hgdownload.cse.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz
gunzip hs1.fa.gz
cd $public_databases/dbsnp
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | \
gzip -d > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta
bwa index human_g1k_v37.fasta
wget -P hg19 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg19.over.chain.gz

# GRCh38 human genome reference, cytoband and chain files
export url=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
wget -O- $url/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
bwa index GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
wget -P hg38 http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg38.over.chain.gz
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz
wget http://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz
wget https://hgdownload.cse.ucsc.edu/gbdb/hg38/liftOver/hg38ToHs1.over.chain.gz
wget https://hgdownload.cse.ucsc.edu/goldenPath/hs1/liftOver/hs1ToHg38.over.chain.gz

The bwa will be killed from an interactive session, so needs to be replaced,

#!/bin/bash

#SBATCH --job-name=_bwa
#SBATCH --account=PETERS-SL3-CPU
#SBATCH --partition=icelake-himem
#SBATCH --mem=28800
#SBATCH --time=12:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=bwa.o
#SBATCH --error=bwa.e

. /etc/profile.d/modules.sh
module purge
module load rhel8/default-icl

export TMPDIR=${HPC_WORK}/work

module load bwa
bwa index human_g1k_v37.fasta
bwa index GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

in a named file such as bwa.sb and executed with sbatch bwa.sb.

ceuadmin/qcftools/1.20

The module can be enabled in several ways,

export BCFTOOLS_PLUGINS=$CEUADMIN/bcftools/1.20/score && bcftools +score
export BCFTOOLS_PLUGINS=$CEUADMIN/bcftools/1.20/score && bcftools plugin score
bcftools +$BCFTOOLS_PLUGINS/score.so
bcftools plugin $BCFTOOLS_PLUGINS/score.so

Its use is shown as follows,

module load ceuadmin/bcftools/1.20
bcftools plugin --list
bcftools --version
bcftools +score
bcftools +munge
bcftools +liftover
bcftools +pgs
bcftools +blup
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi
bcftools +liftover --no-version \
 -Ou $public_databases/dbsnp/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz -- \
  -s $public_databases/dbsnp/human_g1k_v37.fasta \
  -f $public_databases/dbsnp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -c $public_databases/dbsnp/hg18ToHg38.over.chain.gz \
  --reject ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.reject.bcf \
  --reject-type b \
  --write-src | \
bcftools sort -o ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.hg38.bcf -Ob --write-index

and a 1kGP_high_coverage_Illumina.sites.bcf uses SLURM script site.sb (based on author of bcftools/liftover),

An example application

The data used here is from ensembl-vep/examples,

#!/usr/bin/bash

module load ceuadmin/bcftools/1.20

export public_databases=/rds/project/rds-4o5vpvAowP0
cp -p $public_databases/ensembl-vep/examples/homo_sapiens_GRCh3?.vcf .

# GRCh37
sed  -i '1!s/^21/chr21/' homo_sapiens_GRCh37.vcf
sed  -i '1!s/^22/chr22/' homo_sapiens_GRCh37.vcf
bgzip homo_sapiens_GRCh37.vcf
tabix -S1 -s1 -b2 -e2 homo_sapiens_GRCh37.vcf.gz
# GRCh38
sed  -i '1!s/^21/chr21/' homo_sapiens_GRCh38.vcf
sed  -i '1!s/^22/chr22/' homo_sapiens_GRCh38.vcf
bgzip homo_sapiens_GRCh38.vcf
tabix -S1 -s1 -b2 -e2 homo_sapiens_GRCh38.vcf.gz

bcftools norm --no-version -Ou -m+ homo_sapiens_GRCh37.vcf.gz | \
bcftools +liftover --no-version -Ou -- \
  -s $public_databases/dbsnp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -f $public_databases/GRCh37_reference_fasta/hs1.fa \
  -c $public_databases/dbsnp/hg19ToHg38.over.chain.gz | \
bcftools sort -o homo_sapiens_GRCh38.hs1.bcf -Ob --write-index
sdiff -s <(bcftools query --format "%CHROM\t%POS\n" homo_sapiens_GRCh38.hs1.bcf -r chr21) \
         <(bcftools query --format "%CHROM\t%POS\n" homo_sapiens_GRCh38.vcf.gz -r chr21)

where there are two notable aspects:

  • We first change chromosome names from 21. 22 to chr21, chr22. The bcftools liftover plugin generates a .bcf file which is used to contrast with the provided example; since the two files all have the same coordinates we don't see any output.
  • Should indels be split into SNVs, then bcftools norm -m+ would rebuild them for the liftover, after which they could also be split again into bi-allelic variants.

Notes on coupling

They will be useful for compiling from source and it is easier to use conda for an end user,

# cholmod for pgs
# conda install bcftools 'libopenblas=*=*openmp*' suitesparse
# conda install mkl llvm-openmp
wget -P bcftools-1.20 https://raw.githubusercontent.com/DrTimothyAldenDavis/SuiteSparse/stable/{SuiteSparse_config/SuiteSparse_config,CHOLMOD/Include/cholmod}.h
cd bcftools-1.20/
/bin/rm -f score/{score.{c,h},{munge,liftover,metal,blup}.c,pgs.{c,mk}}
wget -P score https://raw.githubusercontent.com/freeseek/score/master/{score.{c,h},{munge,liftover,metal,blup}.c,pgs.{c,mk}}
# remove pgs if necessary
# /bin/rm score/pgs.{c,mk}
make
# /bin/cp bcftools score/{munge,liftover,score,metal,pgs,blup}.so bin

1.12

On CSD3, module avail bcftools gives a list of versions but none can query data from the web.

However, the installation is straightforward.

wget -qO- https://github.com/samtools/bcftools/releases/download/1.12/bcftools-1.12.tar.bz2 | \
tar xjf -
cd bcftools-1.12/
./configure --prefix=${HPC_WORK}
make
make install
bcftools --version

According to documentation, the configuration could slightly be more complicated with

export BCFTOOLS_PLUGINS=${HPC_WORK}/bcftools-1.12/plugins
autoheader && autoconf && ./configure --enable-libgsl --enable-perl-filters --prefix=/usr/local/Cluster-Apps/ceuadmin/bcftools/1.12
make
make install

The data query example as in tabix is quoted here.

bcftools query -f '%ID\t%ALT\t%REF\t%AF\t[%ES]\t[%SE]\t[%LP]\t[%SS]\t%CHROM\t%POS\n' -r 1:1-1000000 \
               https://gwas.mrcieu.ac.uk/files/ebi-a-GCST010776/ebi-a-GCST010776.vcf.gz

A number of other options can be used together with -r or -R.

We can also obtain data with the header (-H)

export f=ebi-a-GCST010776
bcftools query -f '%ID\t%ALT\t%REF\t%AF\t[%ES]\t[%SE]\t[%LP]\t[%SS]\t%CHROM\t%POS\n' \
               -H \
               -r 1:1-1000000 \
               https://gwas.mrcieu.ac.uk/files/${f}/${f}.vcf.gz | \
               sed "s/\[[0-9]*\]//g;s/^[\#] //;s/${f}://g" > chr1:1-1000000.dat