haplogrep

Web: https://haplogrep.i-med.ac.at/ (haplogrep2) & https://haplogrep.readthedocs.io/en/latest/

3.2.2

Installation

# haplogrep3
wget https://github.com/genepi/haplogrep3/releases/download/v3.2.2/haplogrep3-3.2.2-linux.zip
unzip haplogrep3-3.2.2-linux.zip
haplogrep3 server --config haplogrep3.yaml

Haplogrep trees

Web: https://genepi.github.io/haplogrep-trees/.

# check available trees
haplogrep3 trees
# add a tree
haplogrep install-tree phylotree-rcrs@17.2
# show macrohaplogroups
haplogrep3 cluster-haplogroups --output macro-rcrs.txt --tree phylotree-rcrs@17.2

LD between nDNA-mtDNA

LD for two specific SNPs can be done with –snp option (c.f. –extract a list from files) as follows,

#!/usr/bin/bash

export autosomes=~/rds/post_qc_data/uk_biobank/genotype/genotype/affy_ukbiobank_array/QCd_data/
export mtdna=~/rds/post_qc_data/uk_biobank/mtdna/genotyped/genotyped/

plink --bfile $autosomes/QCd_Eur --snp 1:723307_C_G --make-bed --out 1
plink --bfile $mtdna/UKBB_recaled_clean_4753out --snp Affx-79504644 --make-bed --out 2
plink --bfile 1 --bmerge 2 --make-bed --out 12
plink --bfile 12 --ld 1:723307_C_G Affx-79504644

Example

1. Documentation data

haplogrep3 classify --in data/examples/example-microarray.vcf --out microarray.txt --tree=phylotree-rcrs@17.0
haplogrep3 classify --in data/examples/example-wgs.vcf --out wgs.txt --tree=phylotree-rcrs@17.0
haplogrep3 classify --in trees/phylotree-rcrs/17.2/rcrs.fasta --out rcrs.txt --tree phylotree-rcrs@17.2

2. vcf.gz

This enables handling vcf.gz via vcf.gz.py for large samples (N=358,916 here),

module load ceuadmin/haplogrep
export UKB=~/rds/post_qc_data/uk_biobank/mtdna/imputed/imputed
haplogrep3 classify --in $UKB/UKBB_UKBL_binary.vcf.gz --out UKBL_binary --tree=phylotree-rcrs@17.2
aource ~/rds/software/py3.11/bin/activate
python vcf.gz.py UKBB_shortIDs.vcf.gz sloan15_input.txt &
sed 's/"//g' UKBB_binary | \
awk '{ gsub(/[^a-zA-Z0-9]/,"", $2); print $1 "\t" $2 }' | \
sed 's/ \+/\t/'> clean_haplogroups.txt
sloan15.pl sloan15_input.txt clean_haplogroups.txt > sloan15.tsv
 Rscript -e '
 ld <- read.delim("sloan15.tsv",check=FALSE)
 summary(ld$r2)
'
deactivate

3. SLURM

The haplogrep3 step above is time-consuming and a SLURM job is used as follows,

#!/usr/bin/bash

#SBATCH --account PETERS-SL3-CPU
#SBATCH --partition icelake
#SBATCH --mem=28800
#SBATCH --time=12:00:00
#SBATCH --job-name=_hg
#SBATCH --output=hg.o
#SBATCH --error=hg.e

. /etc/profile.d/modules.sh
module purge
module load rhel8/default-icl

export TMPDIR=${HPC_WORK}/work

module load ceuadmin/haplogrep
export UKB=~/rds/post_qc_data/uk_biobank/mtdna/imputed/imputed
haplogrep3 classify --in $UKB/UKBB_UKBL_binary.vcf.gz --out UKBL_binary --tree=phylotree-rcrs@17.2

We see a fine-grid haplogroupings (UKBL_binary), with lengthy sample IDs shortened according to vcf.gz with renum.sh, are collapsed into macrohaplogroups as follows.

library(dplyr)

macro_map <- read.delim("macro-rcrs.txt", header = TRUE, quote = "\"", stringsAsFactors = FALSE)
hg <- read.delim("clean_haplogroups.txt", stringsAsFactors = FALSE)

hg2 <- hg %>%
  left_join(macro_map, by = c("Haplogroup" = "Haplogroup")) %>%
  rename(Macro = Super.Haplogroup)

hg2$Macro[is.na(hg2$Macro)] <- "Unmapped"

print(table(hg2$Macro))
print(round((prop.table(table(hg2$Macro))*100+0.5)/100, 2))
write.table(hg2, "haplogroups_with_macro.txt", sep = "\t", row.names = FALSE)

whose counts and proportions are obtained,

> print(table(hg2$Macro))

       B        C        D        F        G        H       HV        I
       8       30       35        9        7   153014     1217    12687
       J        K       L0       L1       L2       L3       L4       L5
   41237    30550       35        1       27       18        3       22
       M        N        P        Q        R        T        U Unmapped
      47      161        7        1       85    31210    36329    36737
       V        W        X
    3005     7293     5141

> print(round((prop.table(table(hg2$Macro))*100+0.5)/100, 2))

       B        C        D        F        G        H       HV        I
    0.01     0.01     0.01     0.01     0.01     0.43     0.01     0.04
       J        K       L0       L1       L2       L3       L4       L5
    0.12     0.09     0.01     0.01     0.01     0.01     0.01     0.01
       M        N        P        Q        R        T        U Unmapped
    0.01     0.01     0.01     0.01     0.01     0.09     0.11     0.11
       V        W        X
    0.01     0.03     0.02

and for $r^2$ we now have

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
0.000003 0.000285 0.001771 0.005593 0.007300 0.048756        7

4. gnomAD

This section is for note-keeping only.

# gnomAD 3.1 chrM.vcf.bgz has no samples
gunzip -c gnomad.genomes.v3.1.sites.chrM.vcf.bgz > chrM.vcf
haplogrep3 classify --in chrM.vcf --out haplogrep_output.txt --tree phylotree-rcrs@17.0
# hail dense/sparse MatrixTable -- conceptual steps, too large to be tested!
module load openjdk/11.0.12_7/gcc/czpuqhmv
python <<END
gsutil -m cp -r \
gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_sparse.mt .
mt = hl.read_matrix_table("gnomad.genomes.v3.1.2.hgdp_1kg_subset_sparse.mt")
mt = mt.filter_rows(mt.locus.contig == 'chrM')
hl.export_vcf(mt, "chrM.vcf.bgz")
mt_small = mt.sample_rows(0.1)  # Keep 10% of variants
mt_small = mt_small.sample_cols(0.01)  # Keep 1% of samples
hl.export_vcf(mt_small, "chrM_subset.vcf.bgz")
END

2.4.0

Installation

# cmd only
curl -sL haplogrep.now.sh | bash
# GitHub
wget -qO- https://github.com/seppinho/haplogrep-cmd/archive/refs/tags/v2.4.0.tar.gz | tar xfz -
cd haplogrep-cmd-2.4.0/
bash install/github-downloader.sh
# test
wget https://github.com/seppinho/haplogrep-cmd/raw/master/test-data/vcf/HG00097.vcf.gz
./haplogrep classify --in HG00097.vcf.gz --format vcf --out haplogroups.txt

Example

The following examples concern about LDs on mtDNA and nDNA with mtDNA, which offer glimpse of findings from sloan, et al. (2015), https://royalsocietypublishing.org/doi/suppl/10.1098/rspb.2015.1704 as well as a generic implementation via PLINK.

This is also the documentation example (VCF), for which we have $0.019 \ge r^2 \le 0.024$.

python3 < hgdp.py
module load ceuadmin/haplogrep/2.4.0
haplogrep classify --in data/examples/example-wgs.vcf --out hg.txt --format=vcf
# suppress double quote, asterisk and plus ("*+)
sed -i 's/"//g' hg.txt
awk '{ gsub(/[^a-zA-Z0-9]/,"", $2); print $1 "\t" $2 }' hg.txt > hg_simple_clean.txt
sloan15.pl sloan15_input.txt hg_simple_clean.txt > sloan15.tsv
Rscript -e '
 ld <- read.delim("sloan15.tsv",check=FALSE)
 summary(ld)
'

where hgdp.py is used to reformat the data to a required format by sloan15.pl.