CookHLA

Web: https://github.com/WansonChoi/CookHLA

Installation

According to the GitHub page above, the following steps are required.

# 0. CookHLA
git clone https://github.com/WansonChoi/CookHLA.git
cd CookHLA
# 1. PLINK 1.9
module load plink-1.9-gcc-5.4.0-sm3ojoi
mkdir dependency
cd dependency
# 2. mach1
wget -qO- https://csg.sph.umich.edu/abecasis/MACH/download/mach.1.0.18.Linux.tgz | \
tar xvfz -
# 3. beagle 5.1
wget https://faculty.washington.edu/browning/beagle/beagle.18May20.d20.jar -O beagle.jar
# 4. beagle utilities
wget https://faculty.washington.edu/browning/beagle_utilities/beagle2linkage.jar
wget https://faculty.washington.edu/browning/beagle_utilities/beagle2vcf.jar
wget https://faculty.washington.edu/browning/beagle_utilities/linkage2beagle.jar
wget https://faculty.washington.edu/browning/beagle_utilities/vcf2beagle.jar
wget https://faculty.washington.edu/browning/beagle_utilities/transpose.jar
# 6. Python
source ~/COVID-19/py37/bin/activate
pip install pyliftover==0.4

where ~/COVID-19/py37 is our Python virtual environment under Python version 3.7, and pyliftover is the Python liftover package.

Nevertheless, the dependency directory already contains plink, beagle5.jar (can be renamed into beagle.jar), mach1 and all BEAGLE utilities.

Example

The example mirrors those in SNP2HLA1,

# Adaptive Genetic Map
python -m MakeGeneticMap \
       -i example/1958BC.hg19 \
       -hg 19 \
       -ref 1000G_REF/1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC \
       -o work/1958BC+1000G_REF.EUR

# Imputation
python CookHLA.py \
       -i example/1958BC.hg19 \
       -hg 19 \
       -o work/1958BC+HM_CEU_REF \
       -ref example/HM_CEU_REF \
       -gm example/AGM.1958BC+HM_CEU_REF.mach_step.avg.clpsB \
       -ae example/AGM.1958BC+HM_CEU_REF.aver.erate \
       -mem 20g \
       -mp 8

The imputation gives 1958BC+HM_CEU_REF.MHC.HLA_IMPUTATION_OUT .alleles and .hped files, which is handled with HATK2 and HLA-TAPAS3. The software also takes output from HIBAG4, among others.

1000Genomes

We are rather tempted to use these from CookHLA with SNP2HLA, and follow the footnote on SNP2HLA we have,

csh SNP2HLA.csh 1958BC 1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC plink
plink --noweb --dosage 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage noheader format=1 \
      --fam 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.fam \
      --linear --out 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage.assoc

we obtain the screen output,

SNP2HLA: Performing HLA imputation for dataset 1958BC
- Java memory = 2000Mb
- Beagle window size = 1000 markers
[1] Extracting SNPs from the MHC.
[2] Performing SNP quality control.
[3] Convering data to beagle format.
[4] Performing HLA imputation (see 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.bgl.log for progress).
[5] Converting posterior probabilities to PLINK dosage format.
[6] Converting imputation genotypes to PLINK .ped format.
DONE!

PLINK v1.90p 64-bit (8 Nov 2021)               www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage.assoc.log.
Options in effect:
  --dosage 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage noheader format=1
  --fam 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.fam
  --linear
  --noweb
  --out 1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage.assoc

Note: --dosage automatically performs a regression; --linear/--logistic has no
additional effect.
Note: --noweb has no effect since no web check is implemented yet.
257130 MB RAM detected; reserving 128565 MB for main workspace.
10 people (9 males, 1 female) loaded from .fam.
10 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
10 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 10 are controls.
--dosage: Reading from
1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage.
--dosage: Results saved to
1958BC_IMPUTED_1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.dosage.assoc.assoc.dosage

References

Choi W, Luo Y, Raychaudhuri S & Han B HATK: HLA analysis toolkit. Bioinformatics 37, 416-418 (2020).

Cook S, et al. Accurate imputation of human leukocyte antigens with CookHLA. Nat Comm 12, 1264 (2021).

Degenhardt F, et al. Construction and benchmarking of a multi-ethnic reference panel for the imputation of HLA class I and II alleles. Hum Mol Genet 28, 2078-2092 (2018).

Jia X, et al. Imputing Amino Acid Polymorphisms in Human Leukocyte Antigens. PLOS ONE 8, e64683 (2013).

Immuno Polymorphism Database-international ImMunoGeneTics project (IMGT) (IPD-IMGT/HLA), https://www.ebi.ac.uk/ipd/imgt/hla/

Luo, Y, et al. A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet 53, 1504–1516 (2021), https://doi.org/10.1038/s41588-021-00935-7.

WHO Committe. Nomenclature for Factors of the HLA System, http://hla.alleles.org/.

Zheng X, et al. HIBAG—HLA genotype imputation with attribute bagging. The Pharmacogenomics J 14, 192-200 (2014), HLARES.


  1. SNP2HLA

    Web: SNP2HLA v1.0.3 (BEAGLE utitlities)

    Installation

    It turns out to be difficult to download beagle 3.0.4 as indicated so it is included in the files/ directory (beagle 3.0.4, documentation, example).

    wget -qO- https://software.broadinstitute.org/mpg/snp2hla/data/SNP2HLA_package_v1.0.3.tar.gz | \
    tar xvfz -
    cd SNP2HLA_package_v1.0.3/SNP2HLA
    # add beagle2linkage.jar as above
    # test.sh is adapted from SNP2HLA.csh by removing argument checking and as an executable.
    # The .dos file described in README.txt is actually the .dosage file generated internally from test.sh
    # The association statistics will be in .assoc.assoc.dosage; the double .assoc guarantees an .assoc.log file.
    csh SNP2HLA.csh 1958BC HM_CEU_REF 1958BC_IMPUTED plink
    csh ParseDosage.csh 1958BC_IMPUTED.bgl.gprobs > 1958BC_IMPUTED.bgl.dos
    plink --noweb --dosage 1958BC_IMPUTED.dosage noheader format=1 --fam 1958BC_IMPUTED.fam --logistic --out 1958BC_IMPUTED.assoc
    

    As indicated, we call the .csh scripts with csh which is actually tsch on csd3.

    Example

    The screen output is as follows,

      SNP2HLA: Performing HLA imputation for dataset 1958BC
    - Java memory = 2000Mb
    - Beagle window size = 1000 markers
    [1] Extracting SNPs from the MHC.
    [2] Performing SNP quality control.
    [3] Convering data to beagle format.
    [4] Performing HLA imputation (see 1958BC_IMPUTED.bgl.log for progress).
    [5] Converting posterior probabilities to PLINK dosage format.
    [6] Converting imputation genotypes to PLINK .ped format.
    DONE!
    

    The output is 1958BC_IMPUTED in PLINK binary format.

    MakeReference

    We first create MakeReference.tcsh such that calls to .pl scripts are prefixed with perl, so that

    diff MakeReference.csh MakeReference.tcsh
    
    77c77
    <     ./HLAtoSequences.pl $HLA_DATA HLA_DICTIONARY_AA.txt AA > $OUTPUT.AA.ped
    ---
    >     perl HLAtoSequences.pl $HLA_DATA HLA_DICTIONARY_AA.txt AA > $OUTPUT.AA.ped
    81c81
    <     ./encodeVariants.pl $OUTPUT.AA.ped $OUTPUT.AA.map $OUTPUT.AA.CODED
    ---
    >     perl encodeVariants.pl $OUTPUT.AA.ped $OUTPUT.AA.map $OUTPUT.AA.CODED
    93c93
    <     ./encodeHLA.pl $HLA_DATA $OUTPUT.HLA.map > $OUTPUT.HLA.ped
    ---
    >     perl encodeHLA.pl $HLA_DATA $OUTPUT.HLA.map > $OUTPUT.HLA.ped
    100c100
    <     ./HLAtoSequences.pl $HLA_DATA HLA_DICTIONARY_SNPS.txt SNPS > $OUTPUT.SNPS.ped
    ---
    >     perl HLAtoSequences.pl $HLA_DATA HLA_DICTIONARY_SNPS.txt SNPS > $OUTPUT.SNPS.ped
    104c104
    <     ./encodeVariants.pl $OUTPUT.SNPS.ped $OUTPUT.SNPS.map $OUTPUT.SNPS.CODED
    ---
    
    tcsh MakeReference.tcsh HAPMAP_CEU HAPMAP_CEU_HLA.ped HM_CEU_REF plink
    
    tcsh MakeReference.tcsh HAPMAP_CEU HAPMAP_CEU_HLA.ped HM_CEU_REF plink
    Creating reference panel: HM_CEU_REF
    [1] Generating amino acid sequences from HLA types.
    [2] Encoding amino acids positions.
    [3] Encoding HLA alleles.
    [4] Generating DNA sequences from HLA types.
    [5] Encoding SNP positions.
    [6] Extracting founders.
    [7] Merging SNP, HLA, and amino acid datasets.
    Warning: Variants 'SNP_A_30018350' and 'AA_A_-11_30018350' have the same
    position.
    Warning: Variants 'SNP_A_30018461_G' and 'SNP_A_30018461_A' have the same
    position.
    Warning: Variants 'SNP_A_30018461_T' and 'SNP_A_30018461_G' have the same
    position.
    987 more same-position warnings: see log file.
    [8] Performing quality control.
    [9] Preparing files for Beagle.
    [10] Converting to beagle format.
    [11] Phasing reference using Beagle (see progress in HM_CEU_REF.bgl.log).
    [12] Done.
    

  2. HATK (HLA Analysis Toolkit)

    Web: https://github.com/WansonChoi/HATK (docs)

    Installation

    git clone https://github.com/WansonChoi/HATK
    

    Example

    Scripts extracted from the documentation example is hatk.sh,

    In particular,

    python3 HATK.py \
            --variants example/wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18 \
            --hped example/wtccc_filtered_58C_RA.hatk.300+300.hped \
            --2field \
            --pheno example/wtccc_filtered_58C_RA.hatk.300+300.phe \
            --pheno-name RA \
            --out work/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18 \
            --imgt 3320 \
            --hg 18 \
            --imgt-dir example/IMGTHLA3320 \
            --multiprocess 2
    

    where --variants reads in the genotype files and --hped the .hped file to be followed by specification of the RA phenotype in a logistic regression. Note that the example is more desirable compared to the toy data in SNP2HLA given its 600 samples and 29,373 variants.

    We observe screen output as follows,

    Namespace(Ggroup=False, HLA=None, NoCaption=False, Pgroup=False, aa=None, ar=None, bmarkergenerator=False, chped=None, condition=None, condition_list=None, covar=None, covar_name=None, dict_AA=None, dict_SNPS=None, fam=None, fourF=False, hat=None, heatmap=False, hg='18', hla2hped=False, hped='example/wtccc_filtered_58C_RA.hatk.300+300.hped', imgt='3320', imgt2seq=False, imgt_dir='example/IMGTHLA3320', input=None, leave_NotFound=False, logistic=False, manhattan=False, maptable=None, metaanalysis=False, multiprocess=2, no_indel=False, nomencleaner=False, omnibus=False, oneF=False, out='MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18', phased=None, pheno='example/wtccc_filtered_58C_RA.hatk.300+300.phe', pheno_name='RA', platform=None, point_color='#778899', point_size='15', reference_allele=None, rhped=None, s1_bim=None, s1_logistic_result=None, s2_bim=None, s2_logistic_result=None, save_intermediates=False, threeF=False, top_color='#FF0000', twoF=True, variants='example/wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18', yaxis_unit='10')
    
    [IMGT2Seq.py]: Multiprocessing.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA A.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA B.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA C.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA DPA1.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA DPB1.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA DQA1.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA DQB1.
    
    [ProcessIMGT.py]: Generating sequence information dictionary for HLA DRB1.
    
    [HLA_Study.py]: IMGT2Seq result :
    < IMGT2Sequence(Newly generated.) >
    - HLA Amino Acids : MyHLAStudy/HLA_DICTIONARY_AA.hg18.imgt3320
    - HLA SNPs : MyHLAStudy/HLA_DICTIONARY_SNPS.hg18.imgt3320
    - HLA Allele Table : MyHLAStudy/HLA_ALLELE_TABLE.imgt3320.hat
    - Maptables for heatmap :
       A   : MyHLAStudy/HLA_MAPTABLE_A.hg18.imgt3320.txt
       B   : MyHLAStudy/HLA_MAPTABLE_B.hg18.imgt3320.txt
       C   : MyHLAStudy/HLA_MAPTABLE_C.hg18.imgt3320.txt
       DPA1: MyHLAStudy/HLA_MAPTABLE_DPA1.hg18.imgt3320.txt
       DPB1: MyHLAStudy/HLA_MAPTABLE_DPB1.hg18.imgt3320.txt
       DQA1: MyHLAStudy/HLA_MAPTABLE_DQA1.hg18.imgt3320.txt
       DQB1: MyHLAStudy/HLA_MAPTABLE_DQB1.hg18.imgt3320.txt
       DRB1: MyHLAStudy/HLA_MAPTABLE_DRB1.hg18.imgt3320.txt
    
    
    [HLA_Study.py]: Given HPED file('example/wtccc_filtered_58C_RA.hatk.300+300.hped') is to be processed by NomenCleaner.
    
    [NomenCleaner.py]: Generating CHPED with Maximum 2 fields HLA alleles.
    
    [bMarkerGenerator.py]: Making Reference Panel for "MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18"
    
    [1] Generating Amino acid(AA)sequences from HLA types.
    [2] Encoding Amino acids positions.
    [3] Encoding HLA alleles.
    [4] Generating DNA(SNPS) sequences from HLA types.
    [5] Encoding SNP positions.
    [6] Extracting founders.
    [7] Merging SNP, HLA, and amino acid datasets.
    Warning: Variants 'HLA_A*02:01' and 'HLA_A*01:01' have the same position.
    Warning: Variants 'HLA_A*02:02' and 'HLA_A*02:01' have the same position.
    Warning: Variants 'HLA_A*02:05' and 'HLA_A*02:02' have the same position.
    2924 more same-position warnings: see log file.
    [8] Performing quality control.
    [9] Making reference panel for HLA-AA,SNPS,HLA and Normal variants(SNPs) is Done!
    
    [HLA_Study.py]: bMarkerGenerator result(Prefix) :
    MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18
    
    [HLA_Study.py]: Logistic Regression result :
    MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.assoc.logistic
    
    [HLA_Study.py]: Manhattan result :
    MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.manhattan.pdf
    
    [HLA_Study.py]: Heatmap result :
     A : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_A.heatmap.pdf
     B : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_B.heatmap.pdf
     C : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_C.heatmap.pdf
     DPA1 : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_DPA1.heatmap.pdf
     DPB1 : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_DPB1.heatmap.pdf
     DQA1 : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_DQA1.heatmap.pdf
     DQB1 : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_DQB1.heatmap.pdf
     DRB1 : MyHLAStudy/RESULT_EXAMPLE_wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18.HLA_DRB1.heatmap.pdf
    

    We proceed with the imputation with the 1000Genomes panel provided with CookHLA on a real project,

    export caprion=~/Caprion/analysis/HLA/CookHLA
    export hatk=${HPC_WORK}/HATK
    export cookhla=${HPC_WORK}/CookHLA
    
    cd ${cookhla}
    python CookHLA.py \
           -i ${hatk}/example/wtccc_filtered_58C_RA.hatk.300+300.chr6.hg18 \
           -hg 18 \
           -o ${hatk}/work/hla_CookHLA \
           -ref ${cookhla}/1000G_REF/1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC \
           -gm ${caprion}/hla_IMPUTED.mach_step.avg.clpsB \
           -ae ${caprion}/hla_IMPUTED.aver.erate \
           -mem 20g \
           -mp 8
    cd -
    

  3. HLA-TAPAS (HLA-Typing At Protein for Association Studies)

    Web: https://github.com/immunogenomics/HLA-TAPAS

    Installation

    git clone https://github.com/immunogenomics/HLA-TAPAS
    

    The association tests require R packages, argparse, multidplyr, rcompanion, and their dependencies. The OMNIBUS analysis should be OMNIBUS_LOGISTIC or OMNIBUS_LINEAR instead.

    Example

    python HLA-TAPAS.py \
           --target example/Case+Control.300+300.chr6.hg18 \
           --reference example/1000G.EUR.chr6.hg18.28mb-35mb \
           --hped-Ggroup example/1000G.EUR.Ggroup.hped \
           --pheno example/Case+Control.300+300.phe \
           --hg 18 \
           --out MyHLA-TAPAS/Case+Control+1000G_EUR_REF \
           --mem 4g \
           --nthreads 4
    

    A collection of documentation example is hla-tapas.sb.

  4. HIBAG

    Web: https://hibag.s3.amazonaws.com/index.html (Bioconductor)

    Installation

    Rscript -e 'BiocManager:install("HIBAG")'
    

    Examples

    See Amazon but most are available from GitHub.