hail
Web: https://hail.is/, which also covers running Hail on the cloud.
Preparations
We will need to set up a virtual environment as follows,
module load python/3.7 hadoop/2.7.7
virtualenv py37
source py37/bin/activate
In later sessions, only two of the commands are necessary, i.e.,
module load python/3.7 hadoop/2.7.7
source py37/bin/activate
Installation
We proceed with,
pip install hail
pip install gnomAD
pip install pyensembl
pip install varcode
pyensembl install --release 99 --species human
where gnomAD, pyensembl (https://github.com/openvax/pyensembl) and varcode (https://github.com/openvax/varcode) are optional. We can then invoke Python as follows,
python
>>> import hail as hl
>>> mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
>>> mt.describe()
>>> mt.summarize()
>>> mt.count()
>>> mt.show()
>>> mt.write("hail")
>>> at = hl.import_vcf("INTERVAL.vcf.bgz")
>>> hl.export_vcf(at,"at.vcf.bgz")
Alternatively, we could also start a session using ipython.
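Since pyensembl and varcode are optional installs above, a minimal sketch of how they might be used for gene lookup and variant annotation is as follows; the contig, position, and alleles below are arbitrary placeholders rather than values taken from any data above,
python
>>> from pyensembl import EnsemblRelease
>>> from varcode import Variant
>>> ensembl = EnsemblRelease(99)  # the release installed earlier
>>> ensembl.gene_names_at_locus(contig="7", position=140753336)
>>> v = Variant(contig="7", start=140753336, ref="A", alt="T", ensembl=ensembl)
>>> v.effects().top_priority_effect()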
Tutorials
We can fetch the tutorials into the current directory,
wget -qO- https://hail.is/docs/0.2/tutorials.tar.gz | tar xzf -
and cut and paste code from the Hail website into our command-line session. A better experience can be achieved with a browser, which unfortunately does not work well on csd3 and requires some tweaking from elsewhere.
First, from a csd3 session we issue the commands,
hostname
jupyter notebook tutorials/ --ip=127.0.0.1 --no-browser --port 8081
In this instance the hostname is login-e-12, with additional information as follows,
[I 20:21:59.258 NotebookApp] Serving notebooks from local directory: /rds/project/jmmh2/rds-jmmh2-results/public/gwas/ukb_exomes/tutorials
[I 20:21:59.258 NotebookApp] The Jupyter Notebook is running at:
[I 20:21:59.258 NotebookApp] http://127.0.0.1:8081/?token=2d4ba02d3d0d782ad1dbafc73f96a3fc3d0bdff4573899c9
[I 20:21:59.258 NotebookApp] or http://127.0.0.1:8081/?token=2d4ba02d3d0d782ad1dbafc73f96a3fc3d0bdff4573899c9
[I 20:21:59.258 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 20:21:59.300 NotebookApp]
To access the notebook, open this file in a browser:
file:///home/jhz22/.local/share/jupyter/runtime/nbserver-235186-open.html
Or copy and paste one of these URLs:
http://127.0.0.1:8081/?token=2d4ba02d3d0d782ad1dbafc73f96a3fc3d0bdff4573899c9
or http://127.0.0.1:8081/?token=2d4ba02d3d0d782ad1dbafc73f96a3fc3d0bdff4573899c9
Second, from a console other than the csd3 session (e.g., from srcf, ds, or a Windows VirtualBox virtual machine), we access the csd3 session above,
ssh -4 -L 8081:127.0.0.1:8081 -fN login-e-12.hpc.cam.ac.uk
firefox http://127.0.0.1:8081/?token=2d4ba02d3d0d782ad1dbafc73f96a3fc3d0bdff4573899c9
and run the tutorial from firefox. A similar process is also described in the system/software section on Python.
For convenience, we list the code from the tutorials on the 1000Genomes data; see https://notebook.community/danking/hail/notebook/images/hail/resources/Hail-Workshop-Notebook.
import hail as hl
import os
from bokeh.io import show, output_notebook  # show() is used for the plots below
output_notebook()  # render bokeh plots inline when running inside a Jupyter notebook
dir=os.environ['HPC_WORK']+"/lib64/"
files=dir+"libblas.so:"+dir+"libcblas.so:"+dir+"liblapack.so"
print(files)
# os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
hl.init(spark_conf={"spark.executor.extraClassPath": files})
# Gene information
gene_ht = hl.import_table('tutorials/data/ensembl_gene_annotations.txt', impute=True)
gene_ht.show()
gene_ht.count()
gene_ht = gene_ht.transmute(interval = hl.locus_interval(gene_ht['Chromosome'],
gene_ht['Gene start'],
gene_ht['Gene end'],
reference_genome='GRCh37'))
gene_ht = gene_ht.key_by('interval')
# 1000Genomes data from the tutorials
sa = hl.import_table('tutorials/data/1kg_annotations.txt', impute=True, key='Sample')
sa.describe()
sa.show()
mt = hl.read_matrix_table('tutorials/data/1kg.mt')
mt = mt.annotate_rows(gene_info = gene_ht[mt.locus])
mt.s.show(5)
mt.locus.show(5)
hl.summarize_variants(mt)
mt = mt.annotate_cols(pheno = sa[mt.s])
mt = hl.sample_qc(mt)
mt.sample_qc.describe()
p = hl.plot.scatter(x=mt.sample_qc.r_het_hom_var,
y=mt.sample_qc.call_rate)
show(p)
mt = mt.filter_cols(mt.sample_qc.dp_stats.mean >= 4)
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.97)
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = (
hl.case()
.when(mt.GT.is_hom_ref(), ab <= 0.1)
.when(mt.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
.default(ab >= 0.9) # hom-var
)
mt = mt.filter_entries(filter_condition_ab)
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
mt = hl.variant_qc(mt)
mt.variant_qc.describe()
mt.variant_qc.AF.show()
mt = mt.filter_rows(hl.min(mt.variant_qc.AF) > 1e-6)
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 0.005)
mt = mt.annotate_rows(gene_info = gene_ht[mt.locus])
mt.gene_info.show()
burden_mt = (
mt
.group_rows_by(gene = mt.gene_info['Gene name'])
.aggregate(n_variants = hl.agg.count_where(mt.GT.n_alt_alleles() > 0))
)
burden_mt.describe()
pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)
pca_eigenvalues
pca_scores.describe()
pca_scores.scores[0].show()
pca_loadings.describe()
mt = mt.annotate_cols(pca = pca_scores[mt.s])
p = hl.plot.scatter(mt.pca.scores[0],
mt.pca.scores[1],
label=mt.pheno.SuperPopulation)
show(p)
# errors from here though hl.init() above
# https://discuss.hail.is/t/undefined-symbol-cblas-dgemm/1488
gwas = hl.linear_regression_rows(y=mt.pheno.CaffeineConsumption,
x=mt.GT.n_alt_alleles(),
covariates=[1.0])
gwas.describe()
p = hl.plot.manhattan(gwas.p_value)
show(p)
p = hl.plot.qq(gwas.p_value)
show(p)
gwas = hl.linear_regression_rows(
y=mt.pheno.CaffeineConsumption,
x=mt.GT.n_alt_alleles(),
covariates=[1.0, mt.pheno.isFemale, mt.pca.scores[0], mt.pca.scores[1], mt.pca.scores[2]])
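The burden_mt constructed above is not analysed further in the tutorial; a minimal sketch of a gene-based (burden) regression on it, assuming the same CaffeineConsumption phenotype and the column annotations inherited from mt, is
# Sketch only: regress the phenotype on the per-gene carrier counts aggregated in burden_mt;
# column fields such as pheno carry over from mt when rows are grouped by gene.
burden_results = hl.linear_regression_rows(
    y=burden_mt.pheno.CaffeineConsumption,
    x=burden_mt.n_variants,
    covariates=[1.0])
burden_results.order_by(burden_results.p_value).show()
which is subject to the same BLAS caveat noted above for hl.linear_regression_rows.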
Resources
Some files can be made available with gsutil installed, e.g.,
- The HGDP release 3.1, gs://gcp-public-data--gnomad/release/3.1; see also gs://gnomad-public/release.
- UK Biobank results, gs://hail-datasets/ukbb_imputed_v3_gwas_results_both_sexes.GRCh37.mt.
- Pan-ancestry genetic analysis of the UK Biobank, https://pan.ukbb.broadinstitute.org/docs/hail-format (Downloads).
- Exome-based association statistics, see the genebass section below.
- sparkhail, a 'Sparklyr' extension for 'Hail', https://cran.r-project.org/package=sparkhail.
genebass
Download
# genebass and csd3 locations
export src=gs://ukbb-exome-public/300k/results
export dest=~/rds/results/public/gwas/ukb_exomes
mkdir -p ${dest}
cd ${dest}
# Gene burden results
gsutil -m cp -r ${src}/results.mt .
# Single variant association results
gsutil -m cp -r ${src}/variant_results.mt .
Access
We could run the following script directly,
cd ~/rds/results/public/gwas/ukb_exomes
python <<END
import hail as hl
# Gene (burden) results
gr = hl.read_matrix_table('results.mt')
## Summaries
gr.count()
gr.describe()
gr.show()
gr.cols().show()
gr.rows().show()
gr.entry.take(5)
gr.summarize()
## filtering
FGR = gr.filter_rows(gr.gene_symbol=="FGR")
FGR1065 = FGR.filter_cols(FGR.coding == '1065')
FGR1065burden = FGR1065.select_entries(FGR1065.BETA_Burden)
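# A sketch (the output file name is illustrative): flatten the filtered burden betas
# into a Table with entries() and export them as a compressed TSV
FGR1065burden.entries().export('FGR1065_burden.tsv.bgz')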
## col operations
trait_gr = gr.cols().collect_by_key()
trait_gr.group_by('trait_type').aggregate(n_phenos=hl.agg.count()).show()
gr_cols = gr.cols()
gr_cols.count()
gr_cols.show(truncate=40, width=85)
gr_cols.phenocode.show()
gr_cols.filter(gr_cols.phenocode == '1777').show()
fields_to_drop = ['category', 'coding_description', 'description_more', 'inv_normalized',
'n_cases_both_sexes', 'n_cases_females', 'n_cases_males', 'saige_version']
gr_cols_reduced = gr_cols.drop(*fields_to_drop)
# Variant results
vr = hl.read_matrix_table('variant_results.mt')
hl.summarize_variants(vr)
vr.count()
vr.describe()
vr.select_cols().show(1)
vr.select_rows().show(1)
vr.locus.show(1)
co1777 = vr.filter_cols(vr.coding == "1777")
co1777.show()
co1777.entry
co1777.entry.take(5)
# IL12B
END
Reference
Karczewski, K.J. et al. Systematic single-variant and gene-based association testing of 3,700 phenotypes in 281,850 UK Biobank exomes. medRxiv, 2021.06.19.21259117 (2021).