GWAS Catalog

Web: https://www.ebi.ac.uk/gwas/deposition (doc, email, format)

European Life Science Research Infrastructure Login, Contact: support@aai.lifescience-ri.eu, Homepage: https://lifescience-ri.eu/ls-login/

As with the DNAnexus entry, which covers a range of tools for UK Biobank analysis and reuses an existing setup, the software collected here relates to data submission to the GWAS Catalog and reuses the environment provided by ceuadmin/snakemake, i.e.,

module load ceuadmin/snakemake

to save space.

I. gwas-sumstats-tools

GitHub: https://github.com/EBISPOT/gwas-sumstats-tools

Installation

pip3 install gwas-sumstats-tools

where we borrow the snakemake setup, whose Python 3.11.0 satisfies the requirement (>=3.9.0).
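The version requirement can be checked before installing. The following is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper name, not part of gwas-sumstats-tools.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A >= B (relies on GNU `sort -V`)
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=3.9.0   # minimum Python version stated above

# Only query Python when it is on the PATH (e.g. after loading the module)
if command -v python3 >/dev/null 2>&1; then
  current=$(python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])')
  if version_ge "$current" "$required"; then
    echo "Python $current satisfies >=$required"
  else
    echo "Python $current is older than $required" >&2
  fi
fi
```

Comparing with `sort -V` rather than as strings avoids ranking 3.11.0 below 3.9.0.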

Usage

The built-in help is the practical starting point; the read and validate commands are used in the application below.

gwas-ssf --help

II. Globus

Web: https://www.globus.org/globus-connect-personal (CLI)

wget -qO- https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz | \
tar xvfz -
cd globusconnectpersonal-3.2.2  # the directory name follows the unpacked release, 3.2.2 here
# ./globusconnectpersonal
./globusconnectpersonal -setup --no-gui
# CLI
pip3 install globus-cli
globus list-commands
globus login
globus whoami
globus session show
globus transfer --help
globus logout


where we again use the setup for snakemake.

We then build a module, enabled with module load ceuadmin/globusconnectpersonal/3.2.2, after which globusconnect as well as globus can be run directly.

It is desirable to have a web browser at hand; on CSD3, the closest counterpart comes with the ceuadmin/Cytoscape/3.9.1 module.

III. Application: SCALLOP-INF sumstats submission

Web: https://jinghuazhao.github.io/INF/

Reformatting and indexing

The documented example¹ is shown here,

chromosome	base_pair_location	effect_allele	other_allele	beta	standard_error	effect_allele_frequency	p_value	variant_id	rsid
1	869388	A	G	-0.016619	0.00806496	0.997221	0.1	1_869388_A_G	NA
1	205811055	C	T	-0.0089589	0.00331941	0.983589	9.7E-03	1_205811055_C_T	rs74143854
2	70478797	T	TG	0.0187528	0.00167685	0.934121	3.5E-30	2_70478797_T_TG	rs142640435
2	27875036	TAAA	T	-0.0184003	0.00101051	0.78451	5.7E-76	2_27875036_TAAA_T	rs774624803
23	24145170	A	G	0.00387762	0.08757958	0.627178	2.3E-08	23_24145170_A_G	rs5949232
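A quick way to confirm that a formatted file follows the column order of the example above is to compare its header line with the expected names. This is only a sketch: `check_header` and the toy file are hypothetical, and the check covers the ten columns shown in the example rather than the full standard, for which `gwas-ssf validate` remains the reference.

```shell
#!/bin/sh
# Expected column order, copied from the example header above
expected='chromosome base_pair_location effect_allele other_allele beta standard_error effect_allele_frequency p_value variant_id rsid'

# check_header FILE: succeed if the first ten tab-separated column names of
# the gzip-compressed FILE match the expected names in order
check_header() {
  header=$(gunzip -c "$1" | head -n 1 | cut -f1-10 | tr '\t' ' ')
  [ "$header" = "$expected" ]
}

# Toy file with an extra trailing column n (made-up, header only)
tmp=$(mktemp -d)
printf '%s\n' "$expected n" | tr ' ' '\t' | gzip -c > "$tmp/toy.tsv.gz"
check_header "$tmp/toy.tsv.gz" && echo "header OK"
```

Extra columns after the tenth, such as a sample size column, are ignored by this check.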

We have a SLURM script,

#!/usr/bin/bash

#SBATCH --job-name=_gwas_catalog
#SBATCH --mem=28800
#SBATCH --time=12:00:00

#SBATCH --account CARDIO-SL0-CPU
#SBATCH --partition cardio
#SBATCH --qos=cardio

#SBATCH --export ALL
#SBATCH --array=1-91
#SBATCH --output=_%A_%a.o
#SBATCH --error=_%A_%a.e

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load ceuadmin/snakemake

export src=/rds/project/jmmh2/rds-jmmh2-projects/olink_proteomics/scallop/INF/METAL
export dst=~/rds/results/public/proteomics/scallop-inf1

if [ ! -f "${dst}/proteins.lst" ]; then
   ls ${src}/*gz | grep -v BDNF | xargs -l -I {} basename {} -1.tbl.gz | sed 's/-/\t/'| cut -f1 > ${dst}/proteins.lst
fi

export protein=$(awk 'NR==ENVIRON["SLURM_ARRAY_TASK_ID"]' ${dst}/proteins.lst)

(
  echo chromosome base_pair_location effect_allele other_allele beta standard_error effect_allele_frequency p_value variant_id rsid n
  zcat ${src}/${protein}-1.tbl.gz | \
  sed '1d' | \
  cut -f1-6,10-12,18 | \
  sort -k3,3 | \
  join -1 3 -2 1 - ${INF}/work/INTERVAL.rsid | \
  awk '
    {
      $4=toupper($4)
      $5=toupper($5)
      $1=$2"_"$3"_"$4"_"$5
      if(substr($11,1,2)!="rs") $11="NA"
      print $2,$3,$4,$5,$7,$8,$6,10^$9,$1,$11,int($10)
    }' | \
  sort -k1,1n -k2,2n
) | \
tr ' ' '\t' | \
Rscript -e '
  suppressMessages(library(dplyr))
  pgwas <- read.delim("stdin") %>%
           mutate(p_value=gap::pvalue(beta/standard_error))
  write.table(pgwas,quote=FALSE,row.names=FALSE,sep="\t")
' | \
bgzip -f > ${dst}/${protein}.tsv.gz
tabix -S1 -s1 -b2 -e2 -f ${dst}/${protein}.tsv.gz
gwas-ssf read ${dst}/${protein}.tsv.gz
gwas-ssf validate -e ${dst}/${protein}.tsv.gz

# gunzip -c  $src/4E.BP1-1.tbl.gz  | head -1 | tr '\t' '\n' | awk '{print "#"NR, $1}'
#1 Chromosome
#2 Position
#3 MarkerName
#4 Allele1
#5 Allele2
#6 Freq1
#7 FreqSE
#8 MinFreq
#9 MaxFreq
#10 Effect
#11 StdErr
#12 log(P)
#13 Direction
#14 HetISq
#15 HetChiSq
#16 HetDf
#17 logHetP
#18 N
# head -2 ~/INF/work/INTERVAL.rsid
# chr10:100000051_A_G rs141059932
# chr10:100000056_C_G 10:100000056_C_G
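The join and awk steps of the script can be tried on toy input. The sketch below uses made-up rows (none of the values come from the real data) and shows two things: how the columns are reshuffled, and that 10^log(P) evaluates to 0 once log(P) is very negative. The helper `log10p_to_sci` is a hypothetical alternative that formats log(P) directly as mantissa and exponent; it is not the approach taken in the script, which recomputes p_value in R.

```shell
#!/bin/sh
export LC_ALL=C                      # same collation for sort and join
tmp=$(mktemp -d)

# Columns as left by `cut -f1-6,10-12,18`:
# Chromosome Position MarkerName Allele1 Allele2 Freq1 Effect StdErr log(P) N
printf '%s\n' \
  '1 869388 chr1:869388_A_G a g 0.99 -0.02 0.01 -1 11000' \
  '2 70478797 chr2:70478797_T_TG t tg 0.93 0.02 0.002 -400 11000' \
  | sort -k3,3 > "$tmp/metal.txt"

# MarkerName-to-rsid map, sorted on the join field (toy entries)
printf '%s\n' \
  'chr1:869388_A_G chr1:869388_A_G' \
  'chr2:70478797_T_TG rs142640435' \
  | sort -k1,1 > "$tmp/rsid.txt"

# Same awk body as in the script: the join field comes first in the output
reformatted=$(
  join -1 3 -2 1 "$tmp/metal.txt" "$tmp/rsid.txt" | \
  awk '{
    $4=toupper($4); $5=toupper($5)
    $1=$2"_"$3"_"$4"_"$5
    if(substr($11,1,2)!="rs") $11="NA"
    print $2,$3,$4,$5,$7,$8,$6,10^$9,$1,$11,int($10)
  }' | sort -k1,1n -k2,2n
)
printf '%s\n' "$reformatted"         # p_value of the second row prints as 0

# log10p_to_sci X: format 10^X as mantissa and exponent without underflow
log10p_to_sci() {
  awk -v x="$1" 'BEGIN{
    e=int(x); if(e>x) e--                       # floor(x)
    m=10^(x-e)                                  # mantissa in [1,10)
    if(sprintf("%.2f",m)=="10.00"){m=1; e++}    # guard against rounding up
    printf "%.2fe%+03d\n", m, e
  }'
}
log10p_to_sci -400
```

Splitting log(P) into its integer and fractional parts keeps the exponent as text, so the result does not depend on the range of double-precision numbers.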

A number of proteins, including CCL25, CD6, CXCL6, FGF.5, IL.12B, IL.18R1 and TNFB, have p_value=0 after the conversion 10^log(P) in awk, so p_value is recomputed in R from beta/standard_error with gap::pvalue() as a generic solution. The metadata required by the submission form can then be obtained as follows,

cd ${dst}
md5sum *gz* > MD5
ls *gz | sed 's/.tsv.gz//' | \
parallel -j10 -C' ' '
  cat <(echo {}) \
      <(gunzip -c {}.tsv.gz | sed "1d" | wc -l) \
      <(grep -w {}.tsv.gz$ MD5 | sed "s/  /\t/") | \
  tr "\n" "\t"
  gunzip -c {}.tsv.gz | sed "1d" | cut -f11 | sort -k1,1nr | head -1
' | \
sort -k1,1 > meta.tsv
cd -

whose columns are protein name, number of variants, MD5 checksum, file name and sample size.
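The MD5 file written above can also be used to confirm that the files are intact, for instance after copying them elsewhere. A minimal sketch, assuming GNU coreutils `md5sum`; `verify_md5` and the toy file name are hypothetical.

```shell
#!/bin/sh
# verify_md5 DIR: re-check every file listed in DIR/MD5
# (--quiet suppresses the per-file OK lines; GNU coreutils option)
verify_md5() {
  ( cd "$1" && md5sum -c --quiet MD5 )
}

# Toy demonstration in a scratch directory; TOY.tsv.gz is a made-up file
scratch=$(mktemp -d)
printf 'toy\n' | gzip -c > "$scratch/TOY.tsv.gz"
( cd "$scratch" && md5sum *gz* > MD5 )
verify_md5 "$scratch" && echo "all checksums match"
```

With the real data the call would be `verify_md5 ${dst}`; a non-zero exit status indicates at least one mismatch.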

A post-hoc remapping of protein target names to gene symbols can be achieved as follows,

Rscript -e '
  suppressMessages(library(dplyr))
  ids <- pQTLdata::inf1 %>%
         filter(gene!="BDNF") %>%
         select(prot,target.short,gene)
  write.table(ids,col.names=FALSE,row.names=FALSE,quote=FALSE,sep="\t")
' | \
sort -k1,1 | \
join -t$'\t' <(ls ${dst} | grep -v tbi | sed 's/.tsv.gz//' | sort -k1,1) - > prot_target_gene.tsv

With Globus running and an LS RI profile in place (e.g., globus file manager, LS RI profile), we are ready to proceed from https://www.ebi.ac.uk/gwas/deposition. The submission page lists these steps,

1. Upload summary statistics file(s) to your Globus submission folder
2. Download submission form
3. Fill in submission form (see here for help)
4. Wait to receive an email confirmation from Globus that all summary statistics files have successfully been transferred
5. Submit submission form
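For the upload step, the Globus file manager in a browser is sufficient. As a hedged alternative with the CLI installed earlier, the files could be listed in a batch file for `globus transfer`. In the sketch below `make_batch` is a hypothetical helper, and the commented command rests on assumptions: `SRC_EP` would be the local Globus Connect Personal endpoint, `DST_EP` and the destination folder are placeholders for the values given on the submission page, and `--batch FILE` is the form used by recent globus-cli releases (`globus transfer --help` shows what the installed version accepts).

```shell
#!/bin/sh
# make_batch DIR: print one "source destination" line per *.tsv.gz file in DIR,
# with paths relative to the folders named on the transfer command line
make_batch() {
  for f in "$1"/*.tsv.gz; do
    [ -e "$f" ] || continue          # no match: the pattern is left unexpanded
    b=$(basename "$f")
    printf '%s %s\n' "$b" "$b"
  done
}

make_batch "${dst:-.}"               # inspect the list before transferring

# Assumed usage, not run here; endpoint IDs and folder are placeholders:
# SRC_EP=$(globus endpoint local-id)
# make_batch "${dst}" > batch.txt
# globus transfer "${SRC_EP}:${dst}/" "${DST_EP}:/SUBMISSION_FOLDER/" --batch batch.txt
```

Index files (.tbi) are deliberately left out of the list, since only the summary statistics files are named in the steps above.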

To remove the current submission form, click "Reset". Use "Review submission" to download the current submission form.

Upon success, email notifications are sent along with study accessions. It is possible to revisit the page via https://www.ebi.ac.uk/gwas/deposition/login.

¹ Hayhurst, J. et al. A community driven GWAS summary statistics standard. bioRxiv 2022.07.15.500230 (2022).