Parallel Computing
GNU Parallel
Web: https://www.gnu.org/software/parallel/
An example
To convert a large number of PDF files (INTERVAL.*.manhattan.pdf) to PNG with smaller file sizes, we start by building a file list and piping it into parallel.
ls *pdf | \
sed 's/INTERVAL.//g;s/.manhattan.pdf//g' | \
parallel -j8 -C' ' '
echo {}
pdftopng -r 300 INTERVAL.{}.manhattan.pdf {}
mv {}-000001.png INTERVAL.{}.png
'
An additional note on Bash functions is worthwhile, as demonstrated by the following script,
function turboman()
{
R --slave --vanilla --args \
input_data_path=${phenotype}.txt.gz \
output_data_rootname=${phenotype} \
custom_peak_annotation_file_path=${phenotype}.annotate \
reference_file_path=turboman_hg19_reference_data.rda \
pvalue_sign=1e-2 \
plot_title="${phenotype}" < turboman.r
}
export -f turboman
parallel -C' ' -j4 --env _ '
echo {}
export phenotype={}
turboman
' ::: chronotype sleep_duration insomnia snoring
where the function turboman is exported and called by parallel. The --env _ option copies all exported variables except those listed in ~/.parallel/ignored_vars, while env_parallel would copy all variables, exported or not. Note that these two usages need to be preceded by parallel --record-env and env_parallel --install, respectively.
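As a minimal sketch (assuming a Bash shell with GNU Parallel on the PATH), the one-off setup steps and an env_parallel call might look as follows; phenotype_dir is a hypothetical, non-exported variable,
parallel --record-env     # one-off, records current variables for --env _ to ignore
env_parallel --install    # one-off, enables env_parallel via ~/.bashrc; restart the shell afterwards
phenotype_dir=$HOME/phenotypes
env_parallel -j2 'echo {} ${phenotype_dir}' ::: chronotype sleep_duration insomnia snoring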
SLURM
Official website: https://slurm.schedmd.com/.
CSD3 User guide, https://docs.hpc.cam.ac.uk/hpc/user-guide/batch.html
Location at CSD3: /usr/local/Cluster-Docs/SLURM/. The directory /usr/local/software/slurm/current/bin/ contains all the executables.
Account details
mybalance
Note that after software updates on 26/4/2022, this command only works on non-login nodes such as icelake.
Partition
scontrol show partition
The skylake partitions have been decommissioned; see https://docs.hpc.cam.ac.uk/hpc/user-guide/cclake.html and https://docs.hpc.cam.ac.uk/hpc/user-guide/icelake.html. For the Ampere GPUs, see https://docs.hpc.cam.ac.uk/hpc/user-guide/a100.html.
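The state and node counts of a particular partition can also be inspected with sinfo, e.g.,
sinfo -p icelake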
An interactive job
CSD3 user guide for an interactive session, https://docs.hpc.cam.ac.uk/hpc/user-guide/interactive.html
sintr -A MYPROJECT -p skylake -N2 -n2 -t 1:0:0 --qos=INTR
and also
srun -N1 -n1 -c4 -p cclake-himem -t 12:0:0 --pty bash -i
Batch job
A batch job is submitted with sbatch. Sample scripts are available in /usr/local/Cluster-Docs/SLURM, e.g., a template for Skylake.
Starting a job at a specific time
This is achieved with the -b (--begin) option, e.g.,
sbatch --begin=now+3hour A1BG.sb
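Other time formats accepted by --begin include a clock time and a full timestamp, e.g.,
sbatch --begin=16:00 A1BG.sb
sbatch --begin=2025-01-20T12:34:00 A1BG.sb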
Holding and releasing jobs
Suppose a job with id 59230836 is in the queue; it can be held and released with
scontrol hold 59230836
scontrol release 59230836
respectively.
Monitoring jobs
This is done with the squeue command. The load of a specific partition can be checked with squeue -p <partition name>. For $USER, check with squeue -u $USER, qstat -u $USER and sacct.
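For example, jobs started today can be summarised with sacct; the field selection below is only illustrative,
sacct -u $USER -S today --format=JobID,JobName%20,Partition,State,Elapsed,MaxRSS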
Using modules
The following is part of a real project.
. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load gcc/6
module load aria2-1.33.1-gcc-5.4.0-r36jubs
For cclake, we have
. /etc/profile.d/modules.sh
module purge
module load rhel7/default-ccl
For icelake, we have
. /etc/profile.d/modules.sh
module purge
module load rhel8/default-icl
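Available and currently loaded modules can be inspected with, e.g.,
module avail gcc
module list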
Temporary directory
Although it is less apparent with a single run, SLURM jobs tend to use a large amount of temporary space, which can easily exceed the system default.
The following statement sets a temporary directory, i.e.,
export TMPDIR=/rds/user/$USER/hpc-work/
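A job-specific variant (a sketch; the subdirectory name is arbitrary) avoids clashes between concurrent jobs,
export TMPDIR=/rds/user/$USER/hpc-work/tmp_${SLURM_JOB_ID}
mkdir -p ${TMPDIR}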
Troubleshooting
With error message
squeue: error: _parse_next_key: Parsing error at unrecognized key:
InteractiveStepOptions
squeue: error: Parse error in file
/usr/local/software/slurm/slurm-20.11.4/etc/slurm.conf line 22:
"InteractiveStepOptions="--pty --preserve-env --mpi=none $SHELL""
squeue: fatal: Unable to process configuration file
then either log out and log in again, or
unset SLURM_CONF
An example
The GNU Parallel example above is turned into a SLURM implementation using job arrays (https://slurm.schedmd.com/job_array.html).
#!/usr/bin/bash
#SBATCH --ntasks=1
#SBATCH --job-name=pdftopng
#SBATCH --time=6:00:00
#SBATCH --cpus-per-task=8
#SBATCH --partition=skylake
#SBATCH --array=1-50%10
#SBATCH --output=pdftopng_%A_%a.out
#SBATCH --error=pdftopng_%A_%a.err
#SBATCH --export=ALL
export p=$(awk 'NR==ENVIRON["SLURM_ARRAY_TASK_ID"]' INTERVAL.list)
export TMPDIR=/rds/user/$USER/hpc-work/
echo ${p}
pdftopng -r 300 INTERVAL.${p}.manhattan.pdf ${p}
mv ${p}-000001.png INTERVAL.${p}.png
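The task list itself can be built beforehand with the same sed call as in the GNU Parallel example; assuming the job script above is saved as pdftopng.sb (an arbitrary name) and the --array range matches the number of lines in INTERVAL.list,
ls INTERVAL.*.manhattan.pdf | \
sed 's/INTERVAL.//g;s/.manhattan.pdf//g' > INTERVAL.list
sbatch pdftopng.sb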
To embed a SLURM call in a Bash script, one can use sbatch --wait <SLURM script>. The SLURM script can also be embedded inside the Bash script itself.
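A minimal sketch of the latter, with the SLURM script supplied via a here-document (the job name, time limit and partition are only illustrative),
sbatch --wait <<'EOF'
#!/usr/bin/bash
#SBATCH --job-name=embedded
#SBATCH --time=0:10:00
#SBATCH --partition=icelake
echo "running on $(hostname)"
EOF
echo "the job has finished"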
Other approaches
The following command moves all files modified more than a day ago to the directory old/,
find . -mtime +1 | xargs -l -I {} mv {} old
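Should file names contain spaces, a null-delimited variant is safer,
find . -mtime +1 -print0 | xargs -0 -I {} mv {} old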
The code below downloads the SCALLOP-cvd1 sumstats for proteins listed in cvd1.txt.
export url=https://zenodo.org/record/2615265/files/
if [ ! -d ~/rds/results/public/proteomics/scallop-cvd1 ]; then mkdir ~/rds/results/public/proteomics/scallop-cvd1; fi
cat cvd1.txt | xargs -I {} bash -c "wget ${url}/{}.txt.gz -O ~/rds/results/public/proteomics/scallop-cvd1/{}.txt.gz"
# ln -s ~/rds/results/public/proteomics/scallop-cvd1
The following example illustrates cancelling jobs with status "PD" (pending) while leaving running jobs untouched,
$ squeue -u jhz22
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
41800594_[1-4] cclake-hi _3-CADM1 jhz22 PD 0:00 1 (None)
41800593_[1-11] cclake-hi _3-CADH6 jhz22 PD 0:00 1 (None)
41800592_[1-44] cclake-hi _3-CADH5 jhz22 PD 0:00 1 (None)
41800591_[1-13] cclake-hi _3-CADH1 jhz22 PD 0:00 1 (None)
41800590_[1-2] cclake-hi _3-CAD19 jhz22 PD 0:00 1 (None)
41800589_[1-19] cclake-hi _3-CA2D1 jhz22 PD 0:00 1 (None)
41800588_[1-43] cclake-hi _3-C4BPA jhz22 PD 0:00 1 (None)
41800587_[1-113] cclake-hi _3-C1S jhz22 PD 0:00 1 (None)
41800586_[1-13] cclake-hi _3-C1RL jhz22 PD 0:00 1 (None)
41800585_[1-90] cclake-hi _3-C1R jhz22 PD 0:00 1 (None)
41800584_[1-3] cclake-hi _3-C1QR1 jhz22 PD 0:00 1 (None)
41800583_[1-23] cclake-hi _3-C1QC jhz22 PD 0:00 1 (None)
41800582_[1-41] cclake-hi _3-C1QB jhz22 PD 0:00 1 (None)
41800581_[1-5] cclake-hi _3-BTK jhz22 PD 0:00 1 (None)
41800580_[1-11] cclake-hi _3-BST1 jhz22 PD 0:00 1 (None)
41800344_[1] cclake-hi _3-AMYP jhz22 PD 0:00 1 (Priority)
41800236_51 cclake-hi _3-ALS jhz22 R 0:31 1 cpu-p-198
41800236_53 cclake-hi _3-ALS jhz22 R 0:31 1 cpu-p-597
41800337_12 cclake-hi _3-AMBP jhz22 R 0:31 1 cpu-p-490
41800337_14 cclake-hi _3-AMBP jhz22 R 0:31 1 cpu-p-490
41800338_3 cclake-hi _3-AMD jhz22 R 0:31 1 cpu-p-251
41800162_1 cclake-hi _3-AGRG6 jhz22 R 1:14 1 cpu-p-251
41800059_160 cclake-hi _3-AACT jhz22 R 1:34 1 cpu-p-417
41798125_1 cclake-hi _3-AGRG6 jhz22 R 7:19 1 cpu-p-418
41797610_160 cclake-hi _3-AACT jhz22 R 7:52 1 cpu-p-597
41747018_214 cclake-hi _3-ITIH2 jhz22 R 4:07:36 1 cpu-p-245
41705768_214 cclake-hi _3-ITIH2 jhz22 R 11:18:49 1 cpu-p-245
or with squeue -u $USER --state=suspend -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R".
We can also use xargs,
squeue -u jhz22 | grep PD | awk '{print $1}' | xargs -l -I {} scancel {}
To cancel jobs on a specific partition, add -p <partition-name>.
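Alternatively, scancel can filter directly (substitute the partition name as appropriate),
scancel -u $USER -p cclake-himem --state=PENDING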