Software
ceuadmin
See https://cambridge-ceu.github.io/csd3/systems/ceuadmin.html for additional information.
gcc
It is one of the most critical pieces of software, e.g.,
module avail gcc
gcc --version
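As a quick sanity check one can compile a trivial program; the version in module load gcc/9 is an assumption, so pick one listed by module avail gcc.
module load gcc/9                        # assumed version; see module avail gcc
printf 'int main(void){return 0;}\n' > hello.c
gcc -O2 -o hello hello.c && ./hello && echo OK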
gfortran
gfortran --version
ghc
The Glasgow Haskell Compiler is seen from module avail ghc, e.g.,
module load ghc/8.2.2
ghc --version
git
The popular version control tool git can be loaded,
git --help
git add --help
Go
It is available from /usr/bin/go and is also visible from module avail go.
Java
module avail openjdk
java -version
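A minimal compile-and-run check, sketched here with an illustrative Hello.java (the exact version is left to module avail openjdk):
module load openjdk                      # pick a specific version as required
cat > Hello.java <<'EOF'
public class Hello {
    public static void main(String[] args) { System.out.println("hello, world"); }
}
EOF
javac Hello.java && java Hello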
Julia
The Julia compiler is visible from module avail julia, and by default it loads 1.6.2,
module load gcc/9 julia
julia --version
Here is the "hello, world!" session,
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.2 (2021-07-14)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |
julia> println("Hello World")
Hello World
julia>
LibreOffice
Official website: https://www.libreoffice.org/
The executables (oocalc, ooffice, ooimpress, oomath, ooviewdoc, oowriter) are in the /usr/bin directory and can conveniently be called from the console, e.g.,
oowriter README.docx
to load the Word document.
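LibreOffice can also run headlessly, which is handy on a node without a display; a sketch, assuming the soffice binary accompanies the wrappers in /usr/bin:
# convert a Word document to PDF without opening the GUI
soffice --headless --convert-to pdf README.docx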
matlab
Official website: https://www.mathworks.com/products/matlab.html.
module avail matlab
module load matlab/r2019b
followed by matlab.
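For non-interactive use, MATLAB can also be run in batch mode; the expression below is purely illustrative:
# -nodisplay/-nosplash suppress the GUI; -r runs the quoted commands
matlab -nodisplay -nosplash -r "disp(sqrt(2)); exit"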
MySQL
One could access databases elsewhere, e.g., at UCSC – see examples on VEP.
There isn't any MySQL cluster running as a general service on CSD3, although a group may have one running on a VM hosted on the network. If you need a database for your work, consider running it in your own department and then allowing access to it from CSD3. Databases are not suitable candidates to run on an HPC cluster: the resource requirements are different and, by definition, they need to be running continuously whilst access is required, so they would not be run via SLURM, for example.
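For instance, the UCSC public server can be queried directly, assuming outbound connections on the MySQL port are permitted from CSD3; genome is UCSC's documented read-only account:
# query the UCSC public MySQL server (read-only, no password)
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A \
  -e 'select chrom,txStart,txEnd,name2 from refGene limit 5' hg38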
pspp
Official website: https://www.gnu.org/software/pspp/
module load ceuadmin/pspp
which provides the command-line tool pspp and a GUI counterpart psppire.
Python
Official website: https://www.python.org/.
This can be invoked from a CSD3 console via python or python3. Libraries can be installed via pip or pip3 (or equivalently python -m pip install and python3 -m pip install), e.g., the script
pip install mygene --user
pip install tensorflow --user
pip install keras --user
pip install jupyter --user
installs libraries at $HOME/.local.
It is advisable to use virtual environments, i.e.,
# inherit system-wide packages as well
module load python/3.5
virtualenv --system-site-packages py35
source py35/bin/activate
# pip install any new packages here
deactivate
An alternative syntax is python3 -m venv py37; note that once this is set up, one only needs to restart from the source command.
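A minimal venv session might look as follows; the module name python/3.7 is an assumption, to be checked against module avail python:
module load python/3.7            # assumed module name
python3 -m venv py37
source py37/bin/activate
pip install numpy                 # packages now install into py37
deactivate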
The pip approach is appropriate for installing a small number of packages; otherwise Anaconda (https://www.anaconda.com/) and Jupyter notebooks (https://jupyter.org/) are useful.
We first load Anaconda and create virtual environments,
module avail miniconda
module load miniconda/2
conda create -n py27 python=2.7 ipykernel
source activate py27
for Python 2.7 at /home/$USER/.conda/envs/py27, where envs could be replaced via the --prefix option. These steps are only required once.
We can also load Anaconda and activate Python 3.5,[1]
module load miniconda/3
conda create -n py35 python=3.5 ipykernel
source activate py35
and follow the Autoencoder in Keras tutorial, with data from http://yann.lecun.com/exdb/mnist/.
The Jupyter notebook can be started as follows,
hostname
$HOME/.local/bin/jupyter notebook --ip=127.0.0.1 --no-browser --port 8081
If it fails to assign the port number, let the system choose one (by dropping the --port option). The process using the port can be shown with lsof -i:8081 or stopped with lsof -ti:8081 | xargs kill -9. The command hostname gives the name of the node. Once the port number is assigned, it is used by another ssh session elsewhere, and the generated URL can be opened from a browser, e.g.,
ssh -4 -L 8081:127.0.0.1:8081 -fN <hostname>.hpc.cam.ac.uk
firefox <URL> &
paying attention to the port number as it may change.
A hello world example is hello.ipynb, from which hello.html and hello.pdf were generated with jupyter nbconvert --to html|pdf hello.ipynb.
See also https://docs.hpc.cam.ac.uk/hpc/software-tools/python.html#using-anaconda-python and https://docs.hpc.cam.ac.uk/hpc/software-packages/jupyter.html.
See the HPC documentation for additional information on PyTorch, TensorFlow and GPU.
Introduction to HPC in Python (GitHub).
R
Official website: https://www.r-project.org/ and also https://bioconductor.org/.
Under HPC, the default version is 3.3.3 at /usr/bin/R; alternatively, choose the desired version of R from
module avail R
module avail r
# if you would also like to use RStudio
module avail rstudio
e.g., module load r-3.6.0-gcc-5.4.0-bzuuksv rstudio/1.1.383.
For information about Bioconductor installation, see https://bioconductor.org/install/.
The following code installs packages for weighted correlation network analysis (WGCNA).
# from CRAN
dependencies <- c("matrixStats", "Hmisc", "splines", "foreach", "doParallel",
"fastcluster", "dynamicTreeCut", "survival")
install.packages(dependencies)
# from Bioconductor
biocPackages <- c("GO.db", "preprocessCore", "impute", "AnnotationDbi")
# R < 3.5.0
source("http://bioconductor.org/biocLite.R")
biocLite(biocPackages)
# R >= 3.5.0
install.packages("BiocManager")
BiocManager::install(biocPackages)
install.packages("WGCNA")
A good alternative is to use the remotes or devtools package, e.g.,
remotes::install_bioc("snpStats")
A separate example is from r-forge, e.g.,
rforge <- "http://r-forge.r-project.org"
install.packages("estimate", repos=rforge, dependencies=TRUE)
In case of difficulty, it is still useful to install directly, e.g.,
wget http://master.bioconductor.org/packages//2.10/bioc/src/contrib/ontoCAT_1.8.0.tar.gz
R CMD INSTALL ontoCAT_1.8.0.tar.gz
# Alternatively,
R -e "install.packages('ontoCAT_1.8.0.tar.gz',repos=NULL)"
The package installation directory can be specified explicitly with R_LIBS, i.e.,
export R_LIBS=/rds/user/$USER/hpc-work/R:/rds/user/$USER/hpc-work/R-3.6.1/library
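The directories need to exist before R will use them; a quick check, where .libPaths() should list the R_LIBS entries first:
mkdir -p /rds/user/$USER/hpc-work/R
export R_LIBS=/rds/user/$USER/hpc-work/R:/rds/user/$USER/hpc-work/R-3.6.1/library
R --no-save -q -e '.libPaths()'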
To upgrade Bioconductor, we can specify as follows,
BiocManager::install(version = "3.14")
ruby
module load ceuadmin/ruby
ruby --version
SLURM
Official website: https://slurm.schedmd.com/.
Location at CSD3: /usr/local/Cluster-Docs/SLURM/.
Account details
mybalance
Note that after software updates on 26/4/2022, this command only works on non-login nodes such as icelake.
Partition
scontrol show partition
An interactive job can be started with
sintr -A MYPROJECT -p skylake -N2 -n2 -t 1:0:0 --qos=INTR
and also
srun -N1 -n1 -c4 -p skylake-himem -t 12:0:0 --pty bash -i
then check with squeue -u $USER, qstat -u $USER and sacct. The directory /usr/local/software/slurm/current/bin/ contains all the executables, while sample scripts are in /usr/local/Cluster-Docs/SLURM, e.g., a template for Skylake.
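A minimal batch script in the spirit of those templates might read as follows, where the account MYPROJECT and the module names are placeholders to adapt:
#!/bin/bash
#SBATCH -A MYPROJECT                 # replace with your project account
#SBATCH -p skylake
#SBATCH -N1 -n1
#SBATCH -t 1:0:0
. /etc/profile.d/modules.sh          # enable the module command
module purge
module load rhel7/default-peta4
echo "Running on $(hostname)"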
NOTE that the Skylake nodes are approaching end of life; see https://docs.hpc.cam.ac.uk/hpc/user-guide/cclake.html and https://docs.hpc.cam.ac.uk/hpc/user-guide/icelake.html. For Ampere GPUs, see https://docs.hpc.cam.ac.uk/hpc/user-guide/a100.html.
Start a job at a specific time
This is achieved with the -b or --begin option, e.g.,
sbatch --begin=now+3hour A1BG.sb
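SLURM also accepts absolute times for --begin, e.g. (reusing the script name from above; the date shown is purely illustrative):
sbatch --begin=16:00 A1BG.sb                 # today at 16:00, or tomorrow if already past
sbatch --begin=2024-01-20T12:34:00 A1BG.sb   # a specific date and time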
Holding and releasing jobs
Suppose a job with id 59230836 is queued or running; it can be held and released with
scontrol hold 59230836
scontrol release 59230836
respectively.
Monitoring jobs
This is done with the squeue command, as above.
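Its output can be tailored with a format string, e.g.,
# job id, partition, name, state, elapsed time and reason/node list
squeue -u $USER -o '%.18i %.9P %.20j %.2t %.10M %R'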
Use of modules
The following is part of a real implementation.
. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load gcc/6
module load aria2-1.33.1-gcc-5.4.0-r36jubs
An example
To convert a large number of PDF files (INTERVAL.*.manhattan.pdf) to PNG with smaller file sizes, we first build a file list and pipe it into parallel.
ls *pdf | \
sed 's/INTERVAL.//g;s/.manhattan.pdf//g' | \
parallel -j8 -C' ' '
echo {}
pdftopng -r 300 INTERVAL.{}.manhattan.pdf
mv {}-000001.png INTERVAL.{}.png
'
which is equivalent to SLURM implementation using job arrays (https://slurm.schedmd.com/job_array.html).
#!/usr/bin/bash
#SBATCH --ntasks=1
#SBATCH --job-name=pdftopng
#SBATCH --time=6:00:00
#SBATCH --cpus-per-task=8
#SBATCH --partition=skylake
#SBATCH --array=1-50%10
#SBATCH --output=pdftopng_%A_%a.out
#SBATCH --error=pdftopng_%A_%a.err
#SBATCH --export=ALL
export p=$(awk 'NR==ENVIRON["SLURM_ARRAY_TASK_ID"]' INTERVAL.list)
export TMPDIR=/rds/user/$USER/hpc-work/
echo ${p}
pdftopng -r 300 INTERVAL.${p}.manhattan.pdf ${p}
mv ${p}-000001.png INTERVAL.${p}.png
invoked by sbatch. As with Cardio, it is helpful to set a temporary directory, i.e.,
export TMPDIR=/rds/user/$USER/hpc-work/
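The array script reads the trait names from INTERVAL.list, which can be generated in the same way as in the parallel pipeline above, e.g.,
ls INTERVAL.*.manhattan.pdf | \
sed 's/INTERVAL.//;s/.manhattan.pdf//' > INTERVAL.list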
Neither parallel
nor SLURM
The following command moves all files older than a day into the directory old/,
find . -mtime +1 | xargs -l -I {} mv {} old
while the code below downloads the SCALLOP-cvd1 sumstats for proteins listed in cvd1.txt.
export url=https://zenodo.org/record/2615265/files/
if [ ! -d ~/rds/results/public/proteomics/scallop-cvd1 ]; then mkdir ~/rds/results/public/proteomics/scallop-cvd1; fi
cat cvd1.txt | xargs -I {} bash -c "wget ${url}/{}.txt.gz -O ~/rds/results/public/proteomics/scallop-cvd1/{}.txt.gz"
# ln -s ~/rds/results/public/proteomics/scallop-cvd1
Troubleshooting
With the error message
squeue: error: _parse_next_key: Parsing error at unrecognized key:
InteractiveStepOptions
squeue: error: Parse error in file
/usr/local/software/slurm/slurm-20.11.4/etc/slurm.conf line 22:
"InteractiveStepOptions="--pty --preserve-env --mpi=none $SHELL""
squeue: fatal: Unable to process configuration file
then either log out and log in again, or run
unset SLURM_CONF
Stata
Official website: https://www.stata.com/.
As a CEU member the following is possible,
module load ceuadmin/stata/14
as with ceuadmin/stata/15. The meta-analysis (metan) and Mendelian Randomisation (mrrobust) packages can be installed as follows,
ssc install metan
net install mrrobust, from("https://raw.github.com/remlapmot/mrrobust/master/") replace
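Stata can also be run non-interactively with a do-file; the file name analysis.do is illustrative (the binary may be stata-mp or stata-se depending on the licence):
# batch mode; output goes to analysis.log in the current directory
stata -b do analysis.do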
[1] The following error
CustomValidationError: Parameter channel_priority = 'flexible' declared in <<merged>> is invalid. The value 'flexible' cannot be boolified.
can be resolved with conda config --set channel_priority false.