sra-tools
Web: https://github.com/ncbi/sra-tools
ncbi-vdb
The installation is preceded by module load gcc/6 flex-2.6.4-gcc-5.4.0-2u2fgon. Although configure is provided, cmake is used instead.
cd $CEUADMIN/ncbi-vdb
wget -qO- https://github.com/ncbi/ncbi-vdb/archive/refs/tags/3.0.8.tar.gz | tar xvfz -
mv ncbi-vdb-3.0.8 3.0.8
cd 3.0.8/build
cmake -DCMAKE_PREFIX_PATH=$CEUADMIN/ncbi-vdb/3.0.8 -DCMAKE_INSTALL_PREFIX=$CEUADMIN/ncbi-vdb/3.0.8 ..
make
make install
sra-tools
First, create a symbolic link to ncbi-vdb/3.0.8 in the parent directory with ln -s ../ncbi-vdb/3.0.8/ ncbi-3.0.8, since both are within $CEUADMIN.
Drop constexpr from the declaration constexpr size_type max_size() const { return SIZE_MAX; } on line 161 of the following header file:
/usr/local/Cluster-Apps/ceuadmin/sra-tools/3.0.8/tools/external/driver-tool/util.hpp
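The edit can also be scripted rather than made by hand. The following is a minimal sketch, assuming the file and line number quoted above; drop_constexpr is a hypothetical helper name (not part of sra-tools), and the substitution is restricted to the one line so that other uses of constexpr in the header are left alone.

```shell
#!/usr/bin/bash
# drop_constexpr FILE LINE: remove the first "constexpr " on the given line only.
# Hypothetical helper for illustration; it is not shipped with sra-tools.
drop_constexpr() {
  local file=$1 line=$2 tmp
  tmp=$(mktemp) || return 1
  # Address the substitution to a single line, write to a temporary file,
  # then copy the result back so the original file keeps its permissions.
  sed "${line}s/constexpr //" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage, with the path and line number given in the text:
# drop_constexpr /usr/local/Cluster-Apps/ceuadmin/sra-tools/3.0.8/tools/external/driver-tool/util.hpp 161
```

It is worth checking line 161 with sed -n 161p before and after, since the line number may differ in other releases.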
Now we proceed as for ncbi-vdb above.
cd $CEUADMIN/sra-tools
wget -qO- https://github.com/ncbi/sra-tools/archive/refs/tags/3.0.8.tar.gz | tar xvfz -
mv sra-tools-3.0.8 3.0.8
cd 3.0.8/build
cmake -DVDB_LIBDIR=$CEUADMIN/ncbi-vdb/3.0.8/lib64 -DCMAKE_INSTALL_PREFIX=$CEUADMIN/sra-tools/3.0.8 ..
make
make install
modules
This is available via module load ceuadmin/sra-tools.
Application
We have gastric.list with two records,
SRR8244777
SRR8244854
Our SLURM script, gastric.sb, is as follows,
#!/usr/bin/bash
#SBATCH --job-name gastric
#SBATCH --account PETERS-SL3-CPU
#SBATCH --partition icelake-himem
#SBATCH --array=1-2
#SBATCH --time=10:00:00
#SBATCH --mail-type=NONE
#SBATCH --output=/rds/project/jmmh2/rds-jmmh2-public_databases/CPTAC/TEMP/_gastric_%A_%a.o
#SBATCH --error=/rds/project/jmmh2/rds-jmmh2-public_databases/CPTAC/TEMP/_gastric_%A_%a.e
. /etc/profile.d/modules.sh
module purge
module load rhel8/default-icl
module load ceuadmin/sra-tools/3.0.8
TMPDIR=~/rds/rds-jmmh2-public_databases/CPTAC/TEMP/
destdir=~/rds/rds-jmmh2-public_databases/CPTAC/TEMP/gastric_Korea_2019/SRA_PRJNA505380/
accession=$(awk -v n=${SLURM_ARRAY_TASK_ID} 'NR==n' gastric.list)
application="fasterq-dump"
options="-t ${TMPDIR} -O ${destdir} ${accession}"
cd $TMPDIR
echo -e "Changed directory from $SLURM_SUBMIT_DIR to $TMPDIR.\n"
CMD="$application $options"
eval $CMD
echo -e "JobID: $SLURM_JOB_ID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
if [ "$SLURM_JOB_NODELIST" ]; then
#! Create a machine file:
export NODEFILE=`generate_pbs_nodefile $SLURM_JOB_NODELIST`
cat $NODEFILE | uniq > machine.file.$SLURM_ARRAY_TASK_ID
echo -e "\nNodes allocated:\n================"
cat machine.file.$SLURM_ARRAY_TASK_ID | sed -e 's/\..*$//g'
fi
echo -e "\nExecuting command:\n==================\n$CMD\n"
cd -
and is submitted with sbatch gastric.sb.
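The --array range in the script has to match the number of records in gastric.list, because each task picks its accession by line number. A minimal sketch of that pairing follows; array_spec and accession_for_task are hypothetical helper names, and the list is assumed to contain one accession per line with no blank lines between records.

```shell
#!/usr/bin/bash
# Hypothetical helpers for pairing gastric.list with the SLURM array range.

# array_spec LIST: print a range such as 1-2 covering every non-empty line.
array_spec() {
  local n
  n=$(grep -c . "$1")
  echo "1-${n}"
}

# accession_for_task LIST N: print the N-th record,
# mirroring the awk call used inside gastric.sb.
accession_for_task() {
  awk -v n="$2" 'NR==n' "$1"
}

# Usage: options given on the sbatch command line take precedence over
# the #SBATCH directives, so the range can be set at submission time.
# sbatch --array=$(array_spec gastric.list) gastric.sb
```

This keeps the script unchanged when records are added to or removed from the list.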
By default, the download is split into three parts, which can be combined as follows,
fasterq-dump <accession> --concatenate-reads --include-technical
See also https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump.
Additional notes
The call to fasterq-dump in the SLURM script above could produce an error as follows,
Loading rhel8/default-icl
Loading requirement: dot rhel8/slurm singularity/current rhel8/global
cuda/11.4 vgl/2.5.1/64 intel-oneapi-compilers/2022.1.0/gcc/b6zld2mz
intel-oneapi-mpi/2021.6.0/intel/guxuvcpm
Loading ceuadmin/sra-tools/3.0.8
Loading requirement: gcc/6 flex-2.6.4-gcc-5.4.0-2u2fgon
2023-10-10T15:52:30 sratools.3.0.8 err: libs/vfs/names4-response.c:2293:Response4StatusInit: error unexpected while resolving query within virtual file system module - No accession to process ( 500 )
Failed to call external services.
Using prefetch instead gives a similar error.
The problem appears to lie with SLURM, since vdb-dump <accession> confirms that the accessions are available and prefetch <accession> works from an interactive Linux session.
Consequently, we resort to GNU parallel as follows,
#!/usr/bin/bash
module load perl-5.20.0-gcc-5.4.0-4npvg5p
module load ceuadmin/sra-tools
export TMPDIR=~/rds/rds-jmmh2-public_databases/CPTAC/TEMP
export cwd=${PWD}
cd ${TMPDIR}/prefetch
cat ${cwd}/gastric.list | \
parallel -C' ' -j5 '
export accession={}
(
vdb-dump ${accession} --info
prefetch --force ALL --transport http --max-size u --progress ${accession}
) > ${accession}.log
'
cd -
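After the run it is useful to confirm that every accession was actually fetched. The sketch below assumes the usual prefetch layout of <accession>/<accession>.sra under the download directory (some accessions may instead arrive with a different extension, in which case the check needs adjusting); missing_accessions is a hypothetical helper name.

```shell
#!/usr/bin/bash
# missing_accessions LIST DIR: print each accession in LIST that has no
# non-empty DIR/<accession>/<accession>.sra file.
# Hypothetical helper; the directory layout is an assumption stated above.
missing_accessions() {
  local list=$1 dir=$2 accession
  while read -r accession; do
    # Skip blank lines in the list.
    [ -n "$accession" ] || continue
    [ -s "${dir}/${accession}/${accession}.sra" ] || echo "${accession}"
  done < "$list"
}

# Usage, with the directory used in the script above:
# missing_accessions gastric.list ${TMPDIR}/prefetch
```

An empty result means all records are present; any accession printed can be retried, and its .log file inspected.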
Now the issue with Perl also goes away under cclake-himem, and fasterq-dump can be used to extract the .sra file for each accession once remote access has been disabled through vdb-config -i.
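The extraction from the local files can be sketched as below. This is only an illustration: fasterq_commands is a hypothetical helper that prints one command per accession instead of running it, and it assumes the prefetched files sit at <accession>/<accession>.sra under the download directory. The -t and -O options are the same ones used in gastric.sb.

```shell
#!/usr/bin/bash
# fasterq_commands LIST SRC DEST TMP: print one fasterq-dump command per
# accession in LIST, reading SRC/<accession>/<accession>.sra, writing to DEST
# and using TMP as scratch space. Nothing is executed here.
# Hypothetical helper; the file layout is an assumption stated above.
fasterq_commands() {
  local list=$1 src=$2 dest=$3 tmp=$4 accession
  while read -r accession; do
    # Skip blank lines in the list.
    [ -n "$accession" ] || continue
    echo "fasterq-dump -t ${tmp} -O ${dest} ${src}/${accession}/${accession}.sra"
  done < "$list"
}

# Usage: review the printed commands first, then pipe them to bash to run them.
# fasterq_commands gastric.list ${TMPDIR}/prefetch ${destdir} ${TMPDIR} | bash
```

Printing the commands first makes it easy to check the paths before committing to a long run.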