Genomics Compute Cluster User Guide#

Overview and Purpose#

The Genomics Compute Cluster (GCC) provides the computational resources necessary for cutting-edge genomics research at Northwestern. It includes access to high-performance computing (HPC) resources as well as technical support through trainings and consultations. Users can seek assistance through dedicated technical support channels or connect with peers through a community Slack forum.

In addition to this user guide, we provide a Quick Start page and information on the service on the Northwestern IT website.

The GCC is part of Quest, and the guidance in the Quest User Guide is useful for GCC users as well. This page covers details and settings that differ for GCC users as compared to General Access users of Quest, as well as information on genomics-related tools and resources provided through the GCC.

Allocation b1042#

The GCC is managed as a Priority Access allocation on Quest, with the allocation name b1042. Allocation b1042 functions like any other allocation: members have access to compute and GPU nodes through partitions for running jobs, as well as storage associated with the allocation. If a Quest user is a member of the GCC, b1042 will be listed alongside any other allocations they are a member of when using the groups command.
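For example, running groups at the Quest command line might produce output along these lines (the group names other than b1042 are placeholders; your own output will list the groups and allocations you belong to):

$ groups
abc123 b1042 p31234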

More information on allocations can be found on the Northwestern Quest allocation types page.

Scratch storage within /projects/b1042#

The Genomics Compute Cluster provides researchers with shared scratch storage space in /projects/b1042 for short-term storage. For maximum flexibility, this scratch space is open to all users of the Genomics Compute Cluster, with a 40 TB quota in place for individual researchers.

As this is temporary scratch storage, files may be stored in /projects/b1042 for no longer than one month; unused files expire and are automatically deleted. The deletion process is outlined in detail below.

Making a Directory#

Members can navigate to /projects/b1042 where they can create a directory for their own use. Use the mkdir command followed by the name of the directory you want to create.

$ cd /projects/b1042
$ mkdir examplefolder

If a directory with that name (examplefolder) already exists, you will get an error message:

mkdir: cannot create directory ‘examplefolder’: File exists

Managing Permissions#

You may use chmod to manage the permissions of your own files and directories within /projects/b1042. This allows you to control whether the group, which is all of b1042, has read, write, or execute permissions for files that you own. Read permissions allow a user to read a file (or use a piece of software to read a file) as well as make a copy of it. Write permissions allow a user to edit a file. Execute permissions allow a user to execute files that contain code or commands. You may add or remove read, write, and execute permissions for the b1042 group on a per-file basis; however, you cannot grant different permissions to only a subset of b1042 members.
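As a sketch, the commands below show how group permissions might be adjusted for a hypothetical file and directory that you own (the names are placeholders):

$ chmod g+r examplefile.txt   # allow the b1042 group to read the file
$ chmod g-w examplefile.txt   # remove group write permission
$ chmod g+rx examplefolder    # allow the group to list and enter the directory
$ ls -l examplefile.txt       # verify the current permissions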

Checking Storage Use#

To see how much storage you are using in /projects/b1042, use the command b1042check at the Quest command line. This will report how many files and directories you own within /projects/b1042 and how much of the 40 TB quota is being used.

$ b1042check
abc123 currently has 80395 files & directories using 2424 GB of their 40960 GB quota in /projects/b1042

Deletion Process#

All storage space within /projects/b1042 and its subdirectories is scratch, meaning temporary storage where unused files are regularly deleted. We run an audit process 10 days before the end of each month to identify expired files and email affected users a notification that their files are expiring. Users whose files were identified in the audit also receive a reminder email 2 days before the end of each month. All files that have not been accessed or modified in the previous 31 days are deleted on the first of every month. We do not back up this or any other projects directory on Quest and cannot recover deleted files.
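If you want to check ahead of the audit which of your files are approaching that threshold, a find command along these lines can help (a sketch; the directory is a placeholder, and the -atime +31 and -mtime +31 tests match files that have been neither accessed nor modified in the past 31 days):

$ find /projects/b1042/examplefolder -user $USER -type f -atime +31 -mtime +31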

The 10-day file expiry notification email includes a list of files identified for deletion. A zipped version of this list is also saved in each user’s home directory on Quest. You can unzip and view it as in the example below:

$ unzip b1042_hsc945_list.20240721.zip 
Archive: b1042_hsc945_list.20240721.zip 
inflating: hpc/pipspace/scratch/genomics/b1042_hsc945_list.20240721.txt 

$ cat hpc/pipspace/scratch/genomics/b1042_hsc945_list.20240721.txt

Note that depending on the number of expiring files, the output of cat might be unwieldy; it may be worth using grep to find particular files, as in the example below. This list is not dynamic and will not update as you access, move, or remove the files named within it.
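For example, to check whether any files from a particular directory appear in the list (using the list file name from the example above; the directory name is a placeholder):

$ grep 'examplefolder' hpc/pipspace/scratch/genomics/b1042_hsc945_list.20240721.txt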

Users may use the stat command to check the access and last-modified dates of individual files, following the example below:

$ stat examplefile.txt
File: ‘examplefile.txt’
Size: 0         Blocks: 0          IO Block: 8388608 regular empty file
Device: 2bh/43d Inode: 86072091374  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 6144/  hsc945)   Gid: (111111042/   b1042)
Access: 2024-07-29 15:34:46.648804000 -0500
Modify: 2024-07-30 12:05:14.143460931 -0500
Change: 2024-07-30 12:05:14.143460931 -0500
Birth: -

The dates are in year-month-day format, and the “Access” and “Modify” dates are the ones used in the deletion process. Access refers to a file being opened; modify refers to the contents of a file changing. stat can also take the wildcard character * to match patterns and check multiple files at once, as shown below.
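For example, to print just the file name, last access time, and last modification time for every file matching a pattern (a sketch assuming GNU coreutils stat and FASTQ files in the current directory):

$ stat --format='%n  accessed: %x  modified: %y' *.fastq.gz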

NUSeq Files#

If you have sequencing done by the NUSeq Core, they can deliver your FASTQ files to your directory in /projects/b1042. However, this is scratch space and not permanent file storage. We recommend applying for a General Access Quest allocation and moving your files to storage associated with that allocation, where they can be stored beyond the one-month limit for /projects/b1042.

Running Jobs on the GCC#

Allocation b1042 is not a General Access allocation and therefore cannot be used to submit jobs to the general access partitions. Instead, there are GCC-specific partitions that can be used to submit jobs to the GCC nodes. Which partitions you should use depends on your school affiliation and the resources needed for your job. Available partitions for different schools are listed below.

Partitions For Feinberg and Weinberg Researchers#

Partition Name      Maximum Job Duration    CPUs Per Node    RAM Per Node
genomics            2 days                  52 or 64         192 or 256 GB
genomicslong        10 days                 52 or 64         192 or 256 GB
genomics-gpu        2 days                  52               192 GB
genomics-himem      7 days                  64               2 TB
genomics-burst      project-based access: see the genomicsburst section below

Partitions For All Other Researchers#

Partition Name         Maximum Job Duration    CPUs Per Node    RAM Per Node
genomicsguest          2 days                  52 or 64         192 or 256 GB
genomicsguestex        10 days or less         52 or 64         192 or 256 GB
genomicsguest-gpu      2 days                  52               192 GB
genomicsguest-himem    7 days                  64               2 TB

genomicsburst#

For large jobs requiring many nodes, or that run for more than 240 hours, users may request access to the genomicsburst partition by contacting quest-help@northwestern.edu. Before using the genomicsburst partition, users must meet with a Research Computing Services consultant to confirm that code has been reviewed for efficiency. We advise reaching out to schedule this consultation at least three weeks in advance.

Submission Scripts#

To use the GCC resources associated with b1042, you must specify b1042 as the Slurm account and indicate a GCC partition as sbatch directives in your submission script:

#SBATCH -A b1042
#SBATCH -p genomics # or other appropriate genomics partition

You can use absolute file paths to point to data held in other allocations’ projects directories and to direct your job output either to /projects/b1042 or elsewhere on Quest as needed regardless of whether you are submitting your job to a genomics or General Access partition.

To learn more about creating job submission scripts and submitting jobs, see the guide to using Slurm.

To see example submission scripts for many different software packages on Quest, please see our Example Job GitHub Repository.
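As a sketch of a complete submission script, the example below combines these directives with illustrative resource requests; the job name, core count, wall time, memory, module, and input/output paths are placeholders to adapt to your own work:

#!/bin/bash
#SBATCH -A b1042                     # GCC Slurm account
#SBATCH -p genomics                  # or other appropriate genomics partition
#SBATCH -N 1                         # number of nodes
#SBATCH -n 4                         # number of cores
#SBATCH -t 04:00:00                  # wall time (hh:mm:ss)
#SBATCH --mem=16G                    # CPU memory for the job
#SBATCH --job-name=fastqc_example    # illustrative job name

module purge
module load fastqc                   # example module; use module spider to find available versions

# Illustrative input and output paths; replace with your own directories
fastqc --threads 4 --outdir /projects/b1042/examplefolder /projects/b1042/examplefolder/sample_R1.fastq.gz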

Using GCC GPUs#

There are 4 GPU nodes in the Genomics Compute Cluster. These nodes have driver version 525.105.17, which is compatible with CUDA 12.0 or earlier:

  • 2 nodes which each have 4 x 40GB Tesla A100 PCIe GPU cards, 52 CPU cores, and 192 GB of CPU RAM

  • 2 nodes which each have 4 x 80GB Tesla A100 PCIe GPU cards, 64 CPU cores, and 512 GB of CPU RAM

The maximum run time is 48 hours for a job on these nodes. Feinberg members of the Genomics Compute Cluster should use the partition genomics-gpu, while non-Feinberg members should use genomicsguest-gpu. To submit a job to these GPUs, include the appropriate partition name and specify the type and number of GPUs:

#SBATCH -A b1042
#SBATCH -p genomics-gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

Note that the memory you request here is CPU memory. You are automatically given access to the entire memory of the GPU, but you will also need CPU memory because you will be copying data between CPU memory and the GPU.

Other Resources for Genomic Workflows#

Software Provided As Modules#

Software is provided as modules on Quest that users can load and unload as needed. This includes many of the software tools used for genomics research. Helpful commands for interacting with the module system are as follows:

module spider blast        # searches for modules that include 'blast' in their name
module load blast/2.16.0   # loads NCBI Blast version 2.16.0
module list                # lists loaded modules
module unload blast/2.16.0 # unloads NCBI Blast version 2.16.0
module purge               # unloads all modules

You can also search the provided modules on the Quest Software Page on the IT website.

Access to Reference Genomes#

Commonly used reference genomes, indexes, and annotations are provided in the directory /projects/genomicsshare so that each user does not have to download their own copy. The AWS iGenomes are provided within /projects/genomicsshare/AWS_iGenomes and organized by organism, followed by annotation type.

NCBI Blast databases are also provided within /projects/genomicsshare/blast.<date>, where <date> is the date that version of the database was downloaded, in YYYYMMDD format.
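To browse what is available, you can list these directories directly (output omitted here; the grep simply filters the directory listing for the Blast databases described above):

$ ls /projects/genomicsshare/AWS_iGenomes
$ ls /projects/genomicsshare | grep blast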

If there is a resource you would like to use that you don’t see available, please let us know by emailing quest-help@northwestern.edu.

nu-genomics Profile for Nextflow#

Nextflow is a workflow management language that is available to Quest users via the module system. Many community-developed and validated pipelines for common genomic analyses are written in Nextflow and available through nf-core. Anyone who is a member of b1042 can run these pipelines on Quest using the GCC resources with the nu-genomics profile.
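As a sketch, launching an nf-core pipeline with this profile might look like the following; the pipeline name (nf-core/rnaseq), module version, sample sheet, and output directory are illustrative placeholders, and large pipeline runs should be submitted as jobs rather than run on a login node:

$ module load nextflow
$ nextflow run nf-core/rnaseq -profile nu-genomics --input samplesheet.csv --outdir /projects/b1042/examplefolder/results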

NCBI SRA Toolkit (fasterq-dump example)#

Users can retrieve FASTQ files from the SRA using the fasterq-dump command from the SRA Toolkit.

First, load the required module:

$ module load sratoolkit/3.0.0

Then run the command:

$ fasterq-dump --threads n --progress SRR_ID

In this command:

  • --threads specifies the number of processors/threads to be used.

  • --progress is an optional flag that displays a progress bar.

  • SRR_ID is the SRA run ID, starting with “SRR” and followed by digits.

We recommend that large downloads be done within a job script and not from a login node, as in the sketch below.
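As a sketch under that recommendation, a download could be wrapped in a submission script like the following; the partition, resource requests, output directory, and accession (SRR000001) are placeholders to adapt to your own work:

#!/bin/bash
#SBATCH -A b1042
#SBATCH -p genomics                  # or other appropriate genomics partition
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 02:00:00
#SBATCH --mem=8G

module load sratoolkit/3.0.0

# Download FASTQ files for an example accession into scratch space
cd /projects/b1042/examplefolder
fasterq-dump --threads 4 --progress SRR000001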