Genomics Compute Cluster Quick Start#
Welcome to the Genomics Compute Cluster (GCC) at Northwestern University. This guide is designed to help new GCC users get started with these resources.
Need more information than this Quick Start provides?
Check out our full GCC User Guide for detailed documentation on using these resources.
The GCC is part of Quest, Northwestern’s high-performance computing cluster, so new users will also benefit from the general Quest introductory content provided here, as well as this video series.
Accessing the GCC#
To use the GCC, users must be members of the b1042 allocation on Quest. This allocation grants access to compute partitions and shared scratch storage located at /projects/b1042. It is recommended that users also belong to a General Access or additional Priority Access allocation for broader functionality.
If you are not already a member, you can apply using this form. Applications are processed after you attend a GCC Orientation Session.
SSH via Terminal#
You can connect via ssh from the command line using any terminal application:
ssh <netid>@login.quest.northwestern.edu
# replace <netid> with your NetID
# enter your NetID password and hit enter when prompted
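Optionally, if you connect often, here is a minimal sketch of a ~/.ssh/config entry that shortens the command; the Host alias quest is just an example name:
# add to ~/.ssh/config so that "ssh quest" connects to the login node
Host quest
    HostName login.quest.northwestern.edu
    User <netid>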
OnDemand Browser Service#
For those who prefer graphical interfaces, Quest OnDemand provides a web-based interface accessible at ondemand.quest.northwestern.edu. This requires users to be connected to eduroam on campus or to use the GlobalProtect VPN off campus. More information about this service is available on the Quest OnDemand documentation page.
Using GCC Scratch Storage#
You are welcome to make your own folder within b1042 with the command mkdir, or to use a folder associated with your lab group.
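For example, a minimal sketch, where the folder name is a placeholder you would replace with your lab or NetID:
# create and enter a personal folder in GCC scratch (folder name is a placeholder)
mkdir -p /projects/b1042/<your_folder>
cd /projects/b1042/<your_folder>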
All storage space within /projects/b1042 and its subdirectories is scratch, meaning temporary storage from which unused files are regularly deleted. The deletion process for b1042 runs monthly. Ten days before the end of the month, we audit /projects/b1042 for files that have not been accessed or modified within the current month and email the owners of files that meet these criteria for expiry and deletion. We send those users a reminder email 2 days before the end of the month, and then, on the first of every month, all files that have not been accessed or modified in the past 31 days are deleted.
File access and modification dates can be checked using stat:
stat filename.txt   # show access/modify/change times for a single file
stat *              # show times for everything in the current directory
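If you want a rough preview of what might expire, here is a minimal sketch using find; the folder path is a placeholder, and the actual monthly audit may differ from this approximation:
# files neither accessed nor modified in the last 31 days (path is a placeholder)
find /projects/b1042/<your_folder> -type f -atime +31 -mtime +31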
For data transfer, Globus is highly recommended. Other options like scp and wget can also be used.
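For example, a minimal scp sketch for copying a local file to your scratch folder; the file name and destination folder are placeholders:
# copy a local file to a b1042 folder through the Quest login node (paths are examples)
scp data.fastq.gz <netid>@login.quest.northwestern.edu:/projects/b1042/<your_folder>/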
Using GCC Compute Resources#
Resources can be scheduled for either batch or interactive sessions. Batch jobs require a script and are submitted with sbatch. Here is an example of a batch job script:
#!/bin/bash
#SBATCH --account=b1042
#SBATCH --partition=genomics
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --mem=20G
#SBATCH --job-name=sample_job
# clear any loaded modules, then load the software needed for this job
module purge all
module load trinity/2.15.1
# assemble a transcriptome from paired-end reads with Trinity
Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G
To run this job, you could execute something like the following:
$ sbatch trinity_script.sh
The command above assumes the contents of the example job script are saved in a file called trinity_script.sh; the .sh extension is standard for bash scripts, and trinity refers to the software used in this example. It also assumes you run the command from the folder in which the file is saved.
For more information on writing and submitting job scripts, please see the Slurm section of the Quest User Guide.
Interactive jobs are launched with srun or salloc. Here is an example command that would launch an interactive job on a GPU node for 10 minutes with 1GB of RAM:
$ srun -N 1 -n 1 --account=b1042 --mem=1G \
     --partition=genomics-gpu --gres=gpu:a100:1 \
     --time=00:10:00 --pty bash -l
In both the batch and interactive examples, setting the Slurm account to b1042 and the partition to one of the GCC partitions is what specifies that the job will use the compute hardware set aside for the GCC. Because different resources are available and research needs differ, this hardware has been divided into several partitions.
How to Choose a b1042 Partition#
If you are affiliated with Feinberg or Weinberg, use the following:
- Default partition: genomics
- If your job takes more than 2 days: genomicslong
- If your job needs >240GB of RAM: genomics-himem
- If your job uses a GPU: genomics-gpu
If you are affiliated with any other school or institute, use the following:
- Default partition: genomicsguest
- If your job takes more than 2 days: genomicsguestex
- If your job needs >240GB RAM: genomicsguest-himem
- If your job uses a GPU: genomicsguest-gpu
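For example, here is a minimal sketch of the directives for a high-memory job on the Feinberg/Weinberg side; the 300G value is only an illustration of a request above the 240GB threshold:
# example: combine the b1042 account with the himem partition (300G is illustrative)
#SBATCH --account=b1042
#SBATCH --partition=genomics-himem
#SBATCH --mem=300G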
How to Determine Resource Needs#
If you don’t know the resource requirements of your job, start by requesting 4 hours, 1 node, and 3 GB of RAM per CPU from the genomics partition. This is short enough that you can see within the same day whether your job completed, and request more time if needed. The job status will be TIMEOUT if the job runs out of time and OUT_OF_MEMORY if it runs out of RAM. If you are unsure whether your job can use multiple CPUs, you could start with 2 and check the efficiency, or use 1 and see whether the job finishes in a reasonable amount of time without parallelization. Once you have a completed job, you can check its resource usage and adjust as needed. Please run one test job to check resource usage before submitting jobs for all your samples.
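As a sketch, those starting values expressed as Slurm directives (add your own job name and commands):
# suggested starting request: 4 hours, 1 node, 1 CPU, 3 GB of RAM per CPU
#SBATCH --account=b1042
#SBATCH --partition=genomics
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=3G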
Slurm commands to check on job history and efficiency:
sacct -u <netid> -X -S <startdate>   # job history for a user since a given start date
seff <jobID>                         # efficiency metrics for a provided job
checkjob <jobID>                     # job details and efficiency for a provided job
squeue -u <netid>                    # running/pending jobs for a given user
Software on Quest#
Quest uses a module system to manage software environments. Users can:
module spider                   # list available modules
module spider <software>        # search for specific software
module load software/version    # load a module
module list                     # list loaded modules
Users who wish to manage their own software may use virtual environments, Docker/Singularity containers, or self-compiled binaries, but we suggest checking whether software is available from our module system before installing it yourself. You can request that new software, or newer versions of existing software, be added to Quest via the software support portal.
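As one illustration, here is a minimal sketch of a Python virtual environment built on top of a Python module; the module version is a placeholder, so check module spider python for the versions actually installed:
# load a Python module first (version is a placeholder)
module load python/<version>
# create and activate a virtual environment, then install packages into it
python -m venv ~/envs/my_env
source ~/envs/my_env/bin/activate
pip install numpy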
Genomic Resources on Quest#
Reference genomes are available to all Quest users in the iGenomes folder. These reference files can be useful for alignment, quantification, and other genomics workflows.
- Path to iGenomes: /projects/genomicsshare/AWS_iGenomes
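A quick way to see what is available; the reference path in the second example is hypothetical, so browse the folder to confirm the exact layout:
# list the available reference genomes
ls /projects/genomicsshare/AWS_iGenomes
# hypothetical example of pointing a job script at a reference FASTA
REF=/projects/genomicsshare/AWS_iGenomes/<organism>/<source>/<build>/Sequence/WholeGenomeFasta/genome.fa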
Getting Help#
For technical questions or issues, users can contact the GCC support team at quest-help@northwestern.edu. Additionally, users are encouraged to join community platforms such as genomics-rcs.slack.com to share knowledge and collaborate with other researchers.
We welcome you to the Genomics Compute Cluster and look forward to supporting your research!
