Slurm Job Scheduler#
Overview#
Slurm is the software on Quest that manages and allocates requests for compute resources. Users submit computational “jobs” to Quest requesting the CPUs, nodes, memory, and other resources they need. Slurm manages these requests and allocates resources to jobs as they are available.
Jobs that run without user supervision are called batch jobs. Jobs where the user will actively use an application with the requested compute resources are called interactive jobs.
Slurm decides which jobs to run when based on the requested resources and your Fairshare score, which takes into account the priority of your Slurm account and your past usage of Quest.
Tip
A few common misconceptions about batch jobs:
- We don’t actually aim for 100% efficiency. Job resources can fluctuate and we don’t want the job to run out of memory. We often aim for 75% efficiency. 
- More memory doesn’t always mean a faster job. There are many factors that influence how long a job runs for. Our team is happy to help if you have any questions. 
The Job Submission Script#
Slurm requires users to write a submission script to run a batch job. It is a Bash script that specifies what resources the job needs to run, how to handle output and errors, and what commands to run as part of the job.
Example Submission Script#
Create a .sh file for the submission script.  The file will have #SBATCH statements at the top that give Slurm the information it needs to schedule and run the job.  After that, you can enter commands to load data, run code files, or do the other work of the job.
In all examples, <> denotes a value that you need to fill in.  Fill in these values and remove the <>.
Example: jobscript.sh
#!/bin/bash
#SBATCH --account=<account>  ## Required: your Slurm account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=<partition> ## Required: buyin, short, normal, long, gengpu, genhimem, etc.
#SBATCH --time=<HH:MM:SS>       ## Required: How long will the job need to run?  Limits vary by partition
#SBATCH --nodes=<#>             ## How many computers/nodes do you need? Usually 1
#SBATCH --ntasks=<#>            ## How many CPUs or processors do you need? (default value 1)
#SBATCH --mem=<#G>              ## How much RAM do you need per computer/node? G = gigabytes
#SBATCH --job-name=<name>       ## Used to identify the job 
# load any modules needed
module load mamba/24.3.0
# set or change your working directory if needed
cd ~/myscripts
# run any commands or code files
date
python --version
python -c "print('hello')"
The first line of the script, #!/bin/bash, loads the Bash shell and is required. Only the lines that begin with #SBATCH are interpreted by Slurm at the time of job submission. Normally in Bash, # is a comment character, meaning anything written after a # is ignored by the Bash interpreter. When writing a submission script, however, the Slurm interpreter recognizes #SBATCH as a command.  Any words following ## on the #SBATCH lines are treated as comments and ignored by the Slurm interpreter.
Once Slurm places the job on a compute node, the remainder of the script (everything after the last #SBATCH line) is run. After the Slurm commands, the rest of the script works like a regular Bash script. You can modify environment variables, load modules, change directories, and execute program commands. Lines in the second half of the script that start with # are comments, such as # load any modules needed in the example above.
Example values for #SBATCH options:
#SBATCH --account=p00000      ## use your Slurm account name
#SBATCH --partition=short
#SBATCH --time=01:00:00       ## one hour
#SBATCH --nodes=1             ## 1 node
#SBATCH --ntasks=1            ## 1 processor
#SBATCH --mem=2G              ## 2 GB of RAM
#SBATCH --job-name=sample_job  
More information on how to choose the values of the Slurm options is in Slurm Configuration Settings.
Note
The compute and memory resources you request affect your Fairshare score. Request only what you need for your job.
Tip
There are additional example job submission scripts in the RCDS Example Job Repository on GitHub .
Submitting A Batch Job#
After you have written and saved your submission script, you can submit your job. At the command line type
$ sbatch <name_of_script>
where, in the example above, <name_of_script> would be jobscript.sh.  If your submission script is not in your current working directory, either change to that directory or specify the path to the submission script as part of the command.
Upon submission the scheduler will return your job number:
Submitted batch job 549005
If you have a workflow that accepts or needs the jobid as an input for job monitoring or job dependencies, then you may prefer the return value of your job submission be just the job number. To do this, pass the --parsable argument:
$ sbatch --parsable <name_of_script>
549005
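If you capture the --parsable output in a shell variable, you can reuse the job ID later, for example when building the dependent jobs described later on this page (jobscript.sh is the example script from above; the second script name is a placeholder):
jobid=$(sbatch --parsable jobscript.sh)
echo "Submitted batch job ${jobid}"
sbatch --dependency=afterok:${jobid} <name_of_next_script>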
If there is an error in your job submission script, the job will not be accepted by the scheduler and you will receive an error message right away, for example:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
If your job submission receives an error, you will need to address the issue and resubmit your job. If no error is received, your job has entered the queue and will start when resources are available. See Slurm Commands and Job Management to learn how to get information about your job.
Partitions#
All jobs must specify a partition. Partitions determine which compute nodes the job can be scheduled on.
General Access Partitions#
Partitions that run on regular compute nodes without special resources.
| Partition | Minimum Wall Time | Maximum Wall Time | Notes | 
|---|---|---|---|
| short | 00:00:00 | 04:00:00 | For jobs that will run in 4 hours or less. The short partition has access to most compute nodes on Quest. | 
| normal | 04:00:00 | 48:00:00 | For jobs that will run between 4 hours and 2 days. The normal partition has access to more compute nodes than the long partition but fewer than the short partition. | 
| long | 48:00:00 | 168:00:00 | The long partition is for jobs that will run between 2 days and 7 days. The long partition has access to fewer compute nodes than the short and normal partitions. | 
Jobs scheduled on the short partition typically start the soonest due to the greater number of compute nodes available and the shorter time required.  Consider splitting up jobs to run in less than 4 hours when possible.
If you have a General Access allocation and need to run jobs longer than one week, contact quest-help@northwestern.edu for a consultation. Some special accommodations can be made for jobs requiring the resources of up to a single node for a month or less.
Partitions that run on specialty compute nodes with GPU or high-memory resources.
| Partition | Maximum Wall Time | Notes | 
|---|---|---|
| gengpu | 48:00:00 | Only for jobs requiring GPUs. In addition to entering gengpu as the partition in the submission script, the script must also request the GPUs themselves (see Request GPUs under Additional Options and the GPU page). | 
| genhimem | 48:00:00 | Only for jobs requiring more than 473 GB memory per node. This partition has access to a 52-core node with 1480 GB of schedulable memory. | 
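As a sketch only (the GPU type and count shown with --gres are placeholders; see the GPU page for the cards available on Quest), the resource-request portion of a gengpu submission script might look like:
#SBATCH --account=<account>
#SBATCH --partition=gengpu
#SBATCH --gres=gpu:<type>:<count>   ## request GPUs; see the GPU page for valid types
#SBATCH --time=01:00:00
#SBATCH --mem=16G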
Priority Access Partitions#
Priority Access partitions are available to users who are part of allocations containing purchased compute resources . The resources available and any limits on jobs are governed by the specific policies of the Priority Access allocation.
In addition to setting the -A/--account value to the Slurm account name, set the partition to either the Slurm account name or “buyin”.
Example:
#SBATCH -A b1234
#SBATCH -p b1234
or
#SBATCH -A b1234
#SBATCH -p buyin
The wall time limits vary by partition. Priority resource allocations have different wall time limits as well.
Some Slurm accounts, such as the GCC b1042 account, have additional partition names. If your account has account-specific partitions, use those partition names instead of the account name or “buyin”.
Warning
Priority Access allocations cannot use the General Access partitions.
Slurm Configuration Settings#
There are many options and settings available when submitting a job beyond the #SBATCH options shown in the example submission script.  Common options, their possible values, and considerations for choosing appropriate values are listed here.
Tip
Slurm offers two methods, one short (such as -A) and one long/verbose (--account), to indicate most settings. For the long option name, an = sign is required after the flag (e.g., --account=p00000).
Account
The -A/--account option is required in submission scripts.
#SBATCH --account=<slurm_account>
To submit jobs to Quest, you must be part of an active General Access or Priority Access allocation. To use a Priority Access allocation to submit jobs, it must include compute resources; storage only allocations cannot be used to submit jobs. To determine the names of the allocations that you are part of on Quest, run the groups command.
To check if an allocation is active and has access to compute resources, run checkproject <account-name>.  See Checking Allocation Resources for details.
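For example, to list your allocations and then check one of them (p12345 is a placeholder account name):
$ groups                 ## lists the allocations (Slurm accounts) you belong to
$ checkproject p12345    ## shows whether the allocation is active and its compute resources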
Quest Partitions
The -p/--partition option is required in submission scripts.  Omitting it will result in the error: sbatch: error: Batch job submission failed: No partition specified or system default partition.
#SBATCH --partition=<partition>
Choose the partition based on the resources and time required for the job.
Time
The -t/--time option is required in submission scripts.  Specify the time in terms of hours (HH), minutes (MM), and seconds (SS).
#SBATCH --time=<HH:MM:SS>
The time is referred to as “wall time,” as in the amount of time that passes on the clock on the wall, not computing cycles.
There are two important considerations when selecting the time: the partition that you choose and how long your job is expected to run. Although the partition controls the maximum time that can be requested, do not simply request the maximum allowable time for the partition unless it is truly needed. Jobs requesting longer times can take longer to start running.
Specifying too short of a time can result in the job failing, as there is no way to extend the time of a running job. If you have multiple similar jobs to run, the best practice is to submit a single, representative job to estimate the time required.
Different partitions have different time limits; requesting a time outside of the allowed range for a partition will result in an error.
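One way to estimate wall time is to check how long a representative test job actually ran; for example, with sacct the Elapsed field shows the run time of a finished job:
$ sacct -X -j <jobid> --format=JobID,JobName,Elapsed,State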
Number of Nodes
The -N/--nodes option specifies how many nodes (computers) are needed.  This is an optional but strongly recommended setting. It should be set to 1 unless your code is specifically designed to use multinode protocols such as MPI.
#SBATCH --nodes=<number_of_nodes>
Warning
The vast majority of software can only run on a single node and cannot run across multiple nodes. If --nodes is not set, but --ntasks (number of cores) is greater than 1, Slurm may match the job with cores on different nodes.  Only applications specifically designed to use multiple nodes can successfully use cores from different nodes.
Number of Cores
The -n/--ntasks option specifies how many cores are needed. Only request more than 1 core if your application can make use of them through parallelization. There are two predominant types of parallelization and depending on which method your application uses, you will either request cores with the -N/--nodes option or request cores without the -N/--nodes option.
Applications using shared memory parallelization (OpenMP, R’s doParallel, Python’s multiprocessing, MATLAB local parpool, etc.) can only utilize CPUs within a single node/computer and CPUs allocated across multiple computers will go unused. In this situation, --nodes=1 must be set along with --ntasks.
#SBATCH --nodes=1
#SBATCH --ntasks=<number_of_cores>
Applications using Message Passing Interface (MPI) can utilize cores (CPUs) allocated across nodes/computers. In this situation, -n/--ntasks should be used without setting the -N/--nodes option.
#SBATCH --ntasks=<number_of_cores>
Warning
--ntasks can allocate cores on different nodes.  Only applications using Message Passing Interface (MPI) can use cores and share memory across multiple nodes.
Before requesting a given number of cores on a single node, please consider how many cores are available on each of the different generations/families of compute nodes that make up Quest. For example, quest12 nodes have 64 cores per node, while quest13 nodes have 128 cores per node.  Requesting a large number of cores per node can limit the number of compute nodes on which the job can be scheduled.
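If your program takes the number of workers as an argument, the SLURM_NTASKS environment variable (set by Slurm when the job starts) can be used inside the submission script so the program never tries to use more cores than were allocated. The script name and its --workers option below are hypothetical:
#SBATCH --nodes=1
#SBATCH --ntasks=8
python my_parallel_script.py --workers "$SLURM_NTASKS"   ## hypothetical script and option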
Memory (RAM)
There are two methods for specifying the amount of memory/RAM:
- --mem specifies how much memory per node
- --mem-per-cpu specifies how much memory per core; use this option if specifying --ntasks
In both cases, specify memory with a number followed by G for gigabytes or M for megabytes.
Example:
#SBATCH --mem-per-cpu=3G
If your job submission script does not specify how much memory your job requires, then the default setting is 3110 MB of memory per core. For a job specifying 10 cores and not specifying the memory requirements, Slurm will allocate 31100 MB (~30.3 GB) in total.
The memory that is allocated to a job via this setting creates a hard upper limit; an application cannot access memory beyond what Slurm reserves. Jobs that try to access more memory than allocated will be terminated. To determine the amount of memory needed, run a test job with higher memory limits, and then set your memory requirements to approximately 110% of the memory used by the test job to account for variation across jobs.
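For example (hypothetical numbers), if seff reports that a test job used 54 GB of memory, requesting roughly 110% of that leaves headroom for variation across similar jobs:
#SBATCH --mem=60G   ## ~110% of the 54 GB used by the test job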
There is a special setting to request the entire memory of the computer.
#SBATCH --mem=0
How much memory this ends up being will depend on what generation of compute node the job is assigned to.
If a job requests more memory than is available on any of the compute nodes available in that partition, then the job will be rejected with the following message:
srun: error: Unable to allocate resources: Requested node configuration is not available
Jobs requesting high amounts of memory may only be able to run on more recent generations of compute node, which may mean it takes longer for the required resources for the job to become available.
Standard Output/Error
There are two output streams from a job: standard output and standard error. These output streams can be written to separate files or the same file.
To send the output to separate files, use:
#SBATCH --output=<name of file>
#SBATCH --error=<name of file>
To send both the standard output  and standard error to a single file, use only -o/--output and omit --error:
#SBATCH --output=<name of file>
The files will be created in the directory from which you submit the job. To direct the output to a different, already created, directory, include the path as part of the filename. A filename must be specified; a directory alone will not work.
If you include neither --output nor --error, Slurm will write both the standard output and standard error from your job in a file called
slurm-<slurm jobid>.out
where <slurm jobid> is the ID given to your job by Slurm. You can replicate this default naming scheme yourself by providing the following option:
#SBATCH --output=slurm-%j.out
In addition to %j, which will add the job id to the name of the output file, there is also %x which will add the job name to the name of the output file.
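For example, combining the two patterns names the log file after both the job name and the job ID, producing a file such as sample_job-549005.out:
#SBATCH --job-name=sample_job
#SBATCH --output=%x-%j.out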
Job Name
The -J/--job-name option assigns a name to the job to help manage and identify it.
#SBATCH --job-name=<jobname>
If not specified, the name of the job submission file will be used as the job name.
Note that filename patterns such as %j are not expanded in the job name itself; the job ID is available separately through monitoring commands and the $SLURM_JOB_ID environment variable, and the job name can be referenced with %x when naming output files.
Email Notifications
To receive emails regarding the status of your Slurm jobs, include both the --mail-type option and the --mail-user option in the job submission script:
#SBATCH --mail-type=<job state that triggers email> ## BEGIN, END, FAIL, or ALL
#SBATCH --mail-user=<email address>
Both options must be specified to receive emails from Slurm. Any combination of BEGIN, END, and FAIL can be used; for example: --mail-type=END,FAIL.
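For example, these two lines together will send email when the job ends or fails (replace the address placeholder with your own):
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<your email address>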
Compute Node Generation
To specify a specific generation of compute nodes for your job, include a -C/--constraint option in the job submission script:
#SBATCH --constraint=<name of compute node generation>
Compute node generations on Quest have names like quest13; see generations of compute nodes for the current options.
If the job is requesting multiple nodes, and you would like all the compute nodes to be of the same generation and not a combination of generations, then you can specify this with a command such as:
#SBATCH --constraint="[quest10|quest11|quest12|quest13]"
This would allow all of the nodes to be of any of the four specified generations, but require all of them to be of the same generation. This is recommended for jobs that are parallelized using MPI.
--constraint is also used to specify specific types of GPU cards. See the GPU page for more information.
Additional Options
| Option | Slurm (sbatch) | Description | 
|---|---|---|
| Request GPUs | --gres=<name>:<number> | The "name" field will always be gpu, for example --gres=gpu:1. See the GPU page for how to request specific GPU types. | 
| Job array | --array=<indexes> | Submit a job array, a type of submission which will launch multiple jobs to be executed with identical parameters. The indexes specification identifies what array index values should be used. Multiple values may be specified using a comma-separated list and/or a range of values with a "-" separator. For example, --array=0-9 or --array=1,3,5-7. | 
| Copy environment | --export=ALL | Optional: the default is to export ALL environmental settings from the submission environment to the runtime environment. | 
| Copy environment variable | --export=<variable(s)> | Example: --export=VAR1=value1,VAR2=value2 | 
| Job dependency | --dependency=after:<jobid> | After the specified jobs start or are cancelled. Other dependency types such as afterok:<jobid> (after the specified jobs complete successfully) and afterany:<jobid> (after the specified jobs terminate) are also available. | 
| Defer job until the specified time | --begin=<time> | Submit the batch script to the Slurm controller immediately, like normal, but tell the controller to defer the scheduling of the job until the specified time. | 
| Node exclusive job | --exclusive | The job is allocated all CPUs and GRES on all requested nodes, but is only allocated as much memory as it requested. To request all the memory on the allocated nodes as well, use --mem=0. | 
| Request a specific set of compute nodes | --nodelist=<node names> | Instead of specifying how many nodes you want, you can request a specific list of hosts. The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements. | 
Environmental Variables Set by Slurm
Multiple variables are set by Slurm and are accessible in the environment of a job after the job has started running.
| Info | Variable Name | 
|---|---|
| Job name | SLURM_JOB_NAME | 
| Job ID | SLURM_JOB_ID | 
| Submission directory | SLURM_SUBMIT_DIR | 
| Node list | SLURM_JOB_NODELIST | 
| Job array index | SLURM_ARRAY_TASK_ID | 
| Partition name | SLURM_JOB_PARTITION | 
| Number of nodes allocated | SLURM_JOB_NUM_NODES | 
| Number of processes | SLURM_NTASKS | 
| Number of processes per node | SLURM_TASKS_PER_NODE | 
| Requested tasks per node | SLURM_NTASKS_PER_NODE | 
| Requested CPUs per task | SLURM_CPUS_PER_TASK | 
| Scheduling priority | SLURM_PRIO_PROCESS | 
| Job user | SLURM_JOB_USER | 
| Login node from which the job was submitted | SLURM_SUBMIT_HOST | 
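These variables can be used in the body of a submission script, for example to record where and with what resources a job ran:
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on ${SLURM_JOB_NODELIST}"
echo "Allocated ${SLURM_NTASKS} task(s) on ${SLURM_JOB_NUM_NODES} node(s)"
cd "${SLURM_SUBMIT_DIR}"   ## start from the directory the job was submitted from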
Slurm Schedulable Resources#
The Quest Storage and Compute Resources page details the raw resources available on Quest. In practice, not all of these resources are available for user jobs. Memory on each compute node is dedicated for use by the Quest file system and Operating System to improve the stability and performance of Quest. The amount of schedulable memory is dependent on the architecture/generation of the node. A table is provided below which summarizes the amount of schedulable resources per generation and the Slurm constraint options associated with them.
| Node Family Name | Number of Schedulable CPUs | Amount of Schedulable Memory/RAM | Constraints | 
|---|---|---|---|
| quest10 | 52 | 166 GB | quest10 | 
| quest10 GPU | 52 | 166 GB | quest10 | 
| quest11 | 64 | 221 GB | quest11 | 
| quest12 | 64 | 221 GB | quest12 | 
| quest12 GPU | 64 | 473 GB | quest12 | 
| quest13 | 128 | 473 GB | quest13 | 
| quest13 GPU | 64 | 976 GB | quest13 | 
Slurm Commands and Job Management#
Once a job has been submitted, additional Slurm commands are available to monitor and manage the job.
Common Slurm Commands
| Action | Slurm Command | 
|---|---|
| Delete a job | scancel <jobid> | 
| Job status (by job) | squeue -j <jobid> | 
| Job status (by user) | squeue -u <netid> | 
| Job status (detailed) | checkjob <jobid> | 
| Show expected start time | squeue --start -j <jobid> | 
| Queue list / info | squeue | 
| Hold a job | scontrol hold <jobid> | 
| Release a job | scontrol release <jobid> | 
| Monitor or review a job's resource usage | seff <jobid> | 
| View job batch script | sacct -j <jobid> -B | 
Further details and options for these commands are in the sections below.
List Current Jobs with squeue
The squeue command can be used to display information about current jobs on Quest.  squeue alone will show all jobs across all users.  Use the -u option to limit the output to your NetID.
| Command | Description | 
|---|---|
| squeue -u <netid> | Show only jobs belonging to the user specified | 
| squeue -A <account> | Show only jobs belonging to the Slurm account specified | 
| squeue -j <jobid> | Display the status of the specified job | 
| squeue -u <netid> -t RUNNING | Show running jobs for the specified user | 
| squeue -u <netid> -t PENDING | Show pending jobs for the specified user | 
| squeue --help | See documentation and additional options | 
List Current and Past Jobs with sacct
The sacct command can be used to display information about your past and current jobs on Quest.  By default, sacct will only display information about jobs from today. Unlike the squeue command, there is no need to supply -u <netid> to sacct, as it will do this by default. The default output includes just a few fields: job ID, job name, partition, account, allocated CPUs (AllocCPUS), state, and exit code.
Example:
$ sacct -X 
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1453894            bash      short      p1234          1  COMPLETED      0:0
1454434      sample_job      short      p1234         52     FAILED      6:0
We strongly recommend including the -X flag when using sacct. This flag will suppress the output to only show statistics relevant to the job itself and will exclude the individual job steps which are not relevant for the vast majority of situations.
The lone exception to this is when you would like to display information related to the GPU utilization of the job. This information is contained in the "batch" step of the job, and can be displayed for a given job via the following command:
$ sacct -j <slurm-jobid>.batch --format=jobid,tresusagein%120
For example,
$ sacct -j 3547592.batch --format=jobid,tresusagein%120
JobID                                                                                                                  TRESUsageInAve 
------------ ------------------------------------------------------------------------------------------------------------------------ 
3547592.bat+          cpu=00:04:56,energy=0,fs/disk=13439086206,gres/gpumem=6218M,gres/gpuutil=95,mem=5825476K,pages=0,vmem=21553268K 
The state codes are available in the Slurm documentation.  The exit codes do not have standard meanings across all jobs, but they can be used in troubleshooting.  A normal exit code with no errors is 0:0.
To include additional information in the output of sacct, add the --format option with a comma-separated list of values for the additional fields. The values are not case-sensitive.
$ sacct --format=var_1,var_2, ... ,var_N
Example:
$ sacct -X --format=jobid,priority
See the Slurm documentation for the full list of additional fields that can be included.
To retrieve the submission script used by a job, use:
$ sacct -j <job_num> -B
To display jobs from a shorter or longer time period than the default (today), use the --starttime and/or --endtime options:
$ sacct -X --starttime=03/14/25 --format=jobname,nnodes,ncpus,elapsed
Job records are kept for about a year.
Job Resource Use with seff
To see the resources used by your job, including the maximum amount of memory, run the command:
$ seff <job_id>
This returns output similar to:
Job ID: 767731
    Cluster: quest
    User/Group: abc123/abc123
    State: COMPLETED (exit code 0)
    Cores: 1
    CPU Utilized: 00:10:00
    CPU Efficiency: 100.00% of 00:10:00 core-walltime
    Job Wall-clock time: 00:10:00
    Memory Utilized: 60.00 GB
    Memory Efficiency: 50.00% of 120.00 GB
Check the job state reported in the 4th line. If it is “COMPLETED (exit code 0)”, look at the last two lines. “Memory Utilized” is the amount of memory your job used, in this case 60 GB.
If the job State is FAILED or CANCELLED, the Memory Efficiency percentage reported by seff will be extremely inaccurate; seff only gives reliable results for jobs that have COMPLETED successfully.
Detailed Job Information with checkjob
The checkjob command displays detailed information about a submitted job’s status and diagnostic information that can be useful for troubleshooting submission issues. It can also be used to obtain useful information about completed jobs such as the allocated nodes, resources used, and exit codes.
$ checkjob <jobid>
where you can get the <jobid> using the squeue command.
Example for a Successfully Running Job
$ checkjob 548867
--------------------------------------------------------------------------------------------------------------------
JOB INFORMATION
--------------------------------------------------------------------------------------------------------------------
JobId=548867 JobName=high-throughput-cpu_000094
    UserId=abc123(123123) GroupId=abc123(123) MCS_label=N/A
    Priority=1315 Nice=0 Account=p12345 QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:13:13 TimeLimit=00:40:00 TimeMin=N/A
    SubmitTime=2019-01-22T12:51:42 EligibleTime=2019-01-22T12:51:43
    AccrueTime=2019-01-22T12:51:43
    StartTime=2019-01-22T15:52:20 EndTime=2019-01-22T16:32:20 Deadline=N/A
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    LastSchedEval=2019-01-22T15:52:20
    Partition=short AllocNode:Sid=quser21:15454
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=qnode[5056-5060]
    BatchHost=qnode5056
    NumNodes=5 NumCPUs=120 NumTasks=120 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=120,mem=360G,node=5,billing=780
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
    MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=(null)
    WorkDir=/projects/p12345/high-throughput
    StdErr=/projects/p12345/high-throughput/lammps.error
    StdIn=/dev/null
    StdOut=/projects/p12345/high-throughput/lammps.output
    Power=
--------------------------------------------------------------------------------------------------------------------
JOB SCRIPT
--------------------------------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --account=p12345
#SBATCH --partition=normal
#SBATCH --job-name=high-throughput-cpu
#SBATCH --ntasks=120
#SBATCH --mem-per-cpu=3G
#SBATCH --time=00:40:00
#SBATCH --error=lammps.error
#SBATCH --output=lammps.output
module purge
module load lammps/lammps-22Aug18
mpirun -n 120 lmp -in in.fcc
Note in the output above that:
- The JobState is listed as RUNNING. 
- The time passed since job start (RunTime) and the total walltime requested (TimeLimit) are listed. 
- The node name(s) are listed after NodeList. 
- The paths to job’s working directory (WorkDir), standard error (StdErr) and output (StdOut) files are given. 
- If a batch job script is used for submission, the script is presented at the end. 
Cancelling Jobs with scancel
You can cancel one or all of your jobs with scancel. Proceed with caution, as this cannot be undone, and you will not be prompted for confirmation after issuing the command.
| Command | Description | 
|---|---|
| scancel <jobid> | Cancel the job with the given job ID | 
| scancel -u <netid> | Cancel all the jobs of the user | 
Holding, Releasing, or Modifying Jobs with scontrol
Users can place their jobs in a “JobHeldUser” state while submitting the job or after the job has been queued. Running jobs cannot be placed on hold. Placing a job on hold means that the system will set its priority to 0 and not attempt to schedule it until the hold is removed.
| Command | Description | 
|---|---|
| #SBATCH -H | Place hold within the job script | 
| sbatch -H <job_script> | Place hold while submitting from the command line | 
| scontrol hold <jobid> | Place hold on a queued job from the command line | 
The job status will be shown in the output of monitoring commands such as squeue or checkjob.
To release a job from user hold state:
$ scontrol release <jobid>
The job control command (scontrol) can also be used for changing the parameters of a submitted job before it starts running. The following parameters can be modified safely:
- Job dependency (change to “none”) 
- Partition 
- Job name 
- Wall clock limit 
- Slurm Account 
Examples of using scontrol to change a job’s parameters:
| Command | Description | 
|---|---|
| scontrol update jobid=<jobid> dependency=afterok:1000 | Change the job to depend on the successful completion of job 1000 | 
| scontrol update jobid=<jobid> partition=short | Change the partition to short | 
| scontrol update jobid=<jobid> jobname=myjob | Change the job name to myjob | 
| scontrol update jobid=<jobid> timelimit=02:00:00 | Set the job time limit to 2 hours | 
| scontrol update jobid=<jobid> account=p12345 | Change the account to p12345 | 
For a complete listing of scontrol options, see the official scontrol documentation .
Probing Priority with sprio
Slurm implements a multi-factor priority scheme for ordering the queue of jobs waiting to be run. The sprio command is used to see the contribution of different factors to a pending job's scheduling priority.
The sprio command can be helpful to get an idea of where your jobs might be in the overall queue of pending jobs. However, we caution against using it as a tool to estimate specific wait times. High-performance computing systems are complex and have many different components that impact when jobs will run, which leads to variation in wait time. Similar to reservations at a restaurant, sprio can tell you that you're one of the next 10 groups to get a table, but it doesn't necessarily mean that you'll be seated in the next 5 minutes.
| Command | Description | 
|---|---|
| sprio -u <netid> | Show scheduling priority for all pending jobs for the user | 
| sprio -j <jobid> | Show scheduling priority of the specified job | 
For running jobs, you can see the starting priority using the checkjob <jobid> command.
Special Types of Job Submissions#
In this section, we provide details and examples of how to use Slurm to run:
- Interactive jobs where the user has access to the allocated compute resources to run commands interactively 
- Job arrays that submit multiple jobs using the same job submission script 
- Jobs that depend on other jobs completing first 
Interactive Jobs
This section explains how to start interactive jobs from the command line on the Quest login nodes.
Tip
Quest OnDemand provides a great alternative for launching interactive jobs through your web browser.
Jobs without GUIs
To launch an interactive job in order to run an application without a GUI, use either the srun or salloc command instead of sbatch.
srun
If you use srun to run an interactive job, Slurm will automatically launch a terminal session on the compute node after it schedules the job and the job starts.
Warning
When using srun, if you lose your connection to Quest, the interactive job will terminate.
Instead of writing a job submission script as you would for a batch job, for an interactive job you can specify the key options directly as part of the srun command.  The same options used with sbatch can be used with srun.
Example:
[quser41 ~]$ srun --nodes=1 --ntasks=1 --account=<account> --mem=<memory> \
> --partition=<partition> --time=<hh:mm:ss> --pty bash -l
srun: job 3201233 queued and waiting for resources
srun: job 3201233 has been allocated resources
----------------------------------------
srun job start: Mon Mar 14 13:25:41 CDT 2022
Job ID: 3201233
Username: abc123
Queue: short
Account: pXXXXX
----------------------------------------
The following variables are not
guaranteed to be the same in
prologue and the job run script
----------------------------------------
PATH (in prologue) : /usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/lpp/mmfs/bin:/opt/ibutils/bin
WORKDIR is: /home/<netid>
----------------------------------------
[qnode0114 ~]$
In the example above, the initial command has been split across two lines due to space constraints in the documentation.  The command can be run on a single line.  quser41 is a login node, while qnode0114 is a compute node.  --pty bash indicates to start a bash shell on the compute node. Adding the option -l to the bash command will start a login bash shell on the compute node. This means that the environment for the bash shell will be the same as the one you have on the login nodes.
You can stop your interactive session by entering the command exit.
[qnode0114 ~]$ exit
[quser41 ~]$
salloc
Unlike srun, salloc does not automatically launch a terminal session on the compute node. Instead, after it schedules your job, it will tell you the name of the compute node the job has been scheduled on. You can then run ssh qnodeXXXX to connect directly to the compute node. If you lose connection to an interactive session started with salloc, the interactive job will not terminate.
Example:
[quser41 ~]$ salloc --nodes=1 --ntasks=1 --account=<account> --mem=<memory> --partition=<partition> --time=<hh:mm:ss>
salloc: Pending job allocation 276305
salloc: job 276305 queued and waiting for resources
salloc: job 276305 has been allocated resources
salloc: Granted job allocation 276305
salloc: Waiting for resource configuration
salloc: Nodes qnode0114 are ready for job
[quser41 ~]$ ssh qnode0114
Warning: Permanently added 'qnode0114,172.20.134.29' (ECDSA) to the list of known hosts.
[qnode0114 ~]$
In the example above, quser41 is a login node, and qnode0114 is a compute node.
You can stop your interactive session by entering the command scancel <slurm-jobid>.
[qnode0114 ~]$ scancel 276305
Jobs with GUIs
To launch an interactive job in order to run an application with a GUI, first connect to Quest using an application with X11 forwarding support. We recommend using FastX. Once you have connected to Quest with X11 forwarding enabled, you can use either the srun or salloc command with the --x11 option added.
Examples:
$ srun --x11 --nodes=1 --ntasks=1 --account=<account> --mem=<memory> --partition=<partition> --time=<hh:mm:ss> --pty bash -l
$ salloc --x11 --nodes=1 --ntasks=1 --account=<account> --mem=<memory> --partition=<partition> --time=<hh:mm:ss>
Job Arrays
Job arrays can be used to submit multiple jobs at once that use the same job submission script. This can be useful if you want to run the same script multiple times with different input parameters.
A job array is created with the addition of the --array option to a job submission script, and using the $SLURM_ARRAY_TASK_ID environment variable to keep track of which job in the array is running.  It is useful to update the job name and output files to incorporate the array ID in the filenames so that a separate log file is created for each job.
Example submission file: jobsubmission.sh
#!/bin/bash
#SBATCH --account=<account>  ## Required: your Slurm account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=<partition> ## Required: buyin, short, normal, long, gengpu, genhimem, etc.
#SBATCH --time=<HH:MM:SS>       ## Required: How long will the job need to run?  Limits vary by partition
#SBATCH --nodes=<#>             ## How many computers/nodes do you need? Usually 1
#SBATCH --ntasks-per-node=<#>   ## How many CPUs or processors do you need per computer/node? (default value 1)
#SBATCH --mem=<#G>              ## How much RAM do you need per computer/node? G = gigabytes
#SBATCH --array=0-9             ## number of jobs to run: here, 10 jobs, labelled 0 through 9 
#SBATCH --job-name="sample_job_\${SLURM_ARRAY_TASK_ID}"   ## use the array id in the name of the job
#SBATCH --output=sample_job.%A_%a.out                     ## use the jobid (%A) and the array index (%a) to name the log files
module purge all
module load python-anaconda3
source activate /projects/intro/envs/slurm-py37-test
# Read in the different input arguments from a file input_args.txt
IFS=$'\n' read -d '' -r -a input_args < input_args.txt
python slurm_test.py --filename ${input_args[$SLURM_ARRAY_TASK_ID]}
This script will create 10 jobs, labelled with the job array indices 0 through 9.  Each job runs the same Python script, slurm_test.py using different input arguments from the file input_args.txt.
input_args.txt contains:
filename1.txt
filename2.txt
filename3.txt
filename4.txt
filename5.txt
filename6.txt
filename7.txt
filename8.txt
filename9.txt
filename10.txt
Tip
Make sure that the number of lines in the input file matches the number of jobs specified by --array.
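A quick way to check, assuming each input is on its own line, is to count the lines in the file and compare the result with the size of the --array range (here, 10 jobs):
$ wc -l < input_args.txt
10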
slurm_test.py contains the following code to read a --filename argument from the command line:
import argparse
import time
def parse_commandline():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--filename",
                        help="Name of file",
                        default=None)
    args = parser.parse_args()
    return args
if __name__ == '__main__':
    args = parse_commandline()
    print(args.filename)
slurm_test.py will receive a value from input_args.txt as an argument.  The first job in the array will receive "filename1.txt" as the input, the second job will receive "filename2.txt" as the input, etc.
Submit the script as normal with sbatch:
$ sbatch jobsubmission.sh
The job array will then be submitted to the scheduler with each array element requesting the same resources (such as number of cores, time, memory etc.) per job.
Dependent Jobs
Dependent jobs are a series of jobs which run or wait to run conditional on the state of another job. For instance, you may submit two jobs and you want the first job to complete successfully before the second job runs.  This is helpful if one job needs to produce a data file or other output file that is an input to another job.  In order to submit this type of workflow, you pass sbatch the jobid of the job that needs to finish before this job starts via the command line argument:
--dependency=afterok:<jobid>
You can manually submit a series of jobs, but it is helpful to write all of your sbatch submission commands in a bash script and pass the job IDs programmatically.  In order to be able to capture and pass the job ID of one job to the next, save the output of a call to sbatch in a variable.
Here is an example submitting 3 jobs in sequence, where each job depends on the previous job completing before it runs.  This example uses the same job submission script, example_submit.sh for each job, but this is not required.  You can use different submission scripts with different resource requests for each job.
Example: wrapper_script.sh
#!/bin/bash
# submit the first job: no special options because it's the first job
jid0=$(sbatch --parsable example_submit.sh)
# submit the second job: dependent on the first job with ID stored in jid0
jid1=$(sbatch --parsable --dependency=afterok:${jid0} --export=DEPENDENTJOB=${jid0} example_submit.sh)
# submit the third job: dependent on the second job with ID stored in jid1
jid2=$(sbatch --parsable --dependency=afterok:${jid1} --export=DEPENDENTJOB=${jid1} example_submit.sh)
The variables jid0, jid1, and jid2 will contain the job ID that Slurm assigns each job.
Tip
Anything you can tell Slurm via #SBATCH in the submission script itself you can also pass to sbatch via the command line.
example_submit.sh
#!/bin/bash
#SBATCH --account=<account>  ## Required: your Slurm account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=<partition> ## Required: buyin, short, normal, long, gengpu, genhimem, etc.
#SBATCH --time=<HH:MM:SS>       ## Required: How long will the job need to run?  Limits vary by partition
#SBATCH --nodes=<#>             ## How many computers/nodes do you need? Usually 1
#SBATCH --ntasks-per-node=<#>   ## How many CPUs or processors do you need per computer/node? (default value 1)
#SBATCH --mem=<#G>              ## How much RAM do you need per computer/node? G = gigabytes
#SBATCH --output=job_%A.out     ## include the job ID in the output file name
# very simple job to just print information to the output file
# print the date and time to the output file
date
# print out the ID of the job this one was dependent on
if [[ -z "${DEPENDENTJOB}" ]]; then
    echo "First job in workflow"
else
    echo "Job started after " $DEPENDENTJOB
fi
Run the wrapper script with
$ bash wrapper_script.sh
Alternatively, make wrapper_script.sh executable with chmod +x wrapper_script.sh and invoke it with ./wrapper_script.sh.
This will submit the three jobs in sequence.  Using squeue -u <netid> after running the above command, you should see jobs 2 and 3 pending for reason DEPENDENCY.
Troubleshooting#
Debugging a Job Submission Script Rejected By The Scheduler#
If your job submission script generates an error when you submit it with the sbatch command, the problem in your script is in one or more of the lines that begin with #SBATCH.
Errors can be difficult to identify, and often require careful reading of your #SBATCH lines.  To debug job scripts that generate error messages:
- Look up the error message in the section below to identify the most likely reason your script received that error message. 
- Once you have identified the issue with your script, edit the script to correct it and resubmit your job. 
- If you receive the same error message again, examine the error message and the mistake in your script more closely. Sometimes the same error message can be generated by two different issues in the same script, meaning it is possible that you may resolve the first issue but need to correct a second issue to clear that particular error message. 
- When you resubmit your job you may receive a new error message. This means the issue that generated the first error message has been resolved, and now you need to fix another issue. 
When Slurm encounters a problem in your job submission script, it does not read the rest of your script that comes after the error. Slurm returns up to two distinct error messages at a time. If your submission script has more than two problems, you will need to resubmit your job multiple times to identify and fix all of them.
Common Error Messages#
The errors listed below may also be generated by interactive job submissions using srun or salloc. In those cases, the error messages will begin with “srun error” or “salloc error.”
sbatch: error: --account option required
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
Location of error: 
#SBATCH --account=<account> 
or 
#SBATCH -A <account>
Example of correct syntax: 
#SBATCH --account=p12345 
or 
#SBATCH -A p12345
Possible issue: The script doesn’t have an #SBATCH line specifying account
Fix: Confirm that #SBATCH --account=<account> is in the script
Possible issue: A typo in the --account= or -A part of this #SBATCH line
Fix: Examine this line closely to make sure the syntax is correct
Possible issue: You are not a member of the Slurm account specified in your job submission script
Fix: Confirm you are a member of the allocation by typing groups at
the command line on Quest. If the allocation you have specified in your
job submission script is not listed, you are not a member of this
allocation. Use an allocation that you are a member of in your job
submission script.
Possible issue: The error is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --account=<account> line
Fix: Move the #SBATCH --account=<account> line to be immediately
after the line #!/bin/bash and submit your job again. If this
generates a new error referencing a different line of your script, the
account line is correct and the mistake is elsewhere in your submission
script. To resolve the new error, follow the debugging suggestions for
the new error message.
sbatch: error: Your allocation has expired
sbatch: error: Unable to allocate resources: Invalid qos specification
Location of error: 
#SBATCH --account=<account> 
or 
#SBATCH -A <account>
The allocation specified in your job submission script is no longer active.
Possible issue: Your allocation has expired
Fix: If you are a member of more than one allocation, you may wish to submit your job to an alternate allocation. To see a list of your allocations, type groups at the command line on Quest.  Otherwise, renew your allocation or request a new one.
srun: error: --partition option required
srun: error: Unable to allocate resources: Access/permission denied
Location of error: 
#SBATCH --partition=<partition> 
or 
#SBATCH -p <partition>
Example of correct syntax for General Access allocations (“p” Slurm account name): 
#SBATCH --partition=short  
or 
#SBATCH -p short
Example of correct syntax for Priority Access allocations (“b” Slurm account name): 
#SBATCH --partition=buyin 
or 
#SBATCH -p buyin
Possible issue: The script doesn’t have an #SBATCH line specifying partition
Fix: Confirm that #SBATCH --partition=<partition> or #SBATCH -p <partition> is in the script.
Possible issue: A typo in the --partition= or -p part of this #SBATCH line
Fix: Examine this line closely to make sure the syntax is correct
Possible issue: The error is on a line earlier in the job submission script which causes Slurm to stop reading the script before it reaches the #SBATCH --partition=<partition> line
Fix: Move the #SBATCH --partition=<partition> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the partition line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.
sbatch: error: Unable to allocate resources: Invalid qos specification
Location of error: 
#SBATCH --partition=<partition> 
or 
#SBATCH -p <partition>
Meaning: The partition name specified is not associated with the account in the line #SBATCH --account=<account>.
Possible issue: The script specifies a Priority Access allocation for the account (account name starts with a “b”), but you’ve entered “short”, “normal” or “long” as the partition.
Fix: Priority Access allocations with purchased compute resources should use the “buyin” partition or partitions specific to their account, not “short”, “normal”, or “long”. Change the partition in the script.
Possible issue: Your script specifies a Slurm account and partition combination which do not belong together.
Fix: Specify the correct partition for your account. To see the accounts and partitions you have access to, use this version of the sinfo command:
$ sinfo -o "%g %.10R %.20l"
GROUPS      PARTITION         TIMELIMIT
b1234       buyin             168:00:00
Note that “GROUPS” are Slurm accounts on Quest.
In this example, valid lines in your job submission script that relate to account, partition and time would be:
#SBATCH --account=b1234
#SBATCH --partition=buyin
#SBATCH --time=168:00:00  ## maximum value; shorter times are OK
sbatch: error: invalid partition specified: <partition_name>
sbatch: error: Unable to allocate resources: Invalid partition name specified
Location of error: 
#SBATCH --partition=<partition> 
or 
#SBATCH -p <partition>
Example of correct syntax for General Access allocations (“p” Slurm account name): 
#SBATCH --partition=short  
or 
#SBATCH -p short
Example of correct syntax for Priority Access allocations (“b” Slurm account name): 
#SBATCH --partition=buyin 
or 
#SBATCH -p buyin
Possible issue: A typo in the --partition= or -p part of this #SBATCH line
Fix: Examine this line closely to make sure the syntax is correct
Possible issue: The script specifies a General Access allocation (“p” Slurm account) with a partition that isn’t “short”, “normal” or “long”
Fix: Change the partition to be “short”, “normal” or “long”
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
sbatch: error: Unable to allocate resources: User’s group not permitted to use this partition
This message can refer to mistakes on the #SBATCH lines specifying account or partition.
Possible location of error specifying account: 
#SBATCH --account=<account> 
or 
#SBATCH -A <account>
Possible location of error specifying partition: 
#SBATCH --partition=<partition> 
or 
#SBATCH -p <partition>
Possible issue: The syntax in the #SBATCH line specifying the account is incorrect
Fix: Examine the account line closely to confirm the syntax is exactly correct. Example of correct account syntax: 
#SBATCH --account=p12345 
or 
#SBATCH -A p12345
Possible issue: You are trying to run in a partition that belongs to one allocation (Slurm account), while specifying a different allocation (Slurm account).
Fix: Specify the correct partition for your account. To see the accounts and partitions you have access to, use this version of the sinfo command:
$ sinfo -o "%g %.10R %.20l"
GROUPS      PARTITION         TIMELIMIT
b1234       buyin             168:00:00
Note that “GROUPS” are Slurm accounts on Quest.
In this example, valid lines in your job submission script that relate to account, partition and time would be:
#SBATCH --account=b1234
#SBATCH --partition=buyin
#SBATCH --time=168:00:00  ## maximum value; shorter times are OK
Possible issue: The error is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --account=<account> line
Fix: Move the #SBATCH --account=<account> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the account line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.
sbatch: error: --time limit option required
sbatch: error: Unable to allocate resources: Requested time limit is invalid (missing or exceeds some limit)
Location of error: 
#SBATCH --time=<hours:minutes:seconds>  
or 
#SBATCH -t <hours:minutes:seconds>
Example of correct syntax: 
#SBATCH --time=10:00:00  
or 
#SBATCH -t 10:00:00
Possible issue: The script doesn’t have an #SBATCH line specifying time
Fix: Confirm that #SBATCH --time=<hh:mm:ss> is in the script
Possible issue: A typo in the --time= or -t part of this #SBATCH line
Fix: Examine this line closely to make sure the syntax is correct
Possible issue: The time request is too long for the partition
Fix: Review the time limits of your partition and adjust the amount of time requested by your script. Priority Access Slurm accounts that begin with a “b” have their own wall time limits. For information on the wall time of your partition, use the sinfo command:
$ sinfo -o "%g %.10R %.20l"
GROUPS      PARTITION         TIMELIMIT
b1234       buyin             168:00:00
To fix this error, set your wall time to be less than the time limit of your partition and resubmit your job.
Possible issue: The error is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --time=<HH:MM:SS> line
Fix: Move the #SBATCH --time=<HH:MM:SS> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the time line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.
sbatch: unrecognized option <option>
Example:
Line in script: #SBATCH --n-tasks-per-node=1
Error generated: sbatch: unrecognized option '--n-tasks-per-node=1'
With an “unrecognized option” error, Slurm correctly read the first part of the #SBATCH line but the option that follows it has generated the error. In this example, the option has a dash between “n” and “tasks” that should not be there. The correct option does not have a dash in that location. This line should be corrected to:
#SBATCH --ntasks-per-node=1
To fix this error, locate the option specified in the error message and examine it carefully for errors. Reference Slurm Configuration Settings for the correct option names.
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Location of error: 
#SBATCH --ntasks-per-node=<CPU count> 
Example of mistake: 
#SBATCH --ntasks-per-node=10000
This error is generated if your job requests more CPUs/cores than are available on the nodes in the partition your job submission script specified. CPU count is the number of cores requested by your job submission script. Cores are also called processors or CPUs.
To fix this error, use the sinfo command to get the maximum number of cores available in the partitions you have access to:
$ sinfo -o "%g %.10R %.20l %.10c"
GROUPS      PARTITION       TIMELIMIT       CPUS
b1234       buyin           2-00:00:00      20+
Then ensure that --ntasks-per-node does not exceed the limit (ignore the “+” in the sinfo output).
#SBATCH --ntasks-per-node=20
For details on the number of cores per node for different Quest compute node generations, see Quest Storage and Compute Resources.
sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n)
This error is caused by hidden characters in your job submission script.
The job submission script was likely created on a Windows computer and copied to Quest without converting the file to use UNIX encoded line endings.
Fix: From the command line on Quest run the command dos2unix <submission_script> to correct your job submission script and resubmit your job to the scheduler.
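For example, assuming the submission script is named jobscript.sh:
$ dos2unix jobscript.sh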
Debugging a Job Accepted by the Scheduler#
Once your job has been accepted, the Slurm scheduler will return a job ID number. After waiting in the queue, your job will run. To see the status of your job, use the command sacct -X.
Not all job submission errors generate error messages; if the output from your job is unexpected or incorrect, there may be an issue with the submission script. If your script’s required elements (account, partition, nodes, cores, and wall time) have been read successfully before Slurm encounters the error, your job will still be accepted by the scheduler and run, just not the way you expect it to. Scripts with issues that don’t generate errors still need to be debugged since the scheduler has ignored some of your #SBATCH lines.
For jobs with mistakes that do not give error messages, you will need to investigate if you notice something is wrong with how the job runs.
Common problems include:
Job runs very slowly or dies after starting
Possible issue: Job script omits or misspecifies the #SBATCH --mem=<amount> or other memory directive.
Fix: All job submission scripts should specify the amount of memory your job needs to run. If your job runs very slowly or dies, investigate if it requests enough memory with the Slurm utility seff.
Job name is the name of the job submission script instead of the job name specified in the submission script
To see the name of your job, run sacct -X. If the JobName listed in the output is the first eight characters of the name of your submission script, Slurm has not read the #SBATCH line for the job name.
Possible issue: Job script omits or misspecifies the #SBATCH --job-name=<jobname> directive.
Possible issue: A typo in the --job-name= or -J part of this #SBATCH line
Fix: Examine this line closely to make sure the syntax is correct
Possible issue: The error is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --job-name=<jobname> line
Fix: Move the #SBATCH --job-name=<jobname> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the job name line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.
Modules or environment variables are inherited from the login session by a running job
Possible issue: The job submission script is not purging modules before starting the job on the compute node
Fix: After the #SBATCH directives in your job submission script, add the line
module purge all
This will clear any modules inherited from your login session and begin your job in a clean environment. You will need to load any necessary modules into your job submission script after this line.
Job immediately fails and generates no output or error file
This happens when the job can't write to the output and/or error files, so it dies immediately.
Possible issue: The job script specifies a directory that does not exist
Fix: Check the output and error files specified in the job submission script.
#SBATCH --output=/path/to/file/file_name
or
#SBATCH --error=/path/to/file/file_name
Tip
Remember that file paths are relative to your current working directory when you run sbatch to submit the job.  Use absolute file paths starting with a / when in doubt.
Possible issue: A typo in the --output= or --error= part of the #SBATCH line
Fix: Examine these lines closely to make sure the syntax is correct
Possible issue: Providing a directory, but not a file name, for output and/or error files
Fix: Add a file name at the end of the specified path. For a file name in the format <job_name>.o<job_id>, use
#SBATCH --output=/path/to/file/"%x.o%j"
Note that if a separate error file is not specified, errors and output will both be written to the output file. To generate a separate error file, include the line:
#SBATCH --error=/path/to/file/"%x.e%j"
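If the directory in the path does not exist yet, create it before submitting the job; the path below is only an example:
$ mkdir -p /projects/<account>/logs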
Troubleshooting Failed Jobs#
There are two common reasons for job failure outside of errors in the code being executed:
Job Exceeded Request Time or Memory
Besides errors in your script or hardware failure, your job may be aborted by the system if it is still running when the wall time limit you requested (or the upper wall time limit for the partition) is reached. You will see a TIMEOUT state for these jobs when running sacct -X.
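To check whether a recent job hit its time limit, you can ask sacct for the elapsed time alongside the limit; JobID, JobName, State, Elapsed, and Timelimit are standard sacct field names:
$ sacct -X --format=JobID,JobName,State,Elapsed,Timelimit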
If you use more cores than you requested, the system will stop the job. This can happen with programs that are multi-threaded. Similarly, if the job exceeds the requested memory, it will be terminated. For this reason, it is important to profile your code's memory requirements.
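For multi-threaded programs, one way to keep the thread count in line with your request is to reserve cores with --cpus-per-task and pass that value to the program. This sketch assumes an application that honors the OMP_NUM_THREADS environment variable:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4    ## reserve 4 cores for the threads
# later, in the body of the script:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK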
Note
If you do not set the number of nodes/cores, memory or time in your job submission script, the default values will be assigned by the scheduler.
Out of Disk Space
Your job could fail if you exceed your storage quota (limit) in your home or projects directory.
Check how much space you are using in your home directory with
$ homedu
or
$ du -h --max-depth=0 ~
Check how much space is used in your projects directory with
$ checkproject <account-name>
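If you are over quota and unsure where the space has gone, listing your largest subdirectories can help, for example in your home directory:
$ du -h --max-depth=1 ~ | sort -rh | head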
Troubleshooting Memory Requests#
How can I tell if my job needs more memory to run successfully?
Use the sacct -X command to see information about your recent jobs, for example:
$ sacct -X
JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1273539      lammps-te+      short     p1234          40  COMPLETED      0:0 
1273543      vasp-open+      short     p1234          40 OUT_OF_ME+    0:125
The “State” field is the status of your job when it finished. Jobs with a “COMPLETED” state have run without system errors. Jobs with an “OUT_OF_ME+” state have run out of memory and failed. “OUT_OF_ME+” jobs need to request more memory in their job submission scripts to complete successfully.
If the job you’re investigating is not recent enough to be listed by sacct -X, add date fields to the command to see jobs between specific start and end dates. For example, to see all jobs between September 15, 2019 and September 16, 2019:
$ sacct -X --starttime=091519 --endtime=091619
Specify the dates in MMDDYY format. More information on sacct is available in the Slurm documentation.
My job ran out of memory and failed, now what?
A common challenge researchers face is that the amount of memory required to run a job may not be known in advance. Running out of memory will cause the job to fail with an out of memory (OOM) error.
A good strategy is to run a test job to determine how much memory your job needs, and then request that amount of memory plus roughly 10% when submitting your full job. In the example below, you'll see that we start by requesting all of the memory on a node. This may be much more memory than we need, but it gives us a starting point to ensure that the job completes. After the job completes, we'll use the seff utility to see how much memory was actually used by the job and adjust accordingly.
To do this:
- Create a test job by editing your job’s submission script to reserve all of the memory of the node it runs on 
- Run your test job 
- Confirm your test job has completed successfully 
- Use seff to see how much memory your job actually used
- Submit your full job with new memory limits 
1. Create a test job
To profile your job’s memory usage, create a test job by modifying your job’s submission script to include the lines:
#SBATCH --mem=0
#SBATCH --nodes=1
Setting --mem=0 reserves all of the memory on the node for your job; if you already have a --mem= directive in your job submission script, comment it out. Now your job will not run out of memory unless your job needs more memory than is available on the node.
Setting --nodes=1 reserves a single node for your job. For jobs that run on multiple nodes, such as MPI-based programs, request the number of nodes your job uses. Be sure to specify a value for #SBATCH --nodes=; otherwise, the cores your job submission script reserves could be spread across as many nodes as there are cores. Be aware that by setting --mem=0, you will be reserving all the memory on every node your cores land on.
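Putting these directives together, a minimal test-job script might look like the sketch below; the account, time limit, and script name are placeholders to fill in:
#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=short
#SBATCH --time=<HH:MM:SS>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=0                 ## reserve all memory on the node for the test run
#SBATCH --job-name=memory-test
module purge all
module load mamba/24.3.0        ## load whatever modules your job needs
python <your_script>.py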
2. Run your test job
Submit your test job to the cluster with the sbatch command. For interactive jobs, use srun or salloc.
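For example, if the test script is saved as test_jobscript.sh (a name chosen here only for illustration):
$ sbatch test_jobscript.sh
Submitted batch job <job_id>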
3. Did your test job complete successfully?
When your job has stopped running, use the sacct -X command to confirm your job finished with state “COMPLETED”. If your test job finishes with an “OUT_OF_ME+” state, confirm that you are submitting the modified job submission script that requests all of the memory on the node. If the “OUT_OF_ME+” errors persist, your job may require more memory than is available on the compute node it ran on. In this case, please email quest-help@northwestern.edu for assistance.
4. How much memory did your job actually use?
To see how much memory it used, run the command seff <test_job_id_number>. This returns output similar to:
Job ID: 767731
    Cluster: quest
    User/Group: abc123/abc123
    State: COMPLETED (exit code 0)
    Cores: 1
    CPU Utilized: 00:10:00
    CPU Efficiency: 100.00% of 00:10:00 core-walltime
    Job Wall-clock time: 00:10:00
    Memory Utilized: 60.00 GB
    Memory Efficiency: 50.00% of 120.00 GB
Check the job State reported in the 4th line. If it is “COMPLETED (exit code 0)”, look at the last two lines. “Memory Utilized” is the amount of memory your job used, in this case 60 GB.
If the job State is FAILED or CANCELLED, the Memory Utilized and Memory Efficiency values reported by seff are unreliable; seff only gives accurate results for jobs that COMPLETED successfully.
How much memory should I reserve in my job script?
This question builds on the process outlined in the “My job ran out of memory and failed, now what?” section above. Once you have an idea of how much memory your job needs to run successfully, you can adjust your memory request to find the best balance between resource needs and wait time.
It’s a good idea to reserve slightly more memory than your job utilized, since the same job may require slightly different amounts of memory depending on variations in the data it processes in each run. To correctly reserve memory for this job, edit your test job submission script and modify the #SBATCH --mem= directive to reserve 10% more than the 60 GB measured above:
#SBATCH --mem=66G
For jobs that use MPI, remove the #SBATCH --mem= directive from your job submission script and specify the amount of memory to reserve per core instead. For example, if your job uses 100 GB of memory in total and runs on 10 cores, reserve 10 GB plus a safety factor per CPU:
#SBATCH --mem-per-cpu=11G
If it doesn’t matter how many nodes your cores are distributed on, you may remove the #SBATCH --nodes= directive as well.
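For the MPI example above (100 GB spread across 10 cores), the resource portion of the script might look like this sketch; the module name and launch line depend on your MPI installation and are only illustrative:
#SBATCH --ntasks=10              ## 10 MPI ranks
#SBATCH --mem-per-cpu=11G        ## ~10 GB per rank plus a safety margin
module load mpi                  ## load your MPI module (name varies)
mpirun -np $SLURM_NTASKS <your_mpi_program>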
Be careful not to reserve significantly more memory than your job requires, as this will increase your job’s wait time. Reserving excessive memory also wastes shared resources that could be used by other researchers.
