High-Performance Computing Best Practices

There can be a lot to learn as a new user of a high-performance computing (HPC) cluster like Quest. After you’ve learned about the structure and terminology of Quest through the Introduction to Quest, and worked your way through the Quest User Guide for more detailed information, we recommend you check out the best practices below to avoid common issues and use Quest effectively.

Login Nodes are Limited Resources

When you connect to an HPC cluster via ssh, you land on a login node. These nodes are the entry point for the system and are used to launch batch jobs or interact with the cluster. Running larger computations on these nodes can have a negative impact on all users of the cluster.

If you’re unsure whether your activity has an impact, you can run the top command on the login node. If you see that you have several processes running at over 100% utilization, you should consider moving those processes to a compute node via a scheduled batch job.

$ top -u <netid> #Replace <netid> with your NetID

Learn more: Scheduling Jobs with Slurm

Home Folder Space is Limited

As an HPC user, you automatically have a home directory set up for you. On Quest, these directories are limited to 80 GB and are secured so that only you have access. They are great for small files or temporary testing. However, they fill up quickly, and, when they do, they cause odd behavior and failures in the system.

If you see an error that you hadn’t previously experienced, the first command to run is homedu. This will tell you how much space is free in your home directory and which files are taking up space. You can then remove any unneeded files to free up space in your home directory.

$ homedu

Learn more: Quest file systems

Use Compute Resources Efficiently

HPC clusters are shared resources. Requesting resources that your job doesn’t use slows down others’ ability to complete their work. When you are running a job, there are a few different tools that are available to monitor the efficiency of your job:

  • checkjob <JobID>

    • Can be run after your job completes

    • Will output helpful statistics such as CPU and memory used

    • Good to help right-size your job resources

  • seff <JobID>

    • An alternative to checkjob

    • Will also output CPU and memory used
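For example, once a job has finished, you can query either tool with its numeric JobID (123456 below is a placeholder):

```shell
# Both tools take the JobID of a completed job; 123456 is a placeholder.
$ checkjob 123456
$ seff 123456
# In the output, compare the CPU and memory actually used against what
# you requested, and adjust future resource requests accordingly.
```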

How efficient should my jobs be?

When profiling your jobs, we don’t recommend you go for 100% memory or CPU usage.

This is because the resources required by a job can fluctuate. We don’t want your job to run out of resources and hit an error.

In general, 70% resource utilization for memory and CPU is a good target.

How do I know how many resources I should use?

We recommend conducting a scaling study.

Submit an initial job with enough resources to ensure that your job runs. Then, using the tools above, review your completed job to get an idea of how many resources will be needed for future runs.
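A scaling study can start with a deliberately generous batch script. This is only a sketch: the account, partition, software module, and command below are placeholders you would replace with your own, and the resource values are just a first guess.

```shell
#!/bin/bash
#SBATCH --account=<allocation>     # replace with your allocation ID
#SBATCH --partition=<partition>    # replace with an appropriate partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4        # deliberately generous first guess
#SBATCH --mem=16G                  # deliberately generous first guess
#SBATCH --time=02:00:00

module load <software>/<version>   # load the specific versions you need
my_analysis input.dat              # placeholder for your actual command
```

After the job completes, run checkjob or seff on its JobID and trim the requested CPU and memory toward the utilization the job actually showed.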

Learn more: Slurm on Quest

Test the Impact of Scaling Compute Resources

There is often a temptation to increase the number of cores or amount of RAM in an effort to make your code run faster. But more memory and more CPUs do not always equal more speed.

First, your code must be written to make use of additional cores and memory. This does not happen automatically for most software programs.

If your code can use additional cores or memory, we recommend conducting a scaling study to see the actual effect of using additional resources.

  • Select a set of resources that you know works for the job.

    • For example, 4 cores and 10 GB of memory.

  • Time that job to see how long it takes to complete.

  • Run a few other tests at increments of more resources.

    • Maybe a test at 8 cores and 20 GB of memory and another test at 12 cores and 30 GB of memory.

  • Time these as well to see if you see performance improvement.
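The steps above can be sketched as a small loop that submits the same job at several sizes. The sbatch command-line options override the directives inside the script; run_job.sh and the specific increments here are just example placeholders.

```shell
# Submit the same batch script at increasing resource levels.
# run_job.sh is a placeholder for your existing batch script.
for cores in 4 8 12; do
    mem=$(( cores * 10 / 4 ))   # scale memory with cores: 10, 20, 30 GB
    sbatch --ntasks=${cores} --mem=${mem}G run_job.sh
done
# Compare the elapsed time of each completed run (e.g., with seff)
# to see where the speedup levels off.
```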

Surprisingly, beyond a certain point additional resources can add overhead that actually slows a job down. This is because of the communication required for a job to run on multiple computing resources at the same time.

Learn more: Slurm on Quest

Scratch Space is Spacious, Fast, and Your Friend When Designing I/O Intensive Workflows

Our system provides 5 TB of free scratch storage to any user who applies. This storage can be a great option for intermediate data products.

It is important to note that data in /scratch is deleted after approximately a month. Files that you need to store for longer periods should be transferred to other directories on Quest or moved to other data storage systems.
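For example, an I/O-intensive job might write its intermediate files to scratch and copy only the final results back to durable storage. This sketch assumes a per-user directory under /scratch (check your actual scratch path); the command and project path are placeholders.

```shell
# Write heavy intermediate output to scratch, keep only the results.
# Replace <netid> and the project path with your own.
SCRATCH=/scratch/<netid>/myrun          # assumes per-user space under /scratch
mkdir -p ${SCRATCH}
my_pipeline --workdir ${SCRATCH} input.dat          # placeholder command
cp ${SCRATCH}/results.csv /projects/<allocation>/   # copy results to durable storage
```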

Learn more: Quest file systems

Not All Software Can Make Use of GPUs

GPUs can provide significant performance improvements for certain types of computational tasks. However, code must be written to take advantage of GPUs; most software does not use them automatically.

If you are unsure whether your code can utilize GPUs, it’s good to run a small, single-GPU test job and then check GPU utilization directly to confirm the device is being used.
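One way to check is a short single-GPU batch job that snapshots GPU utilization while your code runs. nvidia-smi is the standard NVIDIA monitoring tool; the account, partition, and command names below are placeholders.

```shell
#!/bin/bash
#SBATCH --account=<allocation>
#SBATCH --partition=<gpu-partition>   # replace with a GPU partition name
#SBATCH --gres=gpu:1                  # request a single GPU
#SBATCH --time=00:30:00

my_gpu_code &                         # placeholder: launch your code in the background
sleep 60                              # give it time to start computing
nvidia-smi                            # a nonzero GPU-Util means the GPU is in use
wait
```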

If you are unsure, you can always reach out to our team at quest-help@northwestern.edu.

Learn more: GPUs on Quest

Load Specific Versions of Software Modules

Modules make software available across the HPC cluster.

It may be tempting to just load a module by name. For example, module load matlab.

However, we do not recommend this: it relies on the default module version, which can change over time. We recommend loading a specific version of the module instead. For example, module load matlab/r2023b.

This ensures that you are using the same module for each run of your code, which can help to reduce errors.
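To see which versions are available before pinning one, you can list them and then load a version explicitly (MATLAB shown here, as in the example above):

```shell
$ module avail matlab        # list the MATLAB versions installed on the cluster
$ module load matlab/r2023b  # load one specific version, not the default
$ module list                # confirm exactly which modules are active
```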

Learn more: Quest software module system

Conda/Mamba is a Powerful Tool, but Without Adjustments, It Will Fill Up Your Home Folder

Virtual environments can help you build and install many different applications on an HPC cluster. However, they often fill up a home directory quickly.

We recommend:

  • Use the --prefix argument when building your environment in conda.

    • This creates and stores the environment outside of your home directory, saving your home directory space.

  • Change the Conda package cache directory.

    • Conda caches downloaded packages, by default in your home directory. Relocating this cache keeps those files from filling up your home directory.
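A minimal sketch of both adjustments. The /projects paths and environment name are placeholders; conda config writes the cache setting to your ~/.condarc.

```shell
# Create the environment outside your home directory with --prefix.
$ conda create --prefix /projects/<allocation>/envs/myenv python=3.11

# Point the package cache away from your home directory as well.
$ conda config --add pkgs_dirs /projects/<allocation>/conda_pkgs

# Activate the environment by its full path.
$ conda activate /projects/<allocation>/envs/myenv
```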

Learn more: Conda and Mamba Virtual Environments

Quest Analytics Nodes are Good for Small to Medium Workflows

The Quest Analytics platform provides an easy-to-use web portal for SAS Studio, RStudio, and JupyterHub.

However, the Quest Analytics nodes are a limited pool, and all users on a node share its resources. This means that if any single user takes up too much memory or CPU, it can degrade or bring down the service for all users on that node.

To avoid this, sessions are monitored for resource consumption; those that use too many resources are stopped.

It’s important to note that if you are trying to run something on Quest Analytics and it keeps stopping for no apparent reason, it may be due to these usage restrictions. In that case, it’s a good idea to move your interactive workflow to the Quest OnDemand platform in order to leverage additional computing resources. The Quest resources scheduled through Slurm offer hundreds of times more computational capacity than the Analytics Nodes.

Learn more: Quest Analytics Nodes