GPUs in Python#
Researchers often need to use GPUs from within their Python environment. This page covers some of the more common GPU frameworks available in Python.
Installing and Validating GPU Aware Python Packages#
GPU resources require software designed to manage and take advantage of those resources. Multiple frameworks exist for using GPUs with Python code:
- JAX
- TensorFlow
- PyTorch
- CuPy
- RAPIDS
The sections below demonstrate how to use mamba to create a virtual environment with the required GPU software. Once the GPU software is installed, additional packages should also be installed as needed in the same environment. See the tutorial on Mamba/Conda Virtual Environments for more detail if you’re new to using virtual environments.
Installing these GPU-aware Python packages follows this workflow:
- Determine what version(s) of the package will work for the application being installed 
- Determine what version(s) of CUDA those packages have been built with. If possible, ensure that the package has been compiled with at least CUDA 11.8 so that the software will work on both the A100 and H100 GPUs. 
- After determining the CUDA version, set the environment variable CONDA_OVERRIDE_CUDA accordingly.
- Create the virtual environment. 
- Validate that the GPU aware application can see and use the A100 and/or H100 GPU cards. 
To determine the CUDA version(s) a package has been built with (the second step), you can use mamba search. The search has the following structure (replace the placeholders, including the <>):
$ mamba search '<package-name>[channel=conda-forge,subdir=linux-64,build=*cuda*]>=<version-lower>,<=<version-higher>'
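When several packages need to be checked, the search template above can be assembled programmatically. The following is a minimal sketch only: the helper name `cuda_match_spec` is hypothetical and is not part of mamba. It builds the spec string and prints the resulting command without running it.

```python
def cuda_match_spec(package, lower, upper):
    """Build the match spec used with `mamba search` for CUDA builds.

    `cuda_match_spec` is a hypothetical helper for illustration only.
    """
    # Same selector as the template: conda-forge, linux-64, CUDA builds only.
    selector = "channel=conda-forge,subdir=linux-64,build=*cuda*"
    return f"{package}[{selector}]>={lower},<={upper}"


spec = cuda_match_spec("tensorflow", "2.13.1", "2.17.0")
# The single quotes keep the shell from interpreting [, *, < and >.
print(f"mamba search '{spec}'")
```

Calling the helper with tensorflow and the bounds 2.13.1 and 2.17.0 reproduces the search command used in the tensorflow example on this page.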
The example below follows these steps for tensorflow.
$ mamba search 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]>=2.13.1,<=2.17.0'
Loading channels: done
# Name                       Version           Build  Channel             
tensorflow                    2.13.1 cuda112py310he87a039_1  conda-forge         
tensorflow                    2.13.1 cuda118py310h189a05f_1  conda-forge         
tensorflow                    2.14.0 cuda118py310h148f8e3_0  conda-forge         
tensorflow                    2.14.0 cuda118py311heb1bdc4_0  conda-forge         
tensorflow                    2.14.0 cuda118py39hc3a5e0e_0  conda-forge         
...
...
tensorflow                    2.17.0 cuda120py312h02ad488_201  conda-forge         
tensorflow                    2.17.0 cuda120py312h02ad488_202  conda-forge         
tensorflow                    2.17.0 cuda120py312h02ad488_203  conda-forge         
tensorflow                    2.17.0 cuda120py39h298b457_203  conda-forge  
The CUDA and Python versions used to build tensorflow can be found in the “Build” column. For example, the string cuda120py312h02ad488_201 indicates that this tensorflow package was built with Python 3.12 and CUDA Toolkit 12.0.
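Reading the build string by eye works for a handful of results; for longer listings, the same information can be extracted with a short script. The sketch below assumes the `cuda<digits>py<digits>` naming seen in the search output above, which may not hold for every package. The function names `parse_build` and `runs_on_a100_and_h100` are hypothetical.

```python
import re

# Pattern for build strings such as "cuda120py312h02ad488_201".
# Assumption: builds follow the naming shown in the search output.
BUILD_RE = re.compile(r"cuda(\d+)py(\d+)")


def parse_build(build):
    """Return (cuda_version, python_version) as (major, minor) tuples."""
    match = BUILD_RE.match(build)
    if match is None:
        raise ValueError(f"not a CUDA build string: {build!r}")
    cuda_digits, py_digits = match.groups()
    # "120" -> (12, 0), "118" -> (11, 8): the last digit is the minor version.
    cuda = (int(cuda_digits[:-1]), int(cuda_digits[-1]))
    # "312" -> (3, 12), "39" -> (3, 9): the first digit is the major version.
    python = (int(py_digits[0]), int(py_digits[1:]))
    return cuda, python


def runs_on_a100_and_h100(cuda):
    """Apply the CUDA 11.8 minimum recommended earlier on this page."""
    return cuda >= (11, 8)


for build in ("cuda112py310he87a039_1", "cuda120py312h02ad488_201"):
    cuda, python = parse_build(build)
    print(build, cuda, python, runs_on_a100_and_h100(cuda))
```

Versions are kept as integer tuples so that comparisons such as (11, 8) versus (12, 0) behave correctly.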
From this output, tensorflow 2.17.0 has been built with CUDA 12.0 and is a good candidate to install, as it will run on both the A100 and H100. Attempting to install tensorflow without also setting the environment variable CONDA_OVERRIDE_CUDA to the version of CUDA that tensorflow was built with results in the following error.
$ mamba create --name=tensorflow 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]==2.17.0'
The following package could not be installed
└─ tensorflow >=2.13.1,<=2.17.0 *cuda* is not installable because it requires
   └─ __cuda, which is missing on the system.
An example of setting CONDA_OVERRIDE_CUDA correctly and installing tensorflow into a virtual environment called tensorflow is shown below.
$ CONDA_OVERRIDE_CUDA="12.0" mamba create --name=tensorflow 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]==2.17.0'
...
  + tensorflow-base                2.17.0  cuda120py312hbec54f7_203  conda-forge      386MB
  + tensorflow-estimator           2.17.0  cuda120py312hfa0f5ef_203  conda-forge      696kB
  + tensorflow                     2.17.0  cuda120py312h02ad488_203  conda-forge       43kB
...
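If environment creation is driven from a Python script rather than typed at the shell, the override can be passed through the child process environment. This is a hedged sketch, not a recommended replacement for the command above: `create_env` is a hypothetical helper, and by default it only prepares the call without running mamba.

```python
import os
import subprocess


def create_env(name, spec, cuda_version, dry_run=True):
    """Prepare (and optionally run) `mamba create` with the CUDA override.

    Hypothetical helper for illustration; returns (argv, env).
    """
    env = dict(os.environ)
    # Same effect as prefixing the shell command with CONDA_OVERRIDE_CUDA="12.0".
    env["CONDA_OVERRIDE_CUDA"] = cuda_version
    # No shell is involved, so the spec needs no surrounding quotes.
    argv = ["mamba", "create", f"--name={name}", spec]
    if not dry_run:
        subprocess.run(argv, env=env, check=True)
    return argv, env


argv, env = create_env(
    "tensorflow",
    "tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]==2.17.0",
    "12.0",
)
print(env["CONDA_OVERRIDE_CUDA"], " ".join(argv))
```

Passing `dry_run=False` would invoke mamba itself, so that path assumes the mamba module is already loaded.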
JAX#
JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning.
Installation#
Example workflow to create a virtual environment named jaxlib. Run the commands that follow the $ prompt.
$ module purge  
$ module load mamba/24.3.0
## CUDA 12.0 Built Versions
$ CONDA_OVERRIDE_CUDA=12.0 mamba create --name=jaxlib 'jaxlib[channel=conda-forge,subdir=linux-64,build=*cuda*]>=0.4.26,<=0.4.34'
## CUDA 12.6 Built Versions
$ CONDA_OVERRIDE_CUDA=12.6 mamba create --name=jaxlib 'jaxlib[channel=conda-forge,subdir=linux-64,build=*cuda*]>0.4.34'
Validation#
A short Python script can be used to verify that JAX was installed correctly and can see and use the GPU devices on Quest.
test_gpu.py
from jax import extend
print(extend.backend.get_backend().platform)
print(extend.backend.get_backend().platform_version)
print(extend.backend.get_backend().local_devices())
Create the file test_gpu.py.  Then test everything is working correctly with a batch job similar to:
#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
module purge
module load mamba/24.3.0
source activate jaxlib
python test_gpu.py
The output from the job above should be similar to:
gpu
PJRT C API
cuda 12030
[CudaDevice(id=0)]
Once the installation is verified, use a similar workflow to run your actual Python code.
TensorFlow#
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.
Installation#
Example workflow to create a virtual environment named tensorflow. Run the commands that follow the $ prompt.
$ module purge
$ module load mamba/24.3.0
## CUDA 11.8 Built Versions
$ CONDA_OVERRIDE_CUDA=11.8 mamba create --name=tensorflow 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]>=2.13.1,<2.15.0'
## CUDA 12.0 Built Versions
$ CONDA_OVERRIDE_CUDA=12.0 mamba create --name=tensorflow 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]>=2.15.0,<=2.17.0'
## CUDA 12.6 Built Versions
$ CONDA_OVERRIDE_CUDA=12.6 mamba create --name=tensorflow 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda*]>2.17.0'
Validation#
A short Python script can be used to verify that TensorFlow was installed correctly and can see and use the GPU devices on Quest.
test_gpu.py
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))
print("GPUs: ", physical_devices)
print("Tensorflow Built With CUDA Support {0}".format(tf.test.is_built_with_cuda()))
Create the file test_gpu.py.  Then test everything is working correctly with a batch job similar to:
#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
module purge
module load mamba/24.3.0
source activate tensorflow
python test_gpu.py
The output from the job above should be similar to:
Num GPUs: 1
GPUs:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Tensorflow Built With CUDA Support True
Once the installation is verified, use a similar workflow to run your actual Python code.
PyTorch#
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration 
- Deep neural networks built on a tape-based autograd system 
Installation#
Example workflow to create a virtual environment named pytorch. Run the commands that follow the $ prompt.
$ module purge
$ module load mamba/24.3.0
## CUDA 12.0 Built Versions
$ CONDA_OVERRIDE_CUDA=12.0 mamba create --name=pytorch 'pytorch[channel=conda-forge,subdir=linux-64,build=*cuda*]>=2.0.0,<=2.5.1'
## CUDA 12.6 Built Versions
$ CONDA_OVERRIDE_CUDA=12.6 mamba create --name=pytorch 'pytorch[channel=conda-forge,subdir=linux-64,build=*cuda*]>=2.5.1'
Validation#
A short Python script can be used to verify that PyTorch was installed correctly and can see and use the GPU devices on Quest.
test_gpu.py
import torch
print("Num GPUs: ", torch.cuda.device_count())
print("GPUs: ", torch.cuda.get_device_name(0))
print("PyTorch Built With CUDA Support {0}".format(torch.cuda.is_available()))
Create the file test_gpu.py.  Then test everything is working correctly with a batch job similar to:
#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
module purge
module load mamba/24.3.0
source activate pytorch
python test_gpu.py
The output from the job above should be similar to:
Num GPUs: 1
GPUs: NVIDIA H100 PCIe
PyTorch Built With CUDA Support True
Validate PyTorch for use with Multiple Nodes with Multiple GPUs#
PyTorch can also run multi-node, multi-GPU jobs. The batch submission script below shows how to use multiple nodes with multiple GPUs with PyTorch. The Python code for this setup is available on the GitHub page below.
GitHub Repo: nuitrcs/examplejobs
You can modify the Python code to suit your needs.
#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --time=04:00:00
#SBATCH --job-name=multinode-example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --mem=20G
#SBATCH --cpus-per-task=4
module purge
module load mamba/24.3.0
source activate pytorch
export LOGLEVEL=INFO
srun torchrun \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint "$SLURMD_NODENAME:29500" \
    ./multinode_torchrun.py 10000 100
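torchrun starts each worker process with environment variables that describe its place in the job, including `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below shows one way a script might read them. It is not taken from the nuitrcs/examplejobs repository, and the names `WorkerInfo` and `read_worker_info` are hypothetical. With `--nnodes 2` and `--nproc_per_node 2`, torchrun's world size is 2 × 2 = 4.

```python
import os
from dataclasses import dataclass


@dataclass
class WorkerInfo:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes in the job


def read_worker_info(env=None):
    """Read the variables torchrun sets; defaults cover a plain `python` run."""
    env = os.environ if env is None else env
    return WorkerInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


info = read_worker_info()
print(f"rank {info.rank} of {info.world_size}, local rank {info.local_rank}")
```

A training script would typically use `local_rank` to choose its GPU, for example with `torch.cuda.set_device(info.local_rank)`.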
CuPy#
CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.
All instructions below use the mamba package manager. Please see Using Python on QUEST for more information on Mamba virtual environments.
Installation#
Run the commands that follow the $ prompt.
$ module purge
$ module load mamba/24.3.0
$ mamba create --name=cupy cupy 
Validation#
A short Python script can be used to verify that CuPy was installed correctly and can see and use the GPU devices on Quest.
test_gpu.py
import cupy as cp
x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)
print(x_gpu)
print(l2_gpu)
Create the file test_gpu.py.  Then test everything is working correctly with a batch job similar to:
#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
module purge
module load mamba/24.3.0
source activate cupy
python test_gpu.py
The output from the job above should be similar to:
[1 2 3]
3.7416573867739413
Once the installation is verified, use a similar workflow to run your actual Python code.
RAPIDS#
These install instructions should work for rapidsai v25.04. They may work for newer releases of rapidsai, but that depends on the version of CUDA that was used to compile the conda package. If that version is older than CUDA 12.0 or newer than CUDA 12.8, it will not work with the Quest GPUs.
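The supported range described above, CUDA 12.0 through 12.8, can be checked with a few lines of Python before attempting an install. This is an illustrative sketch; the function name `rapids_cuda_supported` is hypothetical, and the bounds are simply the ones stated on this page.

```python
def rapids_cuda_supported(version, lowest=(12, 0), highest=(12, 8)):
    """Return True if a "major.minor" CUDA version lies within the range."""
    major, minor = (int(part) for part in version.split(".")[:2])
    # Compare as integer tuples so that "12.10" is not mistaken for 12.1.
    return lowest <= (major, minor) <= highest


for version in ("11.8", "12.0", "12.8", "12.9"):
    print(version, rapids_cuda_supported(version))
```

Comparing integer tuples rather than floats avoids misordering two-digit minor versions.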
Installation#
Run the commands that follow the $ prompt.
$ module purge
$ module load mamba/24.3.0
$ mamba create -n rapids-25.04 -c rapidsai -c conda-forge -c nvidia  \
    rapids=25.04 python=3.12 'cuda-version>=12.0,<=12.8'
Note: Please see Getting Started with RAPIDS for more details.
Validation#
A short Python script can be used to verify that RAPIDS was installed correctly and can see and use the GPU devices on Quest.
test_gpu.py
import cudf
import pandas as pd
import numpy as np
rng = np.random.default_rng(seed=0)
num_rows = 1_000_000
pdf = pd.DataFrame(
    {
        "numbers": rng.integers(-1000, 1000, num_rows, dtype="int64"),
        "business": rng.choice(
            ["McD", "Buckees", "Walmart", "Costco"], size=num_rows
        ),
    }
)
gdf = cudf.from_pandas(pdf)
print(gdf.value_counts())
Create the file test_gpu.py.  Then test everything is working correctly with a batch job similar to:
#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=12G
module purge
module load mamba/24.3.0
source activate rapids-25.04
python test_gpu.py
The output from the job above should be similar to:
numbers  business
-452     McD         169
 785     McD         165
 997     Walmart     164
 368     McD         162
-792     Buckees     162
                    ... 
 26      Costco       90
 308     Walmart      90
-320     Walmart      85
-934     Walmart      85
 692     Costco       82
Name: count, Length: 8000, dtype: int64
Once the installation is verified, use a similar workflow to run your actual Python code.
