Using Ollama on Quest#
Ollama is a framework for building and running large language models (LLMs) on local machines.
Warning
Ollama provides access to open source models that anyone can contribute to. Only use models from trusted or verified sources, as some models contain malicious code. The Ollama infrastructure can also create security vulnerabilities if you run it on your own computer.
This tutorial explains how to run a large language model (LLM) on Quest as a batch job. You will:
Set up the Ollama server
Pull a model
Run an example Python script
Before starting, you must be familiar with Python, Quest, virtual environments, and basic large language model (LLM) usage. If not, see the Quest User Guide, Python on Quest, and Virtual Environments on Quest pages.
Replace placeholders in this tutorial as follows:
<netID>: your NetID
<account_id>: your Quest allocation (e.g., pXXXX)
Note
While you can use Ollama to run an interactive chatbot on Quest, the chatbot will not be accessible outside of your Quest account. For this reason, we do not cover Ollama’s chatbot functionalities in this tutorial. If you require this functionality, contact quest-help@northwestern.edu to schedule a consultation about your workflow.
What You Will Learn#
After completing this tutorial, you will be able to:
Run an Ollama model within a Slurm batch job on Quest
Store and reuse models across jobs using shared storage
Configure a Python script to communicate with an Ollama server
Submit and manage batch jobs that use LLMs through Ollama
Prerequisites#
Before starting, ensure you have:
An active Quest allocation (e.g., pXXXX)
Access to GPU partitions (if using GPU models); General Access Quest allocations have access to GPU partitions
Familiarity with Slurm job submission
Create a Virtual Environment Including Ollama#
To use Ollama models in batch jobs on Quest, create a mamba or conda virtual environment with:
The Ollama Python API
Any additional packages required by your script
The Python API allows your script to communicate with a running Ollama server.
The Python API runs in your virtual environment
The Ollama server is provided through a Quest module and runs separately
These components communicate over a local network port during job execution.
Connect to the Quest login nodes and load the mamba module:
$ module load mamba/24.3.0
Create and activate a virtual environment with mamba. The --prefix argument sets a custom environment location instead of the default (/home/<netID>/.conda/envs/). In the commands below, replace <account_id> with your allocation so that the path points to your /projects directory. This example creates an environment called ollama-env.
$ mamba create --prefix=/projects/<account_id>/envs/ollama-env -c conda-forge python=3.12
$ mamba activate /projects/<account_id>/envs/ollama-env
Install the Ollama Python API. Installing Ollama with pip provides only the Python client library for sending requests to a server; it does not install or start the server.
Later in this tutorial, you will load the Ollama module and start the Ollama server separately. Your Python script will connect to that running server through the Python API installed here.
$ pip install ollama
Install any additional Python packages required by your workflow. Replace or extend this list as needed for your script.
$ mamba install -c conda-forge pandas matplotlib
Store Ollama Models#
When you pull a model, Ollama downloads it and stores it on disk. Ollama manages the files internally using content hashes (similar to container image layers), and the underlying model data can be several gigabytes in size. The storage location is determined by the OLLAMA_MODELS environment variable.
By default, Ollama stores models in your home directory (/home/<netID>/). Because home directories on Quest have a storage limit of 80 GB, this location is not suitable for larger models or workflows that require multiple models.
To store models in a different location, set the OLLAMA_MODELS environment variable to a directory in /scratch or /projects. Scratch (/scratch) is best for very large models or temporary workflows as it has a 5 TB quota and files are automatically deleted after 30 days of inactivity. Projects (/projects) is best for long-term storage and workflows that are shared with others with access to your Quest allocation.
To store models in scratch space, open a terminal, log in to Quest, and run the following commands:
mkdir -p /scratch/<netID>/ollama-models
echo "export OLLAMA_MODELS=/scratch/<netID>/ollama-models" >> $HOME/.bashrc
After updating your ~/.bashrc, log out and back in before submitting jobs so that the variable is set in your Slurm jobs. To apply the change to your current session without logging out, run:
source ~/.bashrc
To use project storage instead, replace /scratch/<netID> in the commands above with your /projects/<account_id> directory.
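To keep an eye on how much space pulled models occupy, a short Python check along these lines may help. It is a minimal sketch, not part of the tutorial's workflow files: the helper name directory_size_gb is hypothetical, and it only sums file sizes under the directory named by OLLAMA_MODELS, so it does not report your quota.

```python
import os
from pathlib import Path


def directory_size_gb(directory: Path) -> float:
    """Sum the sizes of all regular files under `directory`, in GB (10^9 bytes)."""
    total_bytes = sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())
    return total_bytes / 1e9


# OLLAMA_MODELS is the variable set above; nothing is assumed if it is missing.
models_dir = os.environ.get("OLLAMA_MODELS")
if models_dir and Path(models_dir).is_dir():
    print(f"{models_dir} holds {directory_size_gb(Path(models_dir)):.2f} GB of model data")
else:
    print("OLLAMA_MODELS is unset or does not point to an existing directory")
```

Running this on a login node after a pull gives a rough idea of whether the chosen location has room for additional models.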
Example Workflow#
This example demonstrates an end-to-end workflow for generating stories with Ollama on Quest. At a high level, the workflow consists of:
Define configuration and inputs: Specify the model, system prompt, input files, and output behavior in create_stories.toml.
Define the Python client script: Import required packages and define how the Python script will connect to the Ollama server when the workflow runs.
Configure the client using create_stories.toml: Load settings from the TOML file, load the system prompt, and read the author, genre, and topic files.
Generate stories using the LLM: Use the Ollama model to generate stories for each author-genre-topic combination.
Save outputs to disk: Write generated stories to CSV files based on the saving options defined in the configuration.
The full workflow and all necessary files can be found on our GitHub page. Below, only sections of the files are included to discuss specific components. There are three main files:
create_stories.py: The Python script that does the main work.
create_stories.toml: Configuration settings that the Python script uses.
submit_create_stories.sh: The Slurm job submission script, which sets up the Ollama server and then runs the Python script.
The workflow also uses supporting files in the sysm/ directory.
Run this workflow with a batch job or in an interactive session, not on a Quest Login Node, as running resource-intensive workloads on login nodes can negatively impact other users and may result in your processes being terminated.
Step 1: Define Configuration and Inputs#
Workflow behavior is controlled by a configuration file named create_stories.toml. This file defines:
Which Ollama model to use
The system prompt given to the model
The input files containing authors, genres, and topics
Whether outputs are saved by author, combined into a single file, or both
Whether inputs are downsampled
An example configuration file is shown below:
File: create_stories.toml
[literary-elements]
authors = "sysm/authors.txt"
genres = "sysm/genres.txt"
topics = "sysm/topics.txt"
[system]
system_message = "sysm/sysm_create_stories.txt"
[saving]
save_yn = 1
save_all_yn = 1
[downsampling]
downsample_yn = 1
downsample_quantity = 2
[model]
llm_model = "llama3.2"
Ensure all file paths in create_stories.toml are correct relative to your working directory, or use absolute paths.
Each section in the create_stories.toml file serves a specific purpose:
[system] - Defines the prompt given to the model
system_message points to a text file containing the instructions the model follows when generating stories
In this workflow, the prompt instructs the model to write a story based on an author, genre, and topic. The prompt instructions are in the file sysm/sysm_create_stories.txt, which is shown below.
[literary-elements] – Specifies the inputs used to generate stories
Paths to the text files that contain:
Authors
Genres
Topics
The script combines these elements to generate different story variations
[saving] - Controls how generated responses are saved
save_yn: Set to 1 to save responses in files grouped by author, or 0 to skip
save_all_yn: Set to 1 to save all responses in a combined file, or 0 to skip
[downsampling] - Controls whether the workflow uses a subset of the available inputs
Downsampling limits the number of authors, genres, and topics used in a run
When enabled, the script selects the first N elements from each input file
downsample_yn: Set to 0 (no) or 1 (yes) to choose whether to downsample
downsample_quantity: Number of items to include in the subsample if downsampling is enabled
Note: The subsample cannot exceed the number of literary elements in your files
[model] - Specifies the Ollama model used for text generation
llm_model: Name of the Ollama model to pull and run
The system prompt:
File: sysm/sysm_create_stories.txt
"You are a master storyteller.
Your task is to create a long story (approximately 15 paragraphs) about {story_topic}.
You must write this story in the literary genre of {story_genre}.
You must write this story in the style of {story_author}.
At the beginning of your story, write a line indicating the author ({story_author} in this case), the genre ({story_genre}),
and the topic ({story_topic}). Then start the story."
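The curly-brace placeholders in the prompt are filled in with Python's str.format, as the script does later. The sketch below uses an abbreviated stand-in for the prompt file and hypothetical example values to show the substitution.

```python
# Abbreviated stand-in for sysm/sysm_create_stories.txt.
template = (
    "Your task is to create a long story about {story_topic}.\n"
    "You must write this story in the literary genre of {story_genre}.\n"
    "You must write this story in the style of {story_author}."
)

# Example values are hypothetical; the workflow reads them from the input files.
prompt = template.format(
    story_topic="a lighthouse",
    story_genre="mystery",
    story_author="Jane Austen",
)
print(prompt)
```

Because str.format treats every brace as a placeholder, a literal brace in a custom prompt file would need to be doubled ({{ or }}).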
Step 2: Define the Python Client Script#
After defining create_stories.toml, create the Python script that acts as a client to the Ollama server. This script does not start the Ollama server; it connects to a server started later as part of a batch job.
The first step in create_stories.py is to import the required Python packages.
File: create_stories.py
# :: IMPORTS ::
import ollama
import pandas as pd
from datetime import datetime
from pathlib import Path
import tomllib
from ollama import Client
import os
This Python script requires the packages shown above. When adapting this script for your own workflow, ensure that all required packages are installed and imported.
Find and Export a Free Port#
Ollama runs as a local server on the compute node and must listen on an available port. The submission script finds an available port using the find_port function included in the submission script and then exports that value as an environment variable that the Python client uses to connect.
File: submit_create_stories.sh
OLLAMA_PORT=$(find_port localhost 7000 11000)
export OLLAMA_PORT
echo $OLLAMA_PORT
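The find_port function itself is not reproduced in this tutorial. To illustrate the idea of scanning a range for an available port, here is a rough Python sketch; it is not the submission script's implementation, and find_free_port is a hypothetical name.

```python
import socket


def find_free_port(host: str, low: int, high: int) -> int:
    """Return the first port in [low, high] that can be bound on `host`."""
    for port in range(low, high + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is in use or unavailable; try the next one.
            return port
    raise RuntimeError(f"No free port between {low} and {high} on {host}")


print(find_free_port("localhost", 7000, 11000))
```

A check like this only shows that the port was free at the moment of testing; another process could still claim it before the server starts.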
The Python script then uses the environment variable OLLAMA_PORT to connect to the server.
File: create_stories.py
# Connect to port
client = Client(
    # os.environ[...] raises a KeyError naming OLLAMA_PORT if the variable is not set
    host="http://localhost:" + os.environ["OLLAMA_PORT"]
)
At this stage, the client connection is defined but no requests are sent. The Ollama server itself will be started when the batch job runs.
Step 3: Configure the Client Using create_stories.toml#
Once the Python client is set up, the script loads the configuration and input files defined in create_stories.toml.
File: create_stories.py
# Load parameters and directory info
master_directory = Path("create_stories.toml")
with open(master_directory, "rb") as f:
    config_params = tomllib.load(f)
If create_stories.toml is located in a different directory than create_stories.py, replace the argument to Path with the absolute path to the configuration file create_stories.toml.
Pull the Specified Model From Ollama#
The call to client.pull() ensures that the specified model is available. If the model is not already in your OLLAMA_MODELS directory, it will be downloaded at this step during job execution.
Model downloads can take significant time and disk space (several GB). Ensure your job time and storage allocation are sufficient for the initial pull.
File: create_stories.py
# Get model's name from the configuration file
llm_model = config_params["model"]["llm_model"]
# Pull the specified model from Ollama (if it is not already available)
client.pull(llm_model)
# Load system instructions from the prompt file specified in create_stories.toml
sysm_file = Path(config_params["system"]["system_message"])
with open(sysm_file, "r") as f:
    sysm = f.read()
In this workflow, the system prompt file is stored in a directory named sysm/. This directory name is not required and is used only as an example for organizing prompt files. The script loads the system prompt from the file path specified in the [system] section of create_stories.toml.
The script also loads the author, genre, and topic inputs defined in the [literary-elements] section.
File: create_stories.py
# Format literary elements
elements_lists = {}
for element, element_file in config_params["literary-elements"].items():
    element_file = Path(element_file)
    with open(element_file, "r") as f:
        elements_lists[element] = [x.strip() for x in f.readlines()]
If downsampling is enabled in create_stories.toml, the script reduces the number of input values before generation.
File: create_stories.py
# Apply downsampling if enabled
if config_params["downsampling"]["downsample_yn"]:
    n = config_params["downsampling"]["downsample_quantity"]
    final_elements = {k: v[:n] for k, v in elements_lists.items()}
else:
    final_elements = elements_lists
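Downsampling directly controls how many stories a run generates, because one story is produced per author-genre-topic combination. The sketch below uses hypothetical input lists in place of the three text files to show the effect of keeping the first two entries of each.

```python
from itertools import product

# Hypothetical inputs standing in for the contents of the three text files.
elements_lists = {
    "authors": ["Author A", "Author B", "Author C"],
    "genres": ["mystery", "romance", "satire"],
    "topics": ["the sea", "a city", "a journey"],
}
downsample_quantity = 2

# Keep the first N entries of each list, as the slicing in the script does.
final_elements = {k: v[:downsample_quantity] for k, v in elements_lists.items()}

# One story is generated per author-genre-topic combination.
combinations = list(product(*final_elements.values()))
print(len(combinations))  # 2 * 2 * 2 = 8
```

Estimating this count before submitting is a simple way to judge how much job time to request.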
Step 4: Generate Stories Using the LLM#
Once all parameters are set, the script generates the stories using the Ollama model. The code below defines functions to:
Generate stories (generate_author_responses)
For each author, genre, and topic combination
Based on the prompt you defined in the TOML file
Using the Ollama model specified by llm_model
Save results by author (save_per_author)
If save_yn = 1 in create_stories.toml, each author’s stories are saved in a separate CSV file
Save all results in a combined file (save_all_in_one)
If save_all_yn = 1 in create_stories.toml, all generated stories are appended to a single CSV file
The combined file uses header logic to avoid duplication on repeated runs
File: create_stories.py
# :: DEFINE GENERATING AND SAVING FUNCTIONS ::
def generate_author_responses(final_elements: dict[str, list[str]], author: str, sysm: str, llm_model: str) -> pd.DataFrame:
    responses = []
    for topic in final_elements["topics"]:
        for genre in final_elements["genres"]:
            # Send the request through the client connected to OLLAMA_PORT; modify if using a different client library.
            response = client.generate(
                model=llm_model,
                prompt=sysm.format(
                    story_topic=topic,
                    story_genre=genre,
                    story_author=author,
                ),
            )
            story = response["response"]
            responses.append({
                "author": author,
                "topic": topic,
                "genre": genre,
                "story": story + "\n" + "**END**\n",
            })
    responses = pd.DataFrame(responses)
    return responses
Step 5: Save Generated Outputs#
The final block calls these functions to generate and save files. To change how responses are saved (by author, combined, or both), update the settings in create_stories.toml.
File: create_stories.py
def save_per_author(responses_author: pd.DataFrame, author: str, data_directory: Path):
    # Save author responses
    author_file = data_directory / f"response_{author}.csv"
    responses_author.to_csv(author_file, index=False)

def save_all_in_one(responses: pd.DataFrame, data_directory: Path):
    # Appends to a CSV file with all responses. This may be slower due to repeatedly opening and closing the file.
    full_file = data_directory / "response_all.csv"
    with open(full_file, "a") as f:
        # Write the header only when the file is still empty
        responses.to_csv(f, index=False, header=(f.tell() == 0))
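The header logic in save_all_in_one relies on the file position being 0 only when the file is empty. The sketch below illustrates the same idea with the standard-library csv module instead of pandas, writing to a temporary directory; the helper append_rows and the sample row are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def append_rows(csv_path: Path, rows: list[dict]) -> None:
    """Append rows to a CSV file, writing the header only if the file is empty."""
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["author", "topic", "genre", "story"])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "response_all.csv"
    row = {"author": "A", "topic": "T", "genre": "G", "story": "text"}
    append_rows(out_file, [row])  # First call writes the header and one row.
    append_rows(out_file, [row])  # Second call appends one row without a header.
    lines = out_file.read_text().splitlines()
    print(lines)
```

Two appends produce a single header line followed by two data rows, which is the behavior the combined file depends on across repeated runs.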
Generated CSV files will be saved in a dated folder under data_out/ within your working directory.
File: create_stories.py
# :: GENERATE AND SAVE FILES ::
# Create new data_date folder (if it doesn't exist):
dt = datetime.now().strftime('%Y_%m_%d')
data_directory = Path(f"data_out/data_{dt}")
# parents=True also creates data_out/ if it does not exist yet
data_directory.mkdir(parents=True, exist_ok=True)
# Run it - This loop saves after each author's stories are generated. This may be slower, but it uses less memory than generating all responses first.
for author in final_elements["authors"]:
    responses = generate_author_responses(final_elements, author, sysm, llm_model)
    if config_params["saving"]["save_yn"]:
        save_per_author(responses, author, data_directory)
    if config_params["saving"]["save_all_yn"]:
        save_all_in_one(responses, data_directory)
Run the Workflow on Quest#
To run create_stories.py as a batch job on Quest, use the submission script submit_create_stories.sh. This script handles starting the Ollama server, setting up networking, and running your Python workflow on a compute node.
To submit the job, use the command
sbatch submit_create_stories.sh
Slurm Job Configuration#
Begin by specifying the appropriate SBATCH options for your job. In most cases, you only need to update your allocation and email address. If you change the downsample size, model type, or other Ollama-specific parameters, adjust the job duration or GPU request as needed.
Some Ollama models require GPUs for efficient execution. For smaller models, CPU-only jobs may be sufficient.
File: submit_create_stories.sh
#!/bin/bash
#SBATCH --account=pXXXX ## YOUR ACCOUNT pXXXX or bXXXX
#SBATCH --partition=gengpu ### PARTITION (buyin, short, normal, etc)
#SBATCH --nodes=1 ## how many computers do you need
#SBATCH --ntasks-per-node=4 ## CPU cores
#SBATCH --job-name=Ollama-batch-job ## When you run squeue -u <NETID> this is how you can identify the job
#SBATCH --time=3:30:00 ## how long does this need to run
#SBATCH --mem=40GB ## how much RAM do you need per node (this affects your FairShare score, so be careful to not ask for more than you need)
#SBATCH --gres=gpu:1 ## type of GPU requested, and number of GPU cards to run on
#SBATCH --output=output-%j.out ## standard out goes to this file
#SBATCH --error=error-%j.err ## standard error goes to this file
#SBATCH --mail-type=ALL ## you can receive email alerts from SLURM when your job begins and when your job finishes (completed, failed, etc)
#SBATCH --mail-user=email@northwestern.edu ## your email, non-Northwestern email addresses may not be supported
For more information on job setup or SBATCH flags, please refer to this page on the Slurm job scheduler, or specifically on Slurm configuration settings.
If your workflow requires a specific GPU type, such as A100 or H100, request it explicitly. For example, to request one A100 GPU card:
#SBATCH --gres=gpu:a100:1
For more details, please refer to the GPUs on Quest page.
Find and Export a Free Port#
Ollama runs as a local server on the compute node and must listen on an available port. The submission script finds an available port using the find_port function included in the submission script and then exports that value as an environment variable that the Python client uses to connect.
File: submit_create_stories.sh
OLLAMA_PORT=$(find_port localhost 7000 11000)
export OLLAMA_PORT
echo $OLLAMA_PORT
The Python script then uses the environment variable OLLAMA_PORT to connect to the server.
Load Required Modules and Set Environment Variables#
On Quest, loaded modules affect available libraries and compilers. Always purge previously loaded modules to avoid conflicts.
File: submit_create_stories.sh
module purge
module load ollama/0.11.4
module load gcc/12.3.0-gcc
module load mamba/24.3.0
When running your own code, you can update the Ollama version if needed. To see all Ollama module versions available on Quest, run:
$ module spider ollama
Next, export the host address so the Ollama server can listen for client requests. The second export makes the variable available inside the container used by the Ollama module.
File: submit_create_stories.sh
export OLLAMA_HOST=0.0.0.0:${OLLAMA_PORT}
export SINGULARITYENV_OLLAMA_HOST=0.0.0.0:${OLLAMA_PORT}
Start the Ollama Server#
With the module loaded and networking configured, start the Ollama server. All server activity is written to a log file for later inspection.
File: submit_create_stories.sh
# Start Ollama service
ollama serve &> serve_ollama_${SLURM_JOBID}.log &
sleep 30
After starting the server, the sleep command builds in time to allow the server to initialize before the Python client connects. If your script fails to connect to the server, check the file serve_ollama_<jobid>.log for startup errors. Increase the sleep duration if the server is not ready in time.
The log file serve_ollama_<jobid>.log will contain details about server startup, model loading, and runtime behavior.
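As an alternative to tuning a fixed sleep, a client script could poll the port until it accepts connections. The following is a minimal sketch of that idea, not part of the tutorial's files: wait_for_port is a hypothetical helper, and a successful TCP connection only indicates that something is listening, not that the server has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # Not accepting connections yet; retry shortly.
    return False
```

In create_stories.py, a call such as wait_for_port("localhost", int(os.environ["OLLAMA_PORT"])) placed before the first request would return False if the server never came up, which the script could treat as a reason to exit with an error.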
Activate the Virtual Environment#
We now activate the virtual environment created earlier. These commands initialize the mamba environment inside the Slurm job. All three lines of code are required for proper activation within batch jobs. Update the environment path if you created it in a different location, and change the script name if needed.
File: submit_create_stories.sh
# Activate virtual environment
eval "$('/hpc/software/mamba/24.3.0/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
source "/hpc/software/mamba/24.3.0/etc/profile.d/mamba.sh"
mamba activate /projects/<account_id>/envs/ollama-env
Run the Script#
Finally, include a line to run the Python script.
File: submit_create_stories.sh
# Run the Python script
python -u create_stories.py
The Python script connects to the running Ollama server, pulls the specified model if needed, generates stories, and writes the output files to disk.
After the job completes, check the data_out/ directory for generated CSV files. You can also review the Slurm output and error logs to confirm successful execution.
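A quick way to confirm that output was written is to list the CSV files under data_out/ along with their sizes. This is a small illustrative sketch; list_csv_outputs is a hypothetical helper name.

```python
from pathlib import Path


def list_csv_outputs(root: Path) -> list[Path]:
    """Return all CSV files under `root`, sorted by path (empty if `root` is missing)."""
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.csv"))


for csv_file in list_csv_outputs(Path("data_out")):
    print(csv_file, csv_file.stat().st_size, "bytes")
```

An empty listing or zero-byte files suggest the job stopped before any stories were saved, in which case the Slurm error log is the next place to look.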
Troubleshooting#
Common issues:
Model fails to download: Check available storage in the OLLAMA_MODELS location and the time limit for the batch job
Connection errors: Verify OLLAMA_PORT and increase the sleep time after the server is started
Module errors: Ensure the submission script runs module purge and then loads the required modules
Missing files: Verify file paths in create_stories.toml
Check Slurm output (output-<jobid>.out) and error (error-<jobid>.err) files for details.