Using Ollama on Quest#

Ollama is a framework for building and running Large Language Models (LLMs) on local machines. To use Ollama and the models it makes available on Quest, follow the tutorial below.

This tutorial will show you how to run a Large Language Model (LLM) on Quest as a batch job, which includes setting up the connection to the Ollama server, pulling models, and running an example Python script that uses one of those models.

The sections below demonstrate how to choose where to store your models, how to create a virtual environment with the required software, and how to run an example workflow, including a Python script and a submission script for running the job in batch on Quest.

This article assumes that you are already familiar with Python, Quest, and virtual environments, and that you have some knowledge of LLM implementation. If you are not familiar with these topics yet, please refer to the Quest User Guide, Python on Quest, and Virtual Environments on Quest pages.

While Ollama can be used as an interactive chatbot on Quest, its chatbot functionality cannot be used outside of Quest, as it is not accessible to users who are not part of Quest. For this reason, we do not go in depth on how to use Ollama's chatbot functionality on Quest in this article. However, if you do require this Ollama feature, please reach out to us at quest-help@northwestern.edu to set up a consultation to go over your workflow.

Storing Ollama Models#

When you pull models from Ollama in your workflow, the model files are stored, by default, in your home directory /home/<netid>/. Ollama names these files by their hash, a digital fingerprint of the model contents, so that it can recognize which models have already been downloaded and reuse them for subsequent jobs without pulling them again.

While the commands used to pull models down to Quest are discussed in later sections, you should decide where to store your models before running your workflow. For bigger models, consider storing them somewhere other than your home directory, since your home directory only has 80 GiB of storage. To change the location where models are stored, run the following command:

echo "export OLLAMA_MODELS=/scratch/<netID>/path/to/Ollama-Models" >> $HOME/.bashrc

Instead of saving the models in scratch space, you can also make this location point to a directory in your /projects/pXXXXX folder. The key differences between scratch space and your projects space are:

  • Your scratch space has 5TB of storage, whereas your projects directory has either 1 or 2TB of storage

  • Data stored in your scratch space that has a “last modified” date that surpasses 30 days will automatically be deleted, whereas data stored in your projects directory will not be automatically deleted.

Based on the size of your model and how long you would like to keep it stored on Quest, you can decide whether to store it in your scratch space or in your projects directory.
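For example, to point Ollama at a directory in your projects space instead (replace pXXXXX and the folder name with your own):

echo "export OLLAMA_MODELS=/projects/pXXXXX/Ollama-Models" >> $HOME/.bashrc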

Running the Ollama Server#

Ollama uses a client-server architecture, which makes it easier for you to set up AI models. The ollama serve command is the core component that initiates the Ollama server process. Once ollama serve is run, the Ollama API becomes available for use by other applications and scripts.

#start Ollama service
ollama serve &> serve_ollama_${SLURM_JOBID}.log &

#wait until Ollama service has been started
sleep 30

The serve_ollama_${SLURM_JOBID}.log log file will contain all information regarding the server connection and model runtime. Once the connection to the server has been made, you, the client, can interact with the server to manage the models you want to work with, using commands such as ollama pull and ollama run, or their equivalents in the Ollama Python package, which you will see in the Python script.
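For example, once the server is running you could pull a model and list the models already stored locally from the command line (the model name here is just an example; use whichever model your workflow needs):

ollama pull llama3.2   # download a model from the Ollama library
ollama list            # show the models currently stored in OLLAMA_MODELS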

Creating a Virtual Environment including Ollama#

If you want to use Ollama models in your batch jobs on Quest, you will need to create a virtual environment that contains the Ollama Python package, as well as all the other packages your Python script requires. To read more about virtual environments, please see our article on mamba or conda virtual environments.

First, load the mamba module on Quest:

$ module load mamba/24.3.0

Next, create a virtual environment with Python, Ollama and any other packages you need. The --prefix argument creates the virtual environment in a specified location, rather than in the default location (/home/<net_id>/.conda/envs/).

$ mamba create --prefix=/projects/p12345/envs/ollama-env -c conda-forge python=3.12

$ mamba activate /projects/p12345/envs/ollama-env 

$ pip install ollama # Ollama needs to be installed with pip for functionality purposes

$ mamba install -c conda-forge pandas matplotlib
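As an optional sanity check, you can verify that the ollama package imports correctly in the new environment:

$ python -c "import ollama; print('ollama package imported successfully')"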

Example Workflow#

One way to work with LLMs on Quest is to use the models that are available through Ollama. The workflow outlined below uses a Python script called create_stories.py. The full workflow and all files necessary to run it can be found on our GitHub page. It might be helpful to reference the full script while reading through this tutorial, as we will break down parts of it below.

create_stories.py is a sample script that uses the Ollama package to run a large language model and generate a series of stories based on an author's style, a genre, and a topic. This template can also help you get started on your own projects.

Create_stories.toml#

To change configuration options, such as the model you are working with or the downsample size, edit the create_stories.toml file.

The system_message path in the create_stories.toml file points to the prompt that you give the model. In the case of this workflow, the prompt is:

"You are a master storyteller.
Your task is to create a long story (approximately 15 paragraphs) about {story_topic}.
You must write this story in the literary genre of {story_genre}.
You must write this story in the style of {story_author}.

At the beginning of your story, write a line indicating the author ({story_author} in this case), the genre ({story_genre}), and the topic ({story_topic}). Then start the story."

The paths in the [literary-elements] section point to files that contain all the authors, genres, and topics you would like to use to generate these stories.

The [saving] section specifies how the generated responses will be saved. save_yn and save_all_yn both take a boolean value of 0 or 1. save_yn determines whether the responses are saved in files sorted by author, and save_all_yn determines whether a file will be created with all the responses combined.

The [downsampling] section allows you to choose whether you want to downsample and, if so, how many elements you would like to include from each literary element. downsample_yn can have a value of 0 or 1 and indicates whether to downsample. If downsample_yn equals 1, then the number specified by downsample_quantity will be the size of the subsample. Be aware that the subsample cannot be bigger than the number of literary elements you have in your text files.

Lastly, the [model] section is where you specify which model you would like to use for your workflow. This model has to be available through Ollama.
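Putting this together, here is a hypothetical sketch of what a create_stories.toml file could look like. The section and key names match what create_stories.py reads, but the file paths and values below are placeholders that you should adapt to your own files; you can create the file from the command line as shown, or simply write the same contents in a text editor:

cat > create_stories.toml << 'EOF'
[system]
# path to the text file containing the prompt shown above (placeholder path)
system_message = "sysm/system_message.txt"

[literary-elements]
# text files with one author, genre, or topic per line (placeholder paths)
authors = "literary-elements/authors.txt"
genres = "literary-elements/genres.txt"
topics = "literary-elements/topics.txt"

[saving]
save_yn = 1        # 1 = save one CSV per author, 0 = skip
save_all_yn = 1    # 1 = also save a single combined CSV, 0 = skip

[downsampling]
downsample_yn = 1          # 1 = downsample, 0 = use every element
downsample_quantity = 3    # number of elements kept from each list

[model]
llm_model = "llama3.2"     # any model available through Ollama
EOF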

Create_stories.py#

After you have set up the create_stories.toml file, you are ready to run the create_stories.py script.

The first thing the create_stories.py script does is import all the packages needed for the script to run.

# :: IMPORTS ::
import ollama
import pandas as pd
from datetime import datetime
from pathlib import Path
import tomllib
from ollama import Client
import os

For this example workflow, the above packages are necessary. When running your own Python script, make sure to import all packages that are needed for the script to run.

Once you have listed the packages to import, the script includes a couple of lines of code that establish a connection to a port specified in the submission script. We will go into more depth on this later in the tutorial.

# connect to port
client = Client(
   host="http://localhost:" + os.environ.get("OLLAMA_PORT")
)

We will then load the parameters that you listed in the create_stories.toml file:

# Load parameters and directory info
master_directory = Path("create_stories.toml")
with open(master_directory, "rb") as f:
    config_params = tomllib.load(f)

If your create_stories.toml file is in a different location than the create_stories.py script, replace the relative path passed to Path with the absolute path to your create_stories.toml file.

The following piece of code gets the name of the model that you want to pull from Ollama, the prompt you created for the LLM, and the literary elements from the create_stories.toml file. As with the previous block of code, make sure that the files are in the locations given by the relative paths, or otherwise change them to absolute paths leading to the files specified in the create_stories.toml file.

# Get model's name (change in create_stories.toml if desired)
llm_model = config_params["model"]["llm_model"]
client.pull(llm_model)

# Load system instructions (found in sysm folder. adapt to your case)
sysm_file = Path(config_params["system"]["system_message"])
with open(sysm_file, "r") as f:
    sysm = f.read()

# Format literary elements
elements_lists = {}
for element, element_file in config_params["literary-elements"].items():
    element_file = Path(f"{element_file}")
    with open(element_file, "r") as f:
        elements_lists[element] = [x.strip() for x in f.readlines()]

If you specified downsampling parameters in the create_stories.toml file, the block of code below applies them:

# :: DOWNSAMPLE Y/N::
# Only if specified
if config_params["downsampling"]["downsample_yn"]:
    final_elements = {k:v[0:config_params["downsampling"]["downsample_quantity"]] for k,v in elements_lists.items()}
else:
    final_elements = elements_lists

Once we have set all the parameters, we can generate the stories based on the authors, topics, and genres specified in the text files listed in the create_stories.toml file. The lines of code below define functions that generate the stories and collect them in a DataFrame. The functions save_per_author and save_all_in_one take the data from the DataFrame and save it to CSV files separated by author, as well as to a single CSV file that has all stories combined.

# :: DEFINE GENERATING AND SAVING FUNCTIONS ::
def generate_author_responses(final_elements:dict[list], author:str, sysm:str, llm_model:str):
    responses = []
    for topic in final_elements["topics"]:
        for genre in final_elements["genres"]:
            # Using Ollama. May need to change if using another package.
            response = ollama.generate(
                model = llm_model,
                prompt = sysm.format(story_topic = topic, story_genre = genre, story_author = author)
            )
            story = response["response"]
            responses.append({
                "author": author,
                "topic": topic,
                "genre": genre,
                "story": story + "\n" + "**END**\n"
            })
    responses = pd.DataFrame(responses)
    return responses

def save_per_author(responses_author:pd.DataFrame, author:str, data_directory:Path):
    # Save author responses
    author_file = data_directory / Path(f"response_{author}.csv")
    responses_author.to_csv(author_file, index=False)

def save_all_in_one(responses:pd.DataFrame, data_directory:Path):
    # Appends to csv with all responses. possibly slow bc opening and closing file?
    full_file = data_directory / Path(f"response_all.csv")
    with open(full_file, "a") as f:
        responses.to_csv(f, index = False, header=(f.tell()==0))

Lastly, create_stories.py has a block of code that takes care of generating and saving files. If you would like to change whether the responses are saved by author, all responses are combined into one file, or both, you can change this in the create_stories.toml file.

# :: GENERATE AND SAVE FILES ::
# Create new data_date folder (if it doesn't exist):
dt = datetime.now().strftime('%Y_%m_%d')
data_directory = Path(f"data_out/data_{dt}")
data_directory.mkdir(parents=True, exist_ok=True)  # parents=True also creates data_out if it does not exist yet

# Run it - This loop saves after each generation. Maybe slower, but uses less memory than generating all first.
for author in final_elements["authors"]:
    responses = generate_author_responses(final_elements, author, sysm, llm_model)
    if config_params["saving"]["save_yn"]:
        save_per_author(responses, author, data_directory)
    if config_params["saving"]["save_all_yn"]:
        save_all_in_one(responses, data_directory)

Running your LLM script on Quest as a Batch Job#

To run the above script, create_stories.py, as a batch job on Quest, use the submission script submit_create_stories.sh.

First, here is an example of the SBATCH flags you can use to submit this job. You likely will not have to change anything besides your allocation and email. However, if you change the downsample size, model type, or other Ollama-specific variables, the requested time and number of GPU cards may need to be adjusted.

#!/bin/bash
#SBATCH --account=pXXXX  ## YOUR ACCOUNT pXXXX or bXXXX
#SBATCH --partition=gengpu  ### PARTITION (buyin, short, normal, etc)
#SBATCH --nodes=1 ## how many computers do you need 
#SBATCH --ntasks-per-node=4 ## how many cpus or processors do you need on each computer
#SBATCH --job-name=Ollama-batch-job ## When you run squeue -u <NETID> this is how you can identify the job
#SBATCH --time=3:30:00 ## how long does this need to run 
#SBATCH --mem=40GB ## how much RAM do you need per node (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --gres=gpu:h100:1 ## type of GPU requested, and number of GPU cards to run on
#SBATCH --output=output-%j.out ## standard out goes to this file
#SBATCH --error=error-%j.err ## standard error goes to this file
#SBATCH --mail-type=ALL ## you can receive e-mail alerts from SLURM when your job begins and when your job finishes (completed, failed, etc)
#SBATCH --mail-user=email@northwestern.edu ## your email, non-Northwestern email addresses may not be supported

If you have any additional questions on how to set up a job or about the SBATCH flags, please refer to this page on the Slurm job scheduler, or specifically on Slurm configuration settings.

The next part of the submission script is a set of helper functions that find a free port on the cluster so that the Ollama server can run on it. After a port has been found, these lines of code export the chosen port and write it to our output file:

# Find available port to run server on
OLLAMA_PORT=$(find_port localhost 7000 11000)
export OLLAMA_PORT
echo $OLLAMA_PORT
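For reference, here is a minimal, hypothetical sketch of what such a port-finding helper can look like; the actual find_port function is defined in submit_create_stories.sh on our GitHub page and may differ:

# hypothetical sketch: return the first port in the given range that is not in use on this node
find_port () {
    local host=$1 low=$2 high=$3   # the host argument is accepted for compatibility; this sketch only checks the local node
    for port in $(seq "$low" "$high"); do
        if ! ss -tln | grep -q ":${port} "; then
            echo "$port"
            return 0
        fi
    done
    return 1
}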

The variable OLLAMA_PORT can then be used in the create_stories.py script to connect to the port:

### From create_stories.py ###
# connect to port
client = Client(
   host="http://localhost:" + os.environ.get("OLLAMA_PORT")
)

Next, we unload any previously loaded modules and load the modules needed to run Ollama, as well as the module needed to activate our virtual environment:

module purge
module load ollama/0.11.4
module load gcc/12.3.0-gcc
module load mamba/24.3.0

To see what versions of Ollama are currently available on Quest, use the command:

$ module spider ollama

We will then export the host address (IP and port) for Ollama to listen on. The second export statement is needed so that the address can also be used by the actual Ollama application, which is containerized.

export OLLAMA_HOST=0.0.0.0:${OLLAMA_PORT} # address and port the Ollama server will listen on
export SINGULARITYENV_OLLAMA_HOST=0.0.0.0:${OLLAMA_PORT} # pass the same address into the Ollama container

The next couple of lines of code start up the actual Ollama server, create a log file where all server activity is recorded, and make sure that Ollama has enough time to start up by using a sleep statement:

#start Ollama service
ollama serve &> serve_ollama_${SLURM_JOBID}.log &

#wait until Ollama service has been started
sleep 30
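The fixed 30-second sleep is sufficient in most cases. As an optional alternative (not part of the original submission script), you can instead poll the Ollama API until the server responds:

# optional alternative to the fixed sleep: poll the Ollama API until the server responds
for i in $(seq 1 30); do
    if curl -s "http://localhost:${OLLAMA_PORT}/api/tags" > /dev/null; then
        echo "Ollama server is up"
        break
    fi
    sleep 2
done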

The last steps of the submission script activate the virtual environment and run the Python script you want to run. These are really the only lines you will need to edit in this entire script (besides any SBATCH flags or module versions, if your code requires it). Make sure that you change the path to your virtual environment, as well as the name of the Python script if you are using a different one.

# activate virtual environment
eval "$('/hpc/software/mamba/24.3.0/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
source "/hpc/software/mamba/24.3.0/etc/profile.d/mamba.sh"
mamba activate /projects/p12345/envs/ollama-env # Make sure to change the path of this environment to point to where your virtual environment is located

#Run the python script
python -u create_stories.py
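Once you have made these changes, submit the job from the directory that contains your scripts and check on its status:

$ sbatch submit_create_stories.sh
$ squeue -u <netid>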

If you have any questions regarding this tutorial, please feel free to submit a ticket by emailing quest-help@northwestern.edu.