Using Hugging Face on Quest#
Hugging Face provides tools for working with machine learning, including large language models (LLMs). Using the Hugging Face API and the transformers library, you can load, train, and fine-tune models on Quest.
Warning
Since Hugging Face is a repository for models that anyone can contribute to, use only models from trusted or verified sources. Some models contain malicious code. Pickle and other serialization formats are common carriers for harmful code.
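One way to act on this warning is to check a repository's file list before downloading and flag weight files that are likely pickle-based. The sketch below uses a hypothetical helper, flag_pickle_files, which is not part of any Hugging Face library. The extension list is an assumption covering common pickle-backed formats, and the file names are made up for illustration. It is a quick screening aid, not a security scan.

```python
from pathlib import PurePosixPath

# Assumption: extensions that commonly indicate pickle-based serialization.
PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".pickle", ".ckpt"}


def flag_pickle_files(filenames):
    """Return the file names whose extension suggests pickle-based storage."""
    return [
        name
        for name in filenames
        if PurePosixPath(name).suffix.lower() in PICKLE_SUFFIXES
    ]


# Example file list (made up for illustration).
files = ["config.json", "model.safetensors", "pytorch_model.bin", "vocab.txt"]
print(flag_pickle_files(files))  # ['pytorch_model.bin']
```

When a repository offers the same weights in the safetensors format, preferring those files avoids unpickling untrusted data.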
In this tutorial, you will:
Download and cache Hugging Face models on Quest
Run a fine-tuning workflow in Python
Submit a Slurm batch job using GPUs
Before starting, you should be familiar with Python, Quest, virtual environments, and basic large language model (LLM) usage. If not, see the Quest User Guide, Python on Quest, and Virtual Environments on Quest pages.
Replace placeholders in this tutorial as follows:
<netID>: your NetID
<account_id>: your Quest allocation (e.g., pXXXX)
What You Will Learn#
After completing this tutorial, you will be able to:
Run a Hugging Face model within a Slurm batch job on Quest
Store and reuse models across jobs using shared storage
Submit and manage batch jobs that use LLMs through Hugging Face
Prerequisites#
Before starting, ensure you have:
An active Quest allocation (e.g., pXXXX)
Access to GPU partitions (if using GPU models); General Access Quest allocations have access to GPU partitions
Familiarity with Slurm job submission
Create a Virtual Environment Including Hugging Face#
To use Hugging Face models in batch jobs on Quest, create a mamba or conda virtual environment with:
The Transformers package, which provides tools for loading, training, and fine-tuning models
Any additional packages required by your script
Connect to the Quest login nodes and load the mamba module:
$ module load mamba/24.3.0
Create and activate a virtual environment with mamba. The --prefix argument sets a custom environment location instead of the default (/home/<netID>/.conda/envs/). In the command below, replace <account_id> with your allocation so that the path points to your /projects directory. This example creates an environment called huggingface-env. Python 3.10 is recommended because Python versions newer than 3.12 may cause import errors with some packages.
$ mamba create --prefix=/projects/<account_id>/envs/huggingface-env -c conda-forge -c pytorch -c nvidia python=3.10 pytorch pytorch-cuda=12.1
$ mamba activate /projects/<account_id>/envs/huggingface-env
Install Hugging Face Transformers and its required dependencies using mamba. This installs the core Hugging Face libraries used in the script. Transformers is explicitly pinned to a stable 4.x release to ensure compatibility.
$ mamba install -c conda-forge "transformers>=4.30,<5" huggingface_hub accelerate tokenizers
Install supporting libraries used for data handling, evaluation, and visualization in the Python script. Also install any packages you want to use with your code.
$ mamba install -c conda-forge datasets evaluate scikit-learn matplotlib
Before running the tutorial script, verify the installation by opening Python in the terminal and running the command below. If the command executes without error, everything is installed correctly.
$ python
>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSequenceClassification,
...     TrainingArguments,
...     Trainer,
...     pipeline,
... )
Store Hugging Face Models#
To work with the models, you must download the model to Quest. The example script below pulls the model onto Quest from Hugging Face when the script is run.
When you download a model from Hugging Face, it stores cached model files (identified by unique hashes), which can range from several hundred megabytes to multiple gigabytes per model. The location of these files is determined by the HF_HOME environment variable.
By default, Hugging Face stores models under ~/.cache/huggingface in your home directory (/home/<netID>/). Because home directories on Quest have a storage limit of 80 GB, this location is not suitable for larger models or workflows that require multiple models.
To store models outside of your home directory, set the HF_HOME environment variable to a directory in /scratch or /projects. Scratch (/scratch) is best for very large models or temporary workflows as it has a 5 TB quota and files are automatically deleted after 30 days of inactivity. Projects (/projects) is best for long-term storage and workflows that are shared with others with access to your Quest allocation.
For example, to store models in scratch space, open a terminal, log in to Quest, and run the following commands:
mkdir -p /scratch/<netID>/HF/.cache/huggingface
echo "export HF_HOME=/scratch/<netID>/HF/.cache/huggingface" >> $HOME/.bashrc
After updating your ~/.bashrc, log out and back in before submitting jobs to ensure the variable is applied in Slurm jobs. Alternatively, you can run:
source ~/.bashrc
You can edit the mkdir and export commands above to store models in /projects/<account_id> instead of in /scratch.
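To confirm which cache location a Python process will use, you can inspect the HF_HOME variable from inside Python. The sketch below uses a hypothetical helper, resolve_hf_home, and simplifies the real lookup: it assumes the fallback is ~/.cache/huggingface and ignores other variables (such as XDG_CACHE_HOME) that can also influence the default.

```python
import os
from pathlib import Path


def resolve_hf_home(environ=None):
    """Return the cache root implied by HF_HOME, or the assumed default."""
    environ = os.environ if environ is None else environ
    default = Path.home() / ".cache" / "huggingface"
    return Path(environ.get("HF_HOME", default))


hf_home = resolve_hf_home()
print(f"Hugging Face cache root: {hf_home}")

# Warn if the cache would still land in the home directory.
if hf_home.is_relative_to(Path.home()):
    print("Warning: the cache is inside your home directory.")
```

Running this inside a batch job is a quick way to check that the variable set in ~/.bashrc is visible to Slurm jobs.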
Example Workflow#
The following Python script (fine_tuning.py) downloads, fine-tunes, and uses a model with the IMDB sentiment dataset. Run this script with a batch job or in an interactive session, not on a Quest Login Node, as running resource-intensive workloads on login nodes can negatively impact other users and may result in your processes being terminated.
The IMDB dataset contains movie reviews labeled as positive or negative and is commonly used for sentiment analysis. The model’s objective is to correctly classify each review in the dataset as either positive or negative.
The full workflow and all files necessary to run it can be found in our GitHub example scripts repository. The sections below walk through fine_tuning.py piece by piece; the full file is available in the repository.
Import Packages#
File: fine_tuning.py
### Imports ###
# HuggingFace libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline
from datasets import load_dataset, concatenate_datasets
import evaluate
# Machine learning
from sklearn.metrics import (
confusion_matrix, ConfusionMatrixDisplay, classification_report, roc_curve, auc, precision_recall_curve, average_precision_score
)
# Misc
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
Load Data#
Load the IMDB data from the datasets library. For this workflow, we use only a small subset of the full dataset. This approach includes:
Sampling each label separately to avoid selecting only negative (0) reviews, which appear first in the dataset
Selecting a fixed number of positive and negative examples to create a balanced training set
Randomizing the order to avoid label ordering effects
We use a small, fixed number of samples in this tutorial to keep training time short and make the workflow suitable for demonstration purposes. For real model training, you should increase the dataset size.
The datasets library stores data in memory-mapped Arrow files on disk, so it does not need to load the entire IMDB dataset into memory at once. This allows us to filter and sample the data efficiently without holding the full dataset in memory.
File: fine_tuning.py
### Load Data ###
dataset = load_dataset("imdb", split="train")
neg_dataset = dataset.filter(lambda x: x['label'] == 0).select(range(250))
pos_dataset = dataset.filter(lambda x: x['label'] == 1).select(range(250))
dataset = concatenate_datasets([neg_dataset, pos_dataset]).shuffle(seed=42)
print(Counter(dataset["label"]))
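To see why each label is sampled separately, the following sketch reproduces the idea with plain Python lists. The label list is a toy stand-in (an assumption for illustration), sorted by class so that all negative labels come first, as described above for the IMDB train split.

```python
import random
from collections import Counter

# Toy stand-in for a label column sorted by class (all 0s, then all 1s).
labels = [0] * 1000 + [1] * 1000

# Naive approach: taking the first 500 rows yields only negative labels.
naive = labels[:500]
print(Counter(naive))

# Per-label sampling: take 250 rows from each class, then shuffle the order.
neg_idx = [i for i, y in enumerate(labels) if y == 0][:250]
pos_idx = [i for i, y in enumerate(labels) if y == 1][:250]
balanced_idx = neg_idx + pos_idx
random.Random(42).shuffle(balanced_idx)
balanced = [labels[i] for i in balanced_idx]
print(Counter(balanced))
```

The first counter contains a single class, while the second contains 250 examples of each, which is the same check that print(Counter(dataset["label"])) performs in the tutorial script.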
Create the Model and Tokenize the Data#
For this step, we load:
A tokenizer to convert text into model inputs
A pre-trained model for classification
Load the Tokenizer#
First, we load the tokenizer associated with the chosen pre-trained model. The tokenizer converts text into model inputs. When we call AutoTokenizer.from_pretrained(model_name), Hugging Face automatically downloads the tokenizer files if they are not already available locally, and loads its vocabulary and configuration.
Create the Classification Model#
Next, we create the classification model using AutoModelForSequenceClassification.from_pretrained. This call downloads the pre-trained model weights from Hugging Face and initializes a sequence-classification model with two output labels. This model is then ready to be fine-tuned on the IMDB dataset.
File: fine_tuning.py
### Create Model and Train ###
# Model Name
model_name = "distilbert-base-uncased"
# Tokenizing
tokenizer = AutoTokenizer.from_pretrained(model_name)
def preprocess(text_sample):
    return tokenizer(text_sample["text"], truncation=True, padding="max_length", max_length=128)
tokenized_dataset = dataset.map(preprocess, batched=True)
# Create Fine-Tuned Classification Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
A max_length of 128 tokens is used here as a balanced choice between training speed and model performance. Larger values allow longer text inputs but require more GPU memory and longer training times.
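The effect of truncation=True and padding="max_length" can be illustrated on plain lists of token IDs. The helper below, pad_or_truncate, is a hypothetical simplification and not the tokenizer's implementation: real tokenizers also handle special tokens when truncating, and the padding ID of 0 is an assumption.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length or pad it, and build the attention mask."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=5)
print(long_ids, long_mask)  # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Every sequence ends up with exactly max_length entries, which is why larger values increase memory use even for short reviews.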
The model variable above holds a pre-trained Hugging Face model (DistilBERT) with a classification head. This is the base model that the trainer will fine-tune. Fine-tuning adjusts the model's weights so that it is more specific to our data.
Choose an Evaluation Metric#
To train the model, we need to choose a metric to evaluate the model performance. Here, we use accuracy as a simple example. Accuracy measures the percentage of correct predictions. For each prediction, we compare the model’s output to the true label: a correct prediction counts as 1, and an incorrect prediction counts as 0. After evaluating all samples, the percentage of correct predictions becomes the model’s overall accuracy.
File: fine_tuning.py
# Define accuracy metric
accuracy = evaluate.load("accuracy")
def compute_metrics(pred):
    preds = np.argmax(pred.predictions, axis=1)
    return accuracy.compute(predictions=preds, references=pred.label_ids)
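The same calculation can be written in plain Python to show what happens inside compute_metrics. This is an illustrative sketch with made-up logits and labels. The function names are hypothetical and it does not use the evaluate library.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up model outputs for four reviews (two scores per review).
logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.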
Train the Model#
The most important training parameters are:
per_device_train_batch_size: controls the number of samples processed per iteration on each device. Larger batch sizes require more memory, while very small batches can lead to noisy learning. For Quest, a batch size of 4 works well.
num_train_epochs: controls how many times the full dataset is processed. Once all batches have been processed, one epoch is complete. For example, with a batch size of 4 and 100 data points, one epoch consists of 25 batches (four data points per iteration).
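The relationship between dataset size, batch size, and epochs can be checked with a short calculation. This sketch assumes a single device, no gradient accumulation, and that a final partial batch still counts as a step. The helper name training_steps is hypothetical.

```python
import math


def training_steps(num_samples, batch_size, num_epochs=1):
    """Number of optimizer steps for the given dataset and batch settings."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


print(training_steps(100, 4))  # 25
print(training_steps(500, 4))  # 125
```

With the 500 training reviews sampled in this tutorial and a batch size of 4, one epoch corresponds to 125 steps.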
File: fine_tuning.py
# Training setup
training_args = TrainingArguments(
    output_dir="./test_fine_tuning",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=1000,
    push_to_hub=False,  # disable pushing to HF
    report_to="none",  # disable logging to wandb (weights and biases) and other places - otherwise it asks for API key
)
trainer = Trainer(  # uses the objects and functions created above
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
The trained model and logs are saved in the directory specified by output_dir (for example, ./test_fine_tuning). Update this path if you want to store outputs in /scratch or /projects.
Start training the model:
File: fine_tuning.py
trainer.train()
Evaluate the Model#
After training, we evaluate the model. Evaluation includes:
Loading the IMDB test split to evaluate the model on unseen data
Sampling 250 positive and 250 negative reviews to match the balanced training set
Tokenizing the sample reviews using the same processing function
Running the evaluation step with
trainer.evaluate(tokenized_test)
We downsample the test split in the same way as the training split to maintain a balanced evaluation set and to ensure the tutorial runs quickly.
File: fine_tuning.py
### Evaluate ###
test_dataset = load_dataset("imdb", split="test")
# Filter 250 negative and 250 positive samples
test_neg = test_dataset.filter(lambda x: x['label'] == 0).select(range(250))
test_pos = test_dataset.filter(lambda x: x['label'] == 1).select(range(250))
test_dataset = concatenate_datasets([test_neg, test_pos]).shuffle(seed=42)
# Tokenize
tokenized_test = test_dataset.map(preprocess, batched=True)
# Evaluate
trainer.evaluate(tokenized_test)
The output of trainer.evaluate(tokenized_test) should look similar to the following; the specific numerical values will differ.
{'eval_loss': 0.4489305317401886,
'eval_accuracy': 0.79,
'eval_runtime': 0.7058,
'eval_samples_per_second': 708.383,
'eval_steps_per_second': 89.256,
'epoch': 1.0}
The output returned by trainer.evaluate(tokenized_test) reports several metrics that summarize model performance. The most important values are eval_loss and eval_accuracy:
The evaluation loss (eval_loss) measures how well the model’s predictions match the true labels, with lower values indicating better performance.
The evaluation accuracy (eval_accuracy) shows the fraction of test samples the model classified correctly. In this example, the model achieves an accuracy of 0.79, meaning it correctly labeled 79% of the test reviews.
The remaining fields provide performance and runtime information, including the total evaluation time (eval_runtime), the number of samples processed per second, and the number of evaluation steps performed per second.
Finally, the epoch value indicates after which training epoch the evaluation was performed. Together, these metrics provide a concise summary of both the model’s predictive performance and the computational cost of evaluation.
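As a worked example, the reported metrics can be cross-checked against the size of the evaluation set. The sketch below copies the example output shown above into a dictionary and assumes the 500-review test sample used in this tutorial. Your own values will differ.

```python
# Example metrics copied from the sample output above.
metrics = {
    "eval_loss": 0.4489305317401886,
    "eval_accuracy": 0.79,
    "eval_runtime": 0.7058,
    "eval_samples_per_second": 708.383,
    "eval_steps_per_second": 89.256,
    "epoch": 1.0,
}
num_test_samples = 500  # 250 negative + 250 positive reviews

# Accuracy times the sample count gives the number of correct predictions.
num_correct = round(metrics["eval_accuracy"] * num_test_samples)
print(f"Correct predictions: {num_correct} of {num_test_samples}")

# Runtime times throughput should roughly recover the sample count.
approx_samples = metrics["eval_runtime"] * metrics["eval_samples_per_second"]
print(f"Samples implied by runtime and throughput: {approx_samples:.0f}")
```

An accuracy of 0.79 on 500 reviews corresponds to 395 correct predictions, and the runtime multiplied by the throughput comes out at about 500 samples.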
Predict Labels#
After training, use the model to predict labels. The first step is to create a classification pipeline with our model.
File: fine_tuning.py
### Predict ###
classif_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
Use the classification pipeline to check whether negative, positive, and ambivalent reviews are correctly predicted:
File: fine_tuning.py
# Example outputs are shown as comments; exact scores will differ between runs
print(classif_pipeline("Wow, this is the worst movie I've seen in my entire life"))
# [{'label': 'LABEL_0', 'score': 0.8197433948516846}]
print(classif_pipeline("Wow, this is the best movie I've seen in my entire life"))
# [{'label': 'LABEL_1', 'score': 0.897377610206604}]
print(classif_pipeline("Wow, I don't know how to feel about it"))
# [{'label': 'LABEL_0', 'score': 0.605690598487854}]
The negative and positive reviews are predicted correctly, with confidence scores above 0.8. The ambivalent review is predicted as negative (LABEL_0), with a lower confidence score of about 0.61.
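The generic names LABEL_0 and LABEL_1 correspond to the dataset's label values 0 (negative) and 1 (positive). The following sketch post-processes pipeline-style output into readable names. The helper describe and the 0.7 confidence cutoff are assumptions chosen for illustration, and the inputs are the example outputs shown above.

```python
# Mapping from the pipeline's generic label names to IMDB sentiment names.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def describe(prediction, min_score=0.7):
    """Turn one pipeline result into a readable label, flagging low scores."""
    name = LABEL_NAMES[prediction["label"]]
    if prediction["score"] < min_score:
        return f"{name} (low confidence)"
    return name


# Example pipeline outputs copied from above.
outputs = [
    {"label": "LABEL_0", "score": 0.8197433948516846},
    {"label": "LABEL_1", "score": 0.897377610206604},
    {"label": "LABEL_0", "score": 0.605690598487854},
]
for out in outputs:
    print(describe(out))
```

With these inputs, the third review is reported as negative with a low-confidence flag, which matches the interpretation of the ambivalent review.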
Running Your LLM Script on Quest as a Batch Job#
To run the above script, fine_tuning.py, as a batch job on Quest, use the submission script submit-fine-tuning.sh. The full script is available in the GitHub example scripts repository. The sections below discuss the components of the job submission script.
To submit the job, use the command
sbatch submit-fine-tuning.sh
Slurm Job Configuration#
Use the following SBATCH options to submit the job. Usually, you only need to update your allocation and email. If you change the batch size, model type, or other Hugging Face-specific settings, adjust GPU and time requirements accordingly.
This example benefits from GPUs but can run on CPUs with longer runtime.
File: submit-fine-tuning.sh
#!/bin/bash
#SBATCH --account=pXXXX ## YOUR ACCOUNT pXXXX or bXXXX
#SBATCH --partition=gengpu ### PARTITION
#SBATCH --nodes=1 ## how many computers do you need
#SBATCH --ntasks-per-node=4 ## CPU cores
#SBATCH --job-name=HuggingFace-batch-job ## When you run squeue -u <NETID> this is how you can identify the job
#SBATCH --time=3:30:00 ## how long does this need to run
#SBATCH --mem=40GB ## how much RAM do you need per node (this affects your FairShare score, so be careful to not ask for more than you need)
#SBATCH --gres=gpu:1 ## type of GPU requested, and number of GPU cards to run on
#SBATCH --output=output-%j.out ## standard out goes to this file
#SBATCH --error=error-%j.err ## standard error goes to this file
#SBATCH --mail-type=ALL ## you can receive email alerts from SLURM when your job begins and when your job finishes (completed, failed, etc)
#SBATCH --mail-user=email@northwestern.edu ## your email, non-Northwestern email addresses may not be supported
For more information on job setup or SBATCH flags, please refer to this page on the Slurm job scheduler, or specifically on Slurm configuration settings.
If your workflow requires a specific GPU type, either A100 or H100, you can request it explicitly. For example, to request one A100 GPU card:
#SBATCH --gres=gpu:a100:1
For more information on GPU specifications, please refer to the GPUs on Quest page.
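To confirm that a job actually sees the GPU requested with --gres, a short check can be placed near the top of the Python script. This is a minimal sketch, not part of the tutorial's fine_tuning.py. The helper pick_device is hypothetical and falls back to "cpu" when PyTorch is not installed or no GPU is visible.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If the job output reports cpu even though a GPU was requested, check the partition and --gres settings in the submission script.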
Load Required Modules#
On Quest, loaded modules affect available libraries and compilers. Always purge previously loaded modules to avoid conflicts.
File: submit-fine-tuning.sh
module purge
module load mamba/24.3.0
Activate the Virtual Environment#
Activate the virtual environment created earlier. These commands initialize the mamba environment inside the Slurm job. All three lines of code are required for proper activation within batch jobs. Update the environment path if you created it in a different location, and change the script name if needed.
File: submit-fine-tuning.sh
# activate virtual environment
eval "$('/hpc/software/mamba/24.3.0/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
source "/hpc/software/mamba/24.3.0/etc/profile.d/mamba.sh"
mamba activate /projects/p12345/envs/huggingface-env # Make sure to change the path of this environment to point to where your virtual environment is located.
Run the Script#
Finally, include a line to run the Python script.
File: submit-fine-tuning.sh
# Run the Python script
python -u /path/to/python/script/fine_tuning.py
If successful, your job will complete without errors, and output files will appear in your working directory or specified output path.
Troubleshooting#
Common issues:
Model fails to download: Check the available storage in your HF_HOME location and the time limit on the batch job
Module errors: Ensure that module purge is run and the required modules are loaded
Missing files: Verify file paths
Check Slurm output (output-<jobid>.out) and error (error-<jobid>.err) files for details.
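For the storage check, Python's standard library can report the free space on the filesystem that holds a directory. The sketch below uses a hypothetical helper, free_gib. It reports filesystem-level free space, which is an assumption-laden proxy: it may not reflect your personal quota on Quest.

```python
import os
import shutil
from pathlib import Path


def free_gib(path):
    """Free space in GiB on the filesystem holding path."""
    p = Path(path).expanduser()
    # Walk up to the nearest existing parent if the directory is not created yet.
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / 1024**3


# Fall back to the assumed default cache location if HF_HOME is unset.
cache_dir = os.environ.get("HF_HOME", "~/.cache/huggingface")
print(f"Free space near {cache_dir}: {free_gib(cache_dir):.1f} GiB")
```

Comparing this number with the size of the model you plan to download gives an early indication of whether the download can succeed.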