Chat Templates#
You can find an interactive Colab version of this tutorial here.
In this tutorial we use text generation to demonstrate how to use chat templates with NNsight. We will run the Llama 3.3 70B Instruct model remotely to show how different chat formats can be used. With chat templates, rather than writing formatting code by hand each time, you can simply use a template to format chat inputs to any model.
This tutorial was adapted from Hugging Face's Templates tutorial.
Setup#
[ ]:
from IPython.display import clear_output

try:
    import google.colab
    is_colab = True
except ImportError:
    is_colab = False

clear_output()
[ ]:
if is_colab:
    !pip install --no-deps nnsight
    !pip install msgspec python-socketio[client]
    clear_output()
[ ]:
import nnsight
from nnsight import CONFIG
nnsight.CONFIG.APP.REMOTE_LOGGING = False
from nnsight import LanguageModel, util
import os
import torch
from transformers import AutoTokenizer
# include your HuggingFace token and NDIF API key (e.g., via Colab secrets)
!huggingface-cli login --token YOUR_HUGGINGFACE_TOKEN
CONFIG.set_default_api_key('YOUR_NDIF_API_KEY')
clear_output()
[ ]:
# load the Llama 3.3 70B Instruct model
model = LanguageModel("meta-llama/Llama-3.3-70B-Instruct", device_map="auto", dtype="bfloat16")
Applying a Chat Template#
In order to apply a chat template, each conversation turn should take a specific format: a list of dictionaries with ``role`` and ``content`` key-value pairs. The ``role`` key specifies the speaker, and the ``content`` key holds that speaker's message (for the ``system`` role, it describes how the model should respond). The ``user`` role is the human asking the question, and the ``assistant`` role provides context to the model about what has already been "said". To apply the template, we first load the tokenizer from the Llama 3.3 70B Instruct model so that we can convert the chat text into tokens the model understands.
Now we will pass a message to our model and see how it responds!
[ ]:
# load in tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
# define chat conversation
chat = [
    {"role": "system", "content": "You are a friendly chatbot who always responds like a teacher"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# convert the conversation into a format the model will understand
prompt = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
print(tokenizer.decode(prompt))
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a friendly chatbot who always responds like a teacher<|eot_id|><|start_header_id|>user<|end_header_id|>
How many helicopters can a human eat in one sitting?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
To generate a response, we will use ``.generate()`` and ``nnsight``'s remote execution.
[ ]:
with model.generate(prompt, max_new_tokens=128, remote=True) as gen:
    # save final generated tokens
    saved = model.generator.output.save()

# print each decoded output on a new line
for seq in saved:
    print(model.tokenizer.decode(seq, skip_special_tokens=True))
system
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a friendly chatbot who always responds like a teacheruser
How many helicopters can a human eat in one sitting?assistant
I think there may be a bit of a misconception here, my inquisitive student! Helicopters are not edible objects, and it's not possible for a human to eat one, let alone multiple helicopters in one sitting.
You see, helicopters are complex machines made of metal, plastic, and other materials, and they are not meant to be consumed as food. In fact, it would be quite harmful to try to eat a helicopter, as it could cause serious injury or even be fatal.
So, the answer to your question is zero - a human cannot eat any helicopters in one sitting, or ever, for that matter! But
As you can see, the prompt we give the chat model has a huge influence on how the model responds. Let's try some other prompts to see the difference in what the model generates:
[ ]:
# change the system prompt
chat = [
    {"role": "system", "content": "You are a serious chatbot who always responds like a news reporter"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)

with model.generate(prompt, max_new_tokens=128, remote=True) as gen:
    saved = model.generator.output.save()

for seq in saved:
    print(model.tokenizer.decode(seq, skip_special_tokens=True))
system
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a serious chatbot who always responds like a news reporteruser
How many helicopters can a human eat in one sitting?assistant
(Breaking News Theme Music Plays)
I'm reporting live from the desk, and we have a rather unusual question on our hands. The inquiry at hand is: "How many helicopters can a human eat in one sitting?"
(Pause for dramatic effect)
After conducting a thorough investigation, our team has come to the conclusion that this question is, in fact, based on a false premise. Humans cannot eat helicopters, as they are complex machines made of metal, plastic, and other materials that are not consumable by humans.
(Live footage of a helicopter in flight appears on screen)
Helicopters are aircraft designed for transportation, rescue,
Chat Template Parameters#
Indicating the Start of a Response#
If you'd like to indicate the start of a response, you can use the ``add_generation_prompt`` parameter. This ensures the model will generate an assistant response rather than continue the user's message. Below, we format the chat with ``add_generation_prompt=False`` to see what the prompt looks like without it.
[ ]:
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
print(prompt)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a friendly chatbot who always responds like a teacher<|eot_id|><|start_header_id|>user<|end_header_id|>
How many helicopters can a human eat in one sitting?<|eot_id|>
Note: Not all models require generation prompts
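For comparison, here is a small sketch (reusing the ``chat`` and ``tokenizer`` defined above) with ``add_generation_prompt=True``: the formatted string now ends with an assistant header, which cues the model to begin its own reply.
[ ]:
# with the generation prompt, the template appends an assistant header at the end
prompt_with_header = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt_with_header)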
Continuing the Final Message#
In a similar way, the ``continue_final_message`` parameter determines whether the final message should be continued or whether a new message should start. This parameter is useful if you want to "prefill" a model response to improve the accuracy of a particular instruction.
[ ]:
final_chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

# continue the final (assistant) message instead of starting a new one
formatted_chat = tokenizer.apply_chat_template(final_chat, tokenize=True, continue_final_message=True)

with model.generate(formatted_chat, max_new_tokens=32, remote=True) as gen:
    saved = model.generator.output.save()

for seq in saved:
    print(model.tokenizer.decode(seq, skip_special_tokens=True))
Note: You shouldn't use ``add_generation_prompt`` and ``continue_final_message`` at the same time. ``add_generation_prompt`` adds tokens that start a new message, while ``continue_final_message`` removes end-of-sequence tokens. Using them together returns an error.
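As a quick illustration (a sketch reusing the ``chat`` and ``tokenizer`` above; the exact exception type and message depend on your ``transformers`` version):
[ ]:
try:
    # combining both options is invalid
    tokenizer.apply_chat_template(chat, add_generation_prompt=True, continue_final_message=True)
except Exception as e:  # recent transformers versions raise a ValueError here
    print(e)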
Multiple Templates#
Each model may have several different templates available for use. When a tokenizer has multiple templates, its chat template is stored as a dictionary, where each key is the name of a specific template. There is always a default template, which the ``apply_chat_template`` command looks for when no name is given.
To access other templates, you can simply add the ``chat_template`` parameter with the name of whichever template you wish to use.
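For example, here is a minimal sketch of selecting a named template. The template name "tool_use" below is hypothetical: it only works if your tokenizer actually defines a template with that name (Llama 3.3 Instruct may only ship a single default template), so check the tokenizer's ``chat_template`` attribute first.
[ ]:
# select a named template instead of the default one
# NOTE: "tool_use" is a hypothetical name -- only valid for tokenizers
# whose chat_template is a dictionary containing that key
prompt = tokenizer.apply_chat_template(
    chat,
    chat_template="tool_use",  # name of the template to apply
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)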
Model Training#
Chat templates can be applied as a preprocessing step before model training, as a way to ensure the chat format matches the tokens a model is trained on. You can set ``add_generation_prompt=False``, because the extra tokens meant to start a reply aren't needed while training.
It is important to note that some tokenizers add special ``<bos>`` and ``<eos>`` tokens. Chat templates usually already include all the special tokens they need, so adding them again is often incorrect or duplicated, hurting model performance. When you format text with ``apply_chat_template(tokenize=False)``, make sure you also set ``add_special_tokens=False`` when you tokenize that text, to avoid duplicating them.
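Putting this together, a minimal preprocessing sketch might look like the following. Here ``train_chats`` is a hypothetical list of conversations in the role/content format shown earlier; it is not defined in this tutorial.
[ ]:
# format each training conversation with the chat template,
# without the extra tokens that start a model reply
formatted = [
    tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
    for chat in train_chats  # hypothetical list of role/content conversations
]

# the template already inserts the special tokens it needs,
# so disable them here to avoid duplicating <bos>/<eos>
tokenized = tokenizer(formatted, add_special_tokens=False)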