Multiple Token Generation#
Summary#
NNsight supports multiple token generation using the .generate()
method.
with model.generate(prompt, max_new_tokens=N):
    out = model.generator.output.save()
There are a couple of methods for intervening on models during generation. If you’d like to apply interventions to all generation iterations, you should use the with tracer.all(): context.
[ ]:
with model.generate(prompt, max_new_tokens=N) as tracer:
    with tracer.all():
        # Apply intervention to each generation iteration
If you’d like to apply interventions to only specific generation iterations (e.g., the slice [1:5] or the list [4, 7]), you should use the with tracer.iter[<slice>]: context.
[ ]:
intervention_slice = <slice>  # up to N
with model.generate(prompt, max_new_tokens=N) as tracer:
    with tracer.iter[intervention_slice] as idx:
        # Apply intervention to only specific generation iterations
When to Use#
.generate() is used whenever you’d like to generate multiple tokens at once. This may be useful in chat contexts, or in more complex experiments that require generating more than one token of text.
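For a sense of the difference, a single forward pass with model.trace() only exposes the model’s prediction for the next token, while .generate() runs the model autoregressively for N tokens. A minimal sketch of the two (assuming a LanguageModel model and a prompt string, as in the examples below):
[ ]:
# One forward pass - logits for only the next token
with model.trace(prompt):
    logits = model.lm_head.output.save()

# Autoregressive generation - N new tokens in one context
with model.generate(prompt, max_new_tokens=5):
    out = model.generator.output.save()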
How to Use#
Let’s load up gpt2 to try out generation in NNsight.
[2]:
from nnsight import LanguageModel
model = LanguageModel('openai-community/gpt2', device_map='auto')
.generate()#
NNsight’s LanguageModel class supports multiple token generation with .generate(). You can control the number of new tokens generated by setting max_new_tokens = N within your call to .generate().
[3]:
prompt = 'The Eiffel Tower is in the city of'
n_new_tokens = 3
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    out = model.generator.output.save()

decoded_prompt = model.tokenizer.decode(out[0][0:-n_new_tokens].cpu())
decoded_answer = model.tokenizer.decode(out[0][-n_new_tokens:].cpu())

print("Prompt: ", decoded_prompt)
print("Generated Answer: ", decoded_answer)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
You have set `compile_config`, but we are unable to meet the criteria for compilation. Compilation will be skipped.
Prompt: The Eiffel Tower is in the city of
Generated Answer: Paris, and
.all() applies interventions to all generated tokens#
With nnsight 0.4 and later, you can use .all() to recursively apply interventions to a model. To use .all(), you create an .all() context using the tracer object. Any intervention on a module’s .input or .output defined within the .all() context will be applied recursively across all iterations of the generation.
Let’s try using .all() to streamline the multiple token generation process. We simply call .all() on the tracer, apply our intervention, and append our hidden states to a list that we .save().
[4]:
# using .all():
prompt = 'The Eiffel Tower is in the city of'
layers = model.transformer.h
n_new_tokens = 50
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    hidden_states = list().save()  # Initialize & .save() the list

    # Call .all() to apply the intervention to each new token
    with tracer.all():
        # Apply intervention - set first layer output to zero
        layers[0].output[0][:] = 0

        # Append desired hidden state post-intervention
        hidden_states.append(layers[-1].output)  # no need to call .save()
        # Don't need to loop or call .next()!

print("Hidden state length: ", len(hidden_states))
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hidden state length: 50
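Since hidden_states now holds one entry per generated token, you can post-process it however you like. As a sketch (assuming the cell above has run), here is one way to stack the final-position hidden state from each iteration; note that each entry is the block’s output tuple, whose first element is the hidden-state tensor:
[ ]:
import torch

# Take the last position of each iteration's hidden state and stack them.
# For gpt2 this should yield a (50, 768) tensor - one row per new token.
stacked = torch.cat([h[0][:, -1, :] for h in hidden_states], dim=0)
print(stacked.shape)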
.iter() for interventions only on specific generation iterations#
You can use with tracer.iter[<slice>]: to intervene only on specific iterations of generation. (Note: if needed, you can grab the index of the current iteration with with tracer.iter[<slice>] as idx:.)
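For instance, a minimal sketch of using the iteration index (hypothetical values; this assumes tracer.log prints from within the tracing context):
[ ]:
with model.generate(prompt, max_new_tokens=10) as tracer:
    with tracer.iter[0:10] as idx:
        # Log which generation iteration we are currently on
        tracer.log("iteration:", idx)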
Let’s try intervening on only iterations 2 through 4 (i.e., the slice [2:5]) of generation for the above prompt:
[5]:
# using .iter[]:
prompt = 'The Eiffel Tower is in the city of'
layers = model.transformer.h
n_new_tokens = 50
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    hidden_states = list().save()  # Initialize & .save() the list

    # Call .iter[2:5] to apply the intervention only to iterations 2-4
    with tracer.iter[2:5]:
        # Apply intervention - set first layer output to zero
        layers[0].output[0][:] = 0

        # Append desired hidden state post-intervention
        hidden_states.append(layers[-1].output)  # no need to call .save()
        # Don't need to loop or call .next()!

print("Hidden state length: ", len(hidden_states))
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hidden state length: 3
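The saved list has one entry per intervened iteration. To see the effect on the generated text itself, you could also save the generator output in the same trace and decode it, reusing the pattern from the .generate() example above (a sketch, not run here):
[ ]:
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    out = model.generator.output.save()
    # Zero out the first layer's output only on iterations 2-4
    with tracer.iter[2:5]:
        layers[0].output[0][:] = 0

print(model.tokenizer.decode(out[0][-n_new_tokens:]))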