Info
Last Execution: 2026-02-17
| Package | Version |
|---|---|
| nnsight | 0.5.15 |
| Python | 3.12.3 |
| torch | 2.10.0+cu128 |
| transformers | 5.2.0 |
Model Editing¶
Interventions inside model.trace() are temporary — they only apply during that forward pass. With model.edit(), you can create persistently modified versions of a model that apply interventions on every forward pass.
Setup¶
import torch
import torch.nn as nn
from nnsight import LanguageModel
model = LanguageModel("openai-community/gpt2", device_map="auto", dispatch=True)
Creating an Edited Model¶
Use model.edit() to define persistent interventions. By default, this creates a new model reference — the original model is unchanged.
# First, capture hidden states that produce "Paris"
with model.trace("The Eiffel Tower is in the city of"):
    paris_hs = model.transformer.h[-1].output[0][:, -1, :].save()
# Create an edited model that always injects "Paris" hidden states at the last layer
with model.edit() as model_edited:
    model.transformer.h[-1].output[0][:, -1, :] = paris_hs
# Original model still works normally
with model.trace("Vatican is in the city of"):
    original = model.lm_head.output.argmax(dim=-1).save()
# Edited model always predicts "Paris"
with model_edited.trace("Vatican is in the city of"):
    modified = model.lm_head.output.argmax(dim=-1).save()
print(f"Original: {model.tokenizer.decode(original[0, -1])}")
print(f"Edited: {model.tokenizer.decode(modified[0, -1])}")
Original: Rome
Edited: Paris
The edit persists across multiple calls:
prompts = [
"The Colosseum is in the city of",
"Big Ben is in the city of",
"The Statue of Liberty is in the city of",
]
for prompt in prompts:
    with model_edited.trace(prompt):
        tokens = model.lm_head.output.argmax(dim=-1).save()
    print(f"{prompt} → {model.tokenizer.decode(tokens[0, -1])}")
The Colosseum is in the city of → Paris
Big Ben is in the city of → Paris
The Statue of Liberty is in the city of → Paris
How editing works
model.edit() records your interventions and replays them on every subsequent forward pass through the edited model. The edit context uses the same syntax as model.trace() — you access .output and .input on modules the same way.
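For intuition only, this replay behavior resembles a plain PyTorch forward hook that stays registered across calls. The sketch below is a loose analogy, not how nnsight implements edits, and the hook function is hypothetical:

```python
import torch
import torch.nn as nn

# Analogy only: a registered forward hook replays the same intervention
# on every forward pass, much like an edit persists across traces.
layer = nn.Linear(4, 4)

def zero_last_feature(module, inputs, output):
    out = output.clone()
    out[:, -1] = 0.0  # the "edit": zero the last feature on every call
    return out        # returned tensor replaces the module's output

handle = layer.register_forward_hook(zero_last_feature)

x = torch.randn(2, 4)
for _ in range(3):  # the hook fires on every forward pass
    assert torch.all(layer(x)[:, -1] == 0)

handle.remove()  # removing the hook restores the original behavior
```

Removing the hook plays the same role that clear_edits() plays for in-place edits: the unmodified forward pass comes back.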
In-Place Editing¶
By default, model.edit() creates a new reference, leaving the original unchanged. If you want to modify the original model directly, use inplace=True.
with model.edit(inplace=True):
    model.transformer.h[-1].output[0][:, -1, :] = paris_hs
# Now the original model itself is modified
with model.trace("Vatican is in the city of"):
    tokens = model.lm_head.output.argmax(dim=-1).save()
print(f"In-place edited: {model.tokenizer.decode(tokens[0, -1])}")
In-place edited: Paris
Use inplace=True with caution
In-place edits affect all subsequent forward passes through the model. This includes any model.trace() calls. If you're experimenting, prefer the default (non-inplace) mode so the original model stays clean.
Clearing Edits¶
Use .clear_edits() to remove all in-place edits and restore the model to its original state.
model.clear_edits()
with model.trace("Vatican is in the city of"):
    tokens = model.lm_head.output.argmax(dim=-1).save()
print(f"After clear_edits(): {model.tokenizer.decode(tokens[0, -1])}")
After clear_edits(): Rome
Attaching Custom Modules¶
You can attach your own PyTorch modules to the model — such as a sparse autoencoder (SAE) or a LoRA adapter — and wire them into the forward pass with model.edit(). Once wired, the custom module is fully instrumented: you can access its .output in a trace just like any other module.
# Define a low-rank adapter
class Adapter(nn.Module):
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # Zero output at init, so the adapter has no effect yet

    def forward(self, x):
        return self.up(self.down(x))
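Before wiring the adapter in, it is easy to sanity-check in plain PyTorch that the zero-initialized up projection makes its contribution exactly zero. This standalone sketch re-declares the class above so it runs on its own:

```python
import torch
import torch.nn as nn

# Standalone copy of the Adapter above, just to verify the zero-init claim.
class Adapter(nn.Module):
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero up-projection => zero output

    def forward(self, x):
        return self.up(self.down(x))

adapter = Adapter(768, rank=16)
x = torch.randn(1, 10, 768)
out = adapter(x)
assert out.shape == x.shape       # same shape as the hidden states
assert out.abs().max() == 0       # contributes nothing until trained
```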
# Attach it to layer 5 (on the same device as the layer)
device = next(model.transformer.h[5]._module.parameters()).device
model.transformer.h[5].adapter = Adapter(768, rank=16).to(device)
# Wire the adapter into the forward pass:
# layer output += adapter(layer input)
with model.edit() as model_adapted:
    h5_input = model.transformer.h[5].inputs[0][0]
    adapter_out = model.transformer.h[5].adapter(h5_input, hook=True)
    model.transformer.h[5].output[0][:] = model.transformer.h[5].output[0] + adapter_out
With zero-initialized weights, the adapter has no effect:
with model.trace("The Eiffel Tower is in the city of"):
    orig_logits = model.lm_head.output.save()
with model_adapted.trace("The Eiffel Tower is in the city of"):
    adapter_result = model.transformer.h[5].adapter.output.save()
    edited_logits = model.lm_head.output.save()
print(f"Adapter output shape: {adapter_result.shape}")
print(f"Adapter norm: {adapter_result.norm():.4f}")
print(f"Original prediction: {model.tokenizer.decode(orig_logits[0, -1].argmax(dim=-1))}")
print(f"Adapted prediction: {model.tokenizer.decode(edited_logits[0, -1].argmax(dim=-1))}")
Adapter output shape: torch.Size([1, 10, 768])
Adapter norm: 0.0000
Original prediction: Paris
Adapted prediction: Paris
After training or modifying the adapter weights, it changes the model's behavior:
# Simulate trained weights by setting non-zero values
nn.init.normal_(model.transformer.h[5].adapter._module.up.weight, std=10.0)
with model_adapted.trace("The Eiffel Tower is in the city of"):
    adapter_result = model.transformer.h[5].adapter.output.save()
    adapted_logits = model.lm_head.output.save()
print(f"Adapter output norm: {adapter_result.norm():.4f}")
print(f"Adapted prediction: {model.tokenizer.decode(adapted_logits[0, -1].argmax(dim=-1))}")
Adapter output norm: 67712.0547
Adapted prediction: ichick
You can also intervene on the adapter itself during a trace — for example, zeroing out its contribution:
with model_adapted.trace("The Eiffel Tower is in the city of"):
    # Zero out the adapter's output, neutralizing it for this run
    model.transformer.h[5].adapter.output[:] = 0
    logits = model.lm_head.output.save()
print(f"Adapter zeroed: {model.tokenizer.decode(logits[0, -1].argmax(dim=-1))}")
Adapter zeroed: Paris
When to use editing vs tracing
- Use model.trace() for one-off interventions during a single forward pass
- Use model.edit() when you need the same intervention applied repeatedly across many forward passes
- Use model.edit() to wire custom modules (adapters, SAEs, probes) into the model's forward pass, making them automatically instrumented for tracing