Info
Last Execution: 2026-02-17
| Package | Version |
|---|---|
| nnsight | 0.5.15 |
| Python | 3.12.3 |
| torch | 2.10.0+cu128 |
| transformers | 5.2.0 |
Setting Activations¶
Setting is how you intervene on a model by editing activations as they flow through the network. This is the basis of techniques like activation patching, ablation, and steering.
Setup¶
from nnsight import LanguageModel
model = LanguageModel("openai-community/gpt2", device_map="auto", dispatch=True)
In-Place Setting¶
Use slice assignment to modify a tensor's values in-place. The original tensor object is mutated, so downstream modules see the change immediately.
with model.trace("The Eiffel Tower is in the city of"):
    # Clone before the edit so we can compare
    before = model.transformer.h[0].output[0].clone().save()
    # Zero out all activations at layer 0
    model.transformer.h[0].output[0][:] = 0
    after = model.transformer.h[0].output[0].save()

print("Before:", before[0, 0, :5])
print("After: ", after[0, 0, :5])
Before: tensor([ 0.1559, -0.7946, 0.3943, 0.3413, -0.5653], device='cuda:0',
grad_fn=<SliceBackward0>)
After: tensor([0., 0., 0., 0., 0.], device='cuda:0', grad_fn=<SliceBackward0>)
Clone before saving in-place modifications
When modifying in-place, the saved reference and the modified tensor point to the same memory. If you want to capture the "before" state, call .clone() before the modification.
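The aliasing pitfall can be seen in plain torch, independent of nnsight: a saved reference and a clone behave differently once the tensor is edited in-place.

```python
import torch

# A reference shares storage with the original tensor, so an in-place
# edit shows up in both; a clone is an independent copy.
x = torch.ones(3)
ref = x           # same underlying storage as x
snap = x.clone()  # independent copy
x[:] = 0          # in-place edit

print(ref)   # tensor([0., 0., 0.]) -- aliased, sees the edit
print(snap)  # tensor([1., 1., 1.]) -- clone preserved the "before" state
```

This is exactly why the example above calls `.clone()` before zeroing the activations.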
Replacement Setting¶
Assign a completely new tensor to a module's output. This replaces the tensor object rather than mutating it, so a previously saved reference still points to the old tensor and there is no need to .clone() when comparing before/after.
with model.trace("The Eiffel Tower is in the city of"):
    before = model.transformer.wte.output.save()
    # Replace the embedding output with a scaled version
    model.transformer.wte.output = model.transformer.wte.output * 0.5
    after = model.transformer.wte.output.save()

print("Before:", before[0, 0, :5])
print("After: ", after[0, 0, :5])
Before: tensor([-0.0686, -0.0203, 0.0645, -0.0621, -0.1135], device='cuda:0',
grad_fn=<SliceBackward0>)
After: tensor([-0.0343, -0.0101, 0.0322, -0.0310, -0.0568], device='cuda:0',
grad_fn=<SliceBackward0>)
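The replacement semantics have a plain-torch analogue: rebinding a name to a new tensor leaves the original object untouched, which is why no `.clone()` was needed here.

```python
import torch

# Rebinding creates a new tensor object; `before` still points at the old one.
x = torch.ones(3)
before = x   # reference to the original object
x = x * 0.5  # new object; the original is not mutated

print(before)  # tensor([1., 1., 1.])
print(x)       # tensor([0.5000, 0.5000, 0.5000])
```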
Setting Specific Positions¶
You can target specific batch items, token positions, or hidden dimensions using standard indexing.
with model.trace("The Eiffel Tower is in the city of"):
    # Zero out only the last token position at layer 5
    model.transformer.h[5].output[0][:, -1, :] = 0
    output = model.lm_head.output.save()

predicted = model.tokenizer.decode(output[0, -1].argmax(dim=-1))
print(f"Predicted next token (after zeroing last position at layer 5): {predicted}")
Predicted next token (after zeroing last position at layer 5): ,
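The same indexing patterns work on any (batch, seq_len, hidden) activation tensor. A quick pure-torch sketch (the shapes here are illustrative stand-ins, not GPT-2's real sizes):

```python
import torch

# Illustrative activation tensor: 2 batch items, 5 tokens, 8 hidden dims.
acts = torch.zeros(2, 5, 8)
acts[:, -1, :] = 1.0  # last token position, every batch item
acts[1, 0, :] = 2.0   # first token of the second batch item only
acts[:, :, :4] = 3.0  # first four hidden dimensions, everywhere

print(acts[0, -1, 4:])  # tensor([1., 1., 1., 1.]) -- untouched by the last write
```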
Adding a Steering Vector¶
A common intervention is adding a direction vector to activations to steer model behavior.
import torch

with model.trace("The Eiffel Tower is in the city of"):
    hidden = model.transformer.h[6].output[0]
    # Create a random steering vector on the same device
    steering = torch.randn(hidden.shape[-1], device=hidden.device) * 10
    # Add it to the last token position
    model.transformer.h[6].output[0][:, -1, :] += steering
    output = model.lm_head.output.save()

predicted = model.tokenizer.decode(output[0, -1].argmax(dim=-1))
print(f"Predicted next token (after steering): {predicted}")
Predicted next token (after steering): all
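In practice, steering vectors are usually derived rather than random, e.g. as a difference of mean activations between two sets of prompts. A sketch of that construction, using random tensors as stand-ins for real cached activations (the shapes and the magnitude 10 are assumptions for illustration):

```python
import torch

# Stand-ins for cached layer activations from two contrasting prompt sets,
# each of shape (num_prompts, hidden_dim). GPT-2's hidden_dim is 768.
pos_acts = torch.randn(10, 768)
neg_acts = torch.randn(10, 768)

# Difference-of-means direction, rescaled to a chosen magnitude.
steering = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
steering = steering / steering.norm() * 10

print(steering.shape)  # torch.Size([768])
```

The resulting vector would then be added to activations exactly as in the example above.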
Handling Tuple Outputs¶
Some modules (like GPT-2 transformer blocks) return tuples. Use slice assignment on the element you want to modify, or rebuild the tuple for replacement.
with model.trace("The Eiffel Tower is in the city of"):
    # In-place: modify the first element of the tuple
    model.transformer.h[0].output[0][:] = 0
    output = model.transformer.h[0].output.save()

print(f"Output is a {type(output).__name__} with {len(output)} elements")
Output is a tuple with 1 elements
Tuple assignment
model.layer.output[0][:] = 0 modifies the tensor inside the tuple in-place. To replace a tuple element entirely, reassign the full tuple: model.layer.output = (new_tensor,) + model.layer.output[1:].
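The rebuild step works because Python tuples are immutable: replacing an element means constructing a new tuple. A minimal sketch of the same pattern in plain torch:

```python
import torch

# "Replacing" a tuple element means building a new tuple, mirroring
# model.layer.output = (new_tensor,) + model.layer.output[1:].
out = (torch.ones(2, 4), torch.zeros(2, 4))
new_first = torch.full((2, 4), 7.0)
out = (new_first,) + out[1:]

print(out[0][0, 0].item())  # 7.0
print(len(out))             # 2 -- remaining elements are preserved
```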
Verifying Downstream Effects¶
Setting an early layer's output affects all downstream computations. Here we compare the final logits with and without an intervention.
with model.trace("The Eiffel Tower is in the city of"):
    clean_logits = model.lm_head.output.clone().save()

with model.trace("The Eiffel Tower is in the city of"):
    model.transformer.h[0].output[0][:] = 0
    modified_logits = model.lm_head.output.save()

diff = (clean_logits - modified_logits).abs().mean()
print(f"Mean absolute difference in logits: {diff:.4f}")
print(f"Clean prediction: {model.tokenizer.decode(clean_logits[0, -1].argmax(dim=-1))}")
print(f"Modified prediction: {model.tokenizer.decode(modified_logits[0, -1].argmax(dim=-1))}")
Mean absolute difference in logits: 59.8508
Clean prediction: Paris
Modified prediction: ,
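Beyond a mean absolute logit difference, the KL divergence between the two next-token distributions is a common way to quantify how much an intervention shifts the model's predictions. A self-contained sketch using random logits as stand-ins for `clean_logits[0, -1]` and `modified_logits[0, -1]`:

```python
import torch
import torch.nn.functional as F

# Stand-ins for next-token logits over GPT-2's 50257-token vocabulary.
clean = torch.randn(50257)
modified = torch.randn(50257)

# KL(clean || modified): F.kl_div expects log-probs as input, probs as target.
kl = F.kl_div(F.log_softmax(modified, dim=-1),
              F.softmax(clean, dim=-1),
              reduction="sum")
print(f"KL divergence: {kl.item():.4f}")
```

Unlike the raw logit difference, KL is invariant to the constant logit shifts that softmax ignores.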