Interpretability for Neural Networks
NNsight (/ɛn.saɪt/) is a package for interpreting and manipulating the internals of deep learning models.
What is NNsight?
NNsight is a Python library that enables interpreting and intervening on the internals of deep learning models. It provides a clean, Pythonic interface for:
- Accessing activations at any layer during forward passes
- Modifying activations to study causal effects
- Computing gradients with respect to intermediate values
- Batching interventions across multiple inputs efficiently
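To make the first two ideas concrete: conceptually, accessing and modifying activations corresponds to what you would otherwise do with PyTorch forward hooks, which NNsight abstracts away. A minimal plain-PyTorch sketch (using a hypothetical toy model, not NNsight's API):

```python
import torch
import torch.nn as nn

# Hypothetical two-layer toy model standing in for a real network.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

captured = {}

def hook(module, inputs, output):
    # Save the activation, then replace it with zeros
    # to study the causal effect on downstream computation.
    captured["act"] = output.detach().clone()
    return torch.zeros_like(output)

handle = model[0].register_forward_hook(hook)
x = torch.randn(1, 4)
out = model(x)  # layer 0's output is replaced by zeros mid-forward
handle.remove()

print(captured["act"].shape)  # activation recorded at layer 0: (1, 8)
```

Because the zeroed activation passes through ReLU unchanged, the final output here reduces to the last layer's bias, which is exactly the kind of causal ablation the hook enables.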
Originally developed by the NDIF team at Northeastern University, NNsight supports local execution on any PyTorch model and remote execution on large models via the NDIF infrastructure.
What does that look like?
Install NNsight:

pip install nnsight
Intervene:
from nnsight import LanguageModel
model = LanguageModel('openai-community/gpt2', device_map='auto', dispatch=True)
with model.trace('The Eiffel Tower is in the city of', remote=False):  # set remote=True to run on NDIF
    # Intervene: zero out the first transformer block's hidden states
    model.transformer.h[0].output[0][:] = 0
    # Access and save the final layer's hidden states
    hidden_states = model.transformer.h[-1].output[0].save()
    # Save the model output
    output = model.output.save()

print(model.tokenizer.decode(output.logits.argmax(dim=-1)[0]))
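The gradient capability listed earlier rests on the same autograd mechanics as plain PyTorch. A hedged sketch (toy tensors, not NNsight's API) of computing a gradient with respect to an intermediate value:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
hidden = x * 2          # intermediate value, analogous to a hidden state
hidden.retain_grad()    # keep its gradient after backward
loss = (hidden ** 2).sum()
loss.backward()

print(hidden.grad)  # d(loss)/d(hidden) = 2 * hidden = [4., 8., 12.]
```

NNsight exposes this same machinery through its tracing interface, so gradients at any layer can be saved alongside activations.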