Blog

July 13, 2026
in Guide
16 min read

NNsight × vLLM: Interpretability at Production Scale

The problem

There is growing demand for deploying interpretability at production scale: harvesting activations from frontier models for SAE training, steering and probing behavior during live serving, and studying the internals of models with hundreds of billions of parameters. Hugging Face Transformers is the natural starting point for interpretability work. At modest scale it works fine, but under production workloads it falls short: it has weak support for distributed serving and lacks the modern inference optimizations those workloads need. Production engines like vLLM and SGLang fill both gaps, with distributed worker pools (multi-dimensional parallelism, multi-node deployment) and modern inference optimizations (paged KV caches, continuous batching, request scheduling, fused kernels). All of these make interpretability substantially harder than on Transformers: the engine drives the forward pass, a single module's output can be sharded across GPUs and nodes, and the decode hot path runs through optimized implementations, such as CUDA graphs or compiled code, that Python hooks cannot access.

April 30, 2026
in Guide
10 min read

How to make your NDIF experiment 130x faster

A user recently reached out to me asking how they could make their nnsight code run faster on NDIF to meet a project deadline. After looking at their code, I introduced a number of improvements that leverage nnsight features and remote execution principles. The result was a 130x speedup.

The experience taught me many useful lessons, and I want to share the key principles with you so that you, too, can implement your experiments optimally for remote execution.

TL;DR

If you're doing more than one forward pass, wrap them in a model.session
Downloading large tensors can be costly; only .save() what you need
Cache all your activations in one go
Reduce loops with Batching Invokes
.skip what you can
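To get a feel for why saving only what you need matters, here is a rough back-of-the-envelope sketch in plain Python. The model dimensions, batch size, and 2-byte element size are illustrative assumptions, not figures from the post, and `tensor_bytes` is a hypothetical helper rather than part of nnsight.

```python
import math

def tensor_bytes(*shape, bytes_per_element=2):
    """Size in bytes of a dense tensor with the given shape (2 bytes ~ fp16/bf16)."""
    return math.prod(shape) * bytes_per_element

# Hypothetical experiment dimensions, chosen only for illustration.
layers, batch, seq_len, d_model = 32, 8, 512, 4096

# Saving every hidden state at every layer and every token position.
full = tensor_bytes(layers, batch, seq_len, d_model)

# Saving only the final token position at every layer.
last_token = tensor_bytes(layers, batch, d_model)

print(f"all positions: {full / 2**20:.0f} MiB")        # all positions: 1024 MiB
print(f"last position: {last_token / 2**20:.0f} MiB")  # last position: 2 MiB
print(f"reduction:     {full // last_token}x")         # reduction:     512x
```

Under these assumed sizes, slicing before saving shrinks the download by a factor equal to the sequence length, which is the kind of saving the second point in the list is after.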

April 16, 2026
in Guide
16 min read

Extending NNsight: From Custom Envoys to Your Own Model Class

By Jaden Fiotto-Kaufman

NNsight works out of the box on any torch.nn.Module. Wrap it, open a trace, read .output, save it. For a lot of interpretability work, that's all you need.

But the longer you spend doing this work, the more patterns you notice. You catch yourself writing the same six-line projection chain for every layer of a logit-lens sweep. You reshape attention heads in every single notebook. You wrap a model that isn't on HuggingFace and discover you now have to rebuild tokenization, batching, and generation by hand. You start wanting NNsight to speak your model's vocabulary.

NNsight is designed to be extended at exactly these points. This post is a cookbook of the extension surface — from lightweight per-module conveniences all the way down to custom execution backends. Pick the cheapest primitive that solves your problem; don't reach for a custom backend when a three-line eproperty will do.

March 24, 2026
in Ecosystem
7 min read

Calling all Lies

AI Deception Is More Than Just Getting Facts Wrong

Models can lie about what they're capable of, fabricate plausible-sounding details under social pressure, strategically blend truth with fiction, or dodge questions they can't answer honestly. NDIF and Cadenza Labs are hosting a competition to study how models lie and are looking for red teams to create scenarios where models contradict their own beliefs (RFP: https://cadenza-labs.github.io/red-team-rfp/).

This competition is inspired by Liars’ Bench from Cadenza Labs. Their benchmark of over 72,000 labeled examples organizes LLM lies along two key dimensions: what the model lies about (world knowledge, its own capabilities, its actions, its policies) and why it lies (inherent behavioral patterns vs. context-driven pressure). The comprehensive benchmark spans from simple factual falsehoods to subtle introspective lies, and should serve as a starting point for red team scenario design.

In response to our RFP, we want red teams to cultivate a variety of creative and diverse deception scenarios, which blue teams will then use to build novel and robust lie detector method(s). In this blog post, we walk through an example red team project to inspire creative proposal submissions.

February 26, 2026
in Release
13 min read

Introducing NNsight 0.6

By Jaden Fiotto-Kaufman

NNsight is releasing its sixth major version, focused on addressing user feedback about common hurdles with the library.

Wait, What is NNsight Again?

If you are a new user or you haven't used NNsight in a while, here's a small refresher! NNsight is a Python library for interpreting and intervening on the internals of PyTorch models. You wrap a model, open a tracing context, and read or write activations at any layer. While the tool supports any PyTorch model, we also provide first-class support for popular architectures such as transformers and diffusers.

from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto", dispatch=True)

with model.trace("The Eiffel Tower is in"):
    # Read the hidden states at layer 5
    hidden = model.transformer.h[5].output[0].save()

    # Zero out the MLP output at layer 0
    model.transformer.h[0].mlp.output[:] = 0

Under the hood, NNsight uses deferred execution. When you enter with model.trace(...), your code AST is extracted, compiled into a function, and run in a worker thread. When that thread accesses .output, it waits until the model's forward pass reaches the selected module, extracting the desired output tensor through a PyTorch hook. This means your intervention code is fully aligned with the forward pass—no proxies, no fake tensors. You're working with real PyTorch values!
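The waiting behavior described above can be illustrated with a toy sketch that uses only the standard library. This is not NNsight's implementation: the "model" is a list of plain functions, and `run_with_intervention`, `output`, and `saved` are hypothetical names invented for the example. It only shows the general idea of an intervention thread that blocks until the forward pass reaches the layer it asked for.

```python
import queue
import threading

_MISSED = object()  # sentinel: the requested layer was never reached

def run_with_intervention(layers, x, intervention):
    """Run `layers` in order while `intervention` runs in a worker thread."""
    requests = queue.Queue()  # worker -> forward pass: layer index wanted, or None
    replies = queue.Queue()   # forward pass -> worker: that layer's output
    saved = {}

    def output(idx):
        # Called from the worker: block until the forward pass reaches `idx`.
        requests.put(idx)
        value = replies.get()
        if value is _MISSED:
            raise LookupError(f"layer {idx} already ran or does not exist")
        return value

    def worker():
        try:
            intervention(output, saved)
        finally:
            requests.put(None)  # tell the forward pass there are no more requests

    thread = threading.Thread(target=worker)
    thread.start()

    pending = requests.get()  # first layer the worker is waiting on (or None)
    for idx, layer in enumerate(layers):
        x = layer(x)
        # Acts like a hook: hand over the real value, then wait for the next request.
        while pending == idx:
            replies.put(x)
            pending = requests.get()
    if pending is not None:
        replies.put(_MISSED)  # unblock a worker that asked for a layer out of order
    thread.join()
    return x, saved

# Toy "model": three layers applied in sequence.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]

def intervention(output, saved):
    saved["h1"] = output(1)  # blocks until layer 1 has produced its output

final, saved = run_with_intervention(layers, 5, intervention)
print(final, saved)  # 9 {'h1': 12}
```

The intervention receives the actual value computed by layer 1, not a placeholder, which mirrors the "no proxies, no fake tensors" point in the paragraph above.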

February 26, 2026
in Ecosystem
4 min read

NNterp Integration

We welcome nnterp to the NDIF ecosystem! The nnterp library is built on top of nnsight, providing a standardized transformer architecture for many LLMs and implementations of common interpretability techniques. Let's explore how it works!