Info
Last Execution: 2026-02-17
| Package | Version |
|---|---|
| nnsight | 0.5.15 |
| Python | 3.12.3 |
| torch | 2.10.0+cu128 |
| transformers | 5.2.0 |
Accessing Intermediate Operations
.output and .input let you hook into a module's inputs and outputs. But what about the operations inside a module's forward pass? .source lets you access any intermediate operation — function calls, method calls, tensor operations — within a module's forward method.
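To see why this matters, note that a plain PyTorch forward hook on a module only exposes the module's final output, never the tensors computed partway through its `forward`. A minimal sketch in plain `torch` (no nnsight):

```python
import torch
from torch import nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 8)
        self.act = nn.GELU()

    def forward(self, x):
        h = self.fc(x)       # intermediate value: invisible to hooks on TinyMLP
        return self.act(h)

mlp = TinyMLP()
seen = []
mlp.register_forward_hook(lambda mod, inp, out: seen.append(out))
_ = mlp(torch.randn(1, 4))

# The hook captured only the post-activation output; the pre-activation
# tensor `h` never surfaced. `.source` exists to reach values like `h`.
print(len(seen), tuple(seen[0].shape))  # -> 1 (1, 8)
```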
Setup
from nnsight import LanguageModel
model = LanguageModel("openai-community/gpt2", device_map="auto", dispatch=True)
Discovering Operations
Print .source on any module to see its forward method with all hookable operations labeled.
print(model.transformer.h[0].mlp.source)
* def forward(self, hidden_states: tuple[torch.FloatTensor] | None) -> torch.FloatTensor:
self_c_fc_0 -> 0 hidden_states = self.c_fc(hidden_states)
self_act_0 -> 1 hidden_states = self.act(hidden_states)
self_c_proj_0 -> 2 hidden_states = self.c_proj(hidden_states)
self_dropout_0 -> 3 hidden_states = self.dropout(hidden_states)
4 return hidden_states
5
Each labeled line (like self_c_fc_0, self_act_0) is an operation you can access inside a trace. The number suffix distinguishes multiple calls to the same function.
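The naming pattern can be mimicked with a small counter: dots in the call target become underscores, and each repeated target gets the next index. This is an illustrative sketch of the scheme, not nnsight's actual implementation:

```python
from collections import defaultdict

def label_calls(call_names):
    """Assign labels in the style shown above: dotted name with dots
    replaced by underscores, plus a zero-based per-name counter."""
    counts = defaultdict(int)
    labels = []
    for name in call_names:
        base = name.replace(".", "_")
        labels.append(f"{base}_{counts[base]}")
        counts[base] += 1
    return labels

# Two calls to the same function get distinct suffixes
print(label_calls(["self.c_fc", "self.act", "transpose", "transpose"]))
# -> ['self_c_fc_0', 'self_act_0', 'transpose_0', 'transpose_1']
```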
Larger modules have more operations. Here's the attention module:
print(model.transformer.h[0].attn.source)
* def forward(
0 self,
1 hidden_states: tuple[torch.FloatTensor] | None,
2 past_key_values: Cache | None = None,
3 cache_position: torch.LongTensor | None = None,
4 attention_mask: torch.FloatTensor | None = None,
5 encoder_hidden_states: torch.Tensor | None = None,
6 encoder_attention_mask: torch.FloatTensor | None = None,
7 output_attentions: bool | None = False,
8 **kwargs,
9 ) -> tuple[torch.Tensor | tuple[torch.Tensor], ...]:
10 is_cross_attention = encoder_hidden_states is not None
11 if past_key_values is not None:
isinstance_0 -> 12 if isinstance(past_key_values, EncoderDecoderCache):
past_key_values_is_updated_get_0 -> 13 is_updated = past_key_values.is_updated.get(self.layer_idx)
14 if is_cross_attention:
15 # after the first generated id, we can subsequently re-use all key/value_layer from cache
16 curr_past_key_values = past_key_values.cross_attention_cache
17 else:
18 curr_past_key_values = past_key_values.self_attention_cache
19 else:
20 curr_past_key_values = past_key_values
21
22 if is_cross_attention:
hasattr_0 -> 23 if not hasattr(self, "q_attn"):
ValueError_0 -> 24 raise ValueError(
25 "If class is used as cross attention, the weights `q_attn` have to be defined. "
26 "Please make sure to instantiate class with `GPT2Attention(..., is_cross_attention=True)`."
27 )
self_q_attn_0 -> 28 query_states = self.q_attn(hidden_states)
29 attention_mask = encoder_attention_mask
30
31 # Try to get key/value states from cache if possible
32 if past_key_values is not None and is_updated:
33 key_states = curr_past_key_values.layers[self.layer_idx].keys
34 value_states = curr_past_key_values.layers[self.layer_idx].values
35 else:
self_c_attn_0 -> 36 key_states, value_states = self.c_attn(encoder_hidden_states).split(self.split_size, dim=2)
split_0 -> + ...
37 shape_kv = (*key_states.shape[:-1], -1, self.head_dim)
key_states_view_0 -> 38 key_states = key_states.view(shape_kv).transpose(1, 2)
transpose_0 -> + ...
value_states_view_0 -> 39 value_states = value_states.view(shape_kv).transpose(1, 2)
transpose_1 -> + ...
40 else:
self_c_attn_1 -> 41 query_states, key_states, value_states = self.c_attn(hidden_states).split(self.split_size, dim=2)
split_1 -> + ...
42 shape_kv = (*key_states.shape[:-1], -1, self.head_dim)
key_states_view_1 -> 43 key_states = key_states.view(shape_kv).transpose(1, 2)
transpose_2 -> + ...
value_states_view_1 -> 44 value_states = value_states.view(shape_kv).transpose(1, 2)
transpose_3 -> + ...
45
46 shape_q = (*query_states.shape[:-1], -1, self.head_dim)
query_states_view_0 -> 47 query_states = query_states.view(shape_q).transpose(1, 2)
transpose_4 -> + ...
48
49 if (past_key_values is not None and not is_cross_attention) or (
50 past_key_values is not None and is_cross_attention and not is_updated
51 ):
52 # save all key/value_layer to cache to be re-used for fast auto-regressive generation
53 cache_position = cache_position if not is_cross_attention else None
curr_past_key_values_update_0 -> 54 key_states, value_states = curr_past_key_values.update(
55 key_states, value_states, self.layer_idx, {"cache_position": cache_position}
56 )
57 # set flag that curr layer for cross-attn is already updated so we can re-use in subsequent calls
58 if is_cross_attention:
59 past_key_values.is_updated[self.layer_idx] = True
60
61 using_eager = self.config._attn_implementation == "eager"
ALL_ATTENTION_FUNCTIONS_get_interface_0 -> 62 attention_interface: Callable = ALL_ATTENTION_FUNCTIONS.get_interface(
63 self.config._attn_implementation, eager_attention_forward
64 )
65
66 if using_eager and self.reorder_and_upcast_attn:
self__upcast_and_reordered_attn_0 -> 67 attn_output, attn_weights = self._upcast_and_reordered_attn(
68 query_states, key_states, value_states, attention_mask
69 )
70 else:
attention_interface_0 -> 71 attn_output, attn_weights = attention_interface(
72 self,
73 query_states,
74 key_states,
75 value_states,
76 attention_mask,
77 dropout=self.attn_dropout.p if self.training else 0.0,
78 **kwargs,
79 )
80
attn_output_reshape_0 -> 81 attn_output = attn_output.reshape(*attn_output.shape[:-2], -1).contiguous()
contiguous_0 -> + ...
self_c_proj_0 -> 82 attn_output = self.c_proj(attn_output)
self_resid_dropout_0 -> 83 attn_output = self.resid_dropout(attn_output)
84
85 return attn_output, attn_weights
86
Getting an Intermediate Value
Access any labeled operation's .output inside a trace — just like you would with a module.
with model.trace("The Eiffel Tower is in the city of"):
# Get the hidden states after the GELU activation (before projection)
post_gelu = model.transformer.h[0].mlp.source.self_act_0.output.save()
print(f"Post-GELU shape: {post_gelu.shape}")
Post-GELU shape: torch.Size([1, 10, 3072])
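The 3072 in that shape is GPT-2's MLP expansion: `c_fc` widens the 768-dimensional hidden state by a factor of 4 before `c_proj` maps it back down. A quick sanity check of the arithmetic:

```python
n_embd = 768            # GPT-2 hidden size
batch, seq_len = 1, 10  # one prompt of 10 tokens
inner = 4 * n_embd      # c_fc expands 768 -> 3072

print((batch, seq_len, inner))  # -> (1, 10, 3072)
```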
How source works
When you access .source, nnsight rewrites the module's forward method to inject hooks into every operation. Each operation gets .input, .inputs, and .output properties — just like a regular module.
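Conceptually, the injection is like wrapping each operation so its result passes through a recording layer. A toy version using a plain Python wrapper, which only mimics the idea (nnsight actually rewrites the forward source rather than patching attributes):

```python
import torch
from torch import nn

def hooked(fn, record, key):
    """Wrap a callable so every call logs its output under `key`."""
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        record[key] = out
        return out
    return wrapper

mlp = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 4))
record = {}
# Wrap the GELU's forward, mimicking what a self_act_0 hook exposes
mlp[1].forward = hooked(mlp[1].forward, record, "self_act_0")
_ = mlp(torch.randn(2, 4))

print(tuple(record["self_act_0"].shape))  # -> (2, 8)
```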
Setting an Intermediate Value
You can also modify intermediate values. Here we zero out the MLP's post-GELU activations at layer 11 to see how it affects the prediction:
with model.trace("The Eiffel Tower is in the city of"):
normal_logits = model.lm_head.output.save()
with model.trace("The Eiffel Tower is in the city of"):
# Zero the MLP's GELU output at layer 11
model.transformer.h[11].mlp.source.self_act_0.output[:] = 0
modified_logits = model.lm_head.output.save()
print(f"Normal: {model.tokenizer.decode(normal_logits[0, -1].argmax(dim=-1))}")
print(f"Zeroed GELU: {model.tokenizer.decode(modified_logits[0, -1].argmax(dim=-1))}")
Normal: Paris
Zeroed GELU: London
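The same style of zero-ablation can be reproduced with a plain PyTorch forward hook, whose non-None return value replaces the module's output. A minimal sketch on a toy model, not the GPT-2 run above:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 3))
x = torch.randn(1, 4)

normal = model(x)

# Returning a tensor from a forward hook replaces the module's output,
# so this zeroes the post-GELU activations, like `.output[:] = 0` above.
handle = model[1].register_forward_hook(lambda mod, inp, out: torch.zeros_like(out))
ablated = model(x)
handle.remove()

# With the GELU output zeroed, only the final layer's bias survives
print(torch.allclose(ablated, model[2].bias))  # -> True
```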
Patching Between Layers
Transfer an intermediate value from one layer to another:
with model.trace("The Eiffel Tower is in the city of"):
# Capture layer 0's post-GELU activations
gelu_0 = model.transformer.h[0].mlp.source.self_act_0.output
# Patch them into layer 5
model.transformer.h[5].mlp.source.self_act_0.output = gelu_0
logits = model.lm_head.output.save()
print(f"Patched MLP prediction: {model.tokenizer.decode(logits[0, -1].argmax(dim=-1))}")
Patched MLP prediction: London
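Outside nnsight, the same capture-then-overwrite pattern can be built from a pair of forward hooks: one records a module's output, the other substitutes it into a second module. A toy sketch:

```python
import torch
from torch import nn

torch.manual_seed(0)
# Two modules with matching shapes, standing in for MLPs at different depths
layer_a = nn.GELU()
layer_b = nn.GELU()

captured = {}
# dict.update returns None, so this hook records without modifying the output
layer_a.register_forward_hook(lambda m, i, o: captured.update(src=o))
# Returning a tensor replaces layer_b's output with the captured value
layer_b.register_forward_hook(lambda m, i, o: captured["src"])

out_a = layer_a(torch.randn(1, 8))
out_b = layer_b(torch.randn(1, 8))   # different input, but output is patched

print(torch.equal(out_b, out_a))  # -> True
```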
Recursive Source Tracing
.source works recursively. If a labeled operation calls another function, you can chain .source to trace into it. For example, the attention interface function internally calls scaled_dot_product_attention:
with model.trace("The Eiffel Tower is in the city of"):
# Trace into the attention interface → access the SDPA call's input (the query tensor)
sdpa = model.transformer.h[0].attn.source.attention_interface_0.source
query = sdpa.torch_nn_functional_scaled_dot_product_attention_0.input.save()
print(f"Query shape: {query.shape}") # [batch, heads, seq_len, head_dim]
Query shape: torch.Size([1, 12, 10, 64])
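The `[1, 12, 10, 64]` shape comes from splitting GPT-2's 768-dimensional hidden state into 12 heads of 64 dimensions each, then swapping the sequence and head axes. This is the same `view(...).transpose(1, 2)` pattern labeled `query_states_view_0` in the listing above, reproduced here in plain `torch`:

```python
import torch

batch, seq_len, n_embd = 1, 10, 768
n_heads, head_dim = 12, 64   # 12 * 64 == 768

query = torch.randn(batch, seq_len, n_embd)
# Same reshape as query_states.view(shape_q).transpose(1, 2) in the source listing
query = query.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

print(tuple(query.shape))  # -> (1, 12, 10, 64)
```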
Viewing a specific operation
Print a specific operation to see it highlighted in its surrounding context:
print(model.transformer.h[0].attn.source.self_c_proj_0)
This shows the operation with an arrow (-->) and surrounding lines for context.
Don't chain .source through another .source on submodules
If a .source listing shows a submodule call (like self.c_proj), access it directly on the model — not through .source:
# Wrong — don't chain through .source
model.transformer.h[0].attn.source.self_c_proj_0.source
# Correct — access the submodule directly
model.transformer.h[0].attn.c_proj.source