Dictionary Learning#

📗 You can find an interactive Colab version of this tutorial here.

Polysemanticity#

The field of mechanistic interpretability focuses on understanding individual components of neural networks. However, many neurons in these networks respond to multiple (seemingly unrelated) inputs – a phenomenon called polysemanticity. For example, a single neuron might respond to both images of car tires and images of rubber ducks.

Although polysemanticity may help networks fit as many features as possible into a given parameter space, it makes it more difficult for humans to interpret the network’s actions. There are a few strategies for finding monosemantic features, but in this tutorial we will explore the use of sparse autoencoders. If you are interested in learning more, this idea is explored by Anthropic in Towards Monosemanticity and Scaling Monosemanticity.

Sparse Autoencoders#

Sparse autoencoders (SAEs) are algorithms that extract learned features from a trained model. SAEs are a form of dictionary learning, which finds a sparse representation of input data in a high-dimensional feature space. These features can serve as a more focused and monosemantic unit of analysis than the model’s individual neurons, helping address polysemanticity and enabling a clearer, more interpretable understanding of model behavior.
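To make this concrete, below is a minimal sketch of a sparse autoencoder in PyTorch. This is an illustrative toy, not the implementation used later in this tutorial (the repository’s version differs in details such as bias handling): activations are encoded into a larger feature space with a ReLU, decoded back, and trained with a reconstruction loss plus an L1 penalty that pushes most features to zero.

[ ]:
import torch
import torch.nn as nn

class ToySparseAutoencoder(nn.Module):
    """Illustrative SAE: map activations into an overcomplete feature
    space (dictionary_size >> activation_dim), then reconstruct them."""

    def __init__(self, activation_dim, dictionary_size):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dictionary_size)
        self.decoder = nn.Linear(dictionary_size, activation_dim, bias=False)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse, non-negative features
        x_hat = self.decoder(f)          # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features;
    # the penalty encourages most feature activations to be exactly zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()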

📚 This tutorial is adapted from work by Samuel Marks and Aaron Mueller (See their GitHub Repository and Alignment Forum post). They created this repository as a resource for dictionary learning via sparse autoencoders on neural network activations, using Anthropic’s approach detailed here.

Here, we will use one of their pre-trained autoencoders to explore how it produces an easily interpretable, monosemantic relationship between tokens and feature activations.

Setup#

[ ]:
from IPython.display import clear_output

# Install nnsight and the dictionary_learning repository
!pip install nnsight
!git clone https://github.com/saprmarks/dictionary_learning
%cd dictionary_learning
!pip install -r requirements.txt
clear_output()
[ ]:
from nnsight import LanguageModel
from dictionary_learning.dictionary import AutoEncoder
import torch
[ ]:
# Load pretrained autoencoder
!./pretrained_dictionary_downloader.sh
clear_output()

weights_path = "./dictionaries/pythia-70m-deduped/mlp_out_layer0/10_32768/ae.pt"
activation_dim = 512 # dimension of the NN's activations to be autoencoded
dictionary_size = 64 * activation_dim # number of features in the dictionary

ae = AutoEncoder(activation_dim, dictionary_size)
ae.load_state_dict(torch.load(weights_path, weights_only=True))
ae.cuda()
AutoEncoder(
  (encoder): Linear(in_features=512, out_features=32768, bias=True)
  (decoder): Linear(in_features=32768, out_features=512, bias=False)
)
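
Before applying the SAE to real activations, a quick sanity check (our own addition, assuming the repository’s `AutoEncoder` exposes a `decode` method alongside the `encode` method used below) is to round-trip a random batch through it and confirm the shapes line up:

[ ]:
# Sanity check: push random vectors through the loaded autoencoder.
# Expect (batch, dictionary_size) for the features and
# (batch, activation_dim) for the reconstruction. (Random inputs won't
# reconstruct well, since the SAE was trained on real model activations,
# but this confirms the weights loaded and the calls work.)
x = torch.randn(4, activation_dim).cuda()
f = ae.encode(x)
x_hat = ae.decode(f)
print(f.shape, x_hat.shape)  # torch.Size([4, 32768]) torch.Size([4, 512])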

Apply SAE#

[ ]:
model = LanguageModel("EleutherAI/pythia-70m-deduped", device_map="auto")
tokenizer = model.tokenizer

prompt = """
Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off--then, I account it high time to get to sea as soon as I can.
"""

# Extract layer 0 MLP output from base model
with model.trace(prompt):
    mlp_0 = model.gpt_neox.layers[0].mlp.output.save()

# Use SAE to get features from activations
features = ae.encode(mlp_0.value)
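Before ranking features, it is worth checking how sparse these feature activations actually are. This check is our own addition; `features` has shape `(batch, tokens, dictionary_size)`, and the SAE’s ReLU encoder should drive most entries to exactly zero.

[ ]:
# How sparse are the SAE features? Most of the 32,768 features should be
# exactly zero for any given token.
frac_active = (features != 0).float().mean().item()
print(f"Fraction of features active per token: {frac_active:.4%}")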
[ ]:
# Find top features using the autoencoder
summed_activations = features.abs().sum(dim=1) # Sum activation magnitudes over all tokens for each feature
top_activations_indices = summed_activations.topk(20).indices # Get indices of top 20

compounded = []
for i in top_activations_indices[0]:
    compounded.append(features[:,:,i.item()].cpu()[0])

compounded = torch.stack(compounded, dim=0)

Visualization#

With Autoencoder#

Now let’s take a look at each of the top 20 most active features and what they respond to in our prompt. Notice that each feature activates on essentially just one token, making these features highly interpretable!

[ ]:
from circuitsvis.tokens import colored_tokens_multi

tokens = tokenizer.encode(prompt)
str_tokens = [tokenizer.decode(t) for t in tokens]

# Visualize activations for top 20 most prominent features
colored_tokens_multi(str_tokens, compounded.T)
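
The same information can be read off without circuitsvis. As a plain-text alternative (our own sketch), we can print, for each top feature, the token on which it fires most strongly:

[ ]:
# Text-only alternative: for each of the top 20 features, show the token
# where the feature's activation peaks and the activation value there.
for row, feat_idx in zip(compounded, top_activations_indices[0]):
    pos = row.argmax().item()
    print(f"feature {feat_idx.item():>6}: token {str_tokens[pos]!r} "
          f"(activation {row[pos].item():.2f})")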

Without Autoencoder (optional comparison)#

Without the autoencoder, the top neurons activate (positively or negatively) across many tokens, demonstrating how individual neurons can be difficult to interpret.

[ ]:
# Find top neurons using the MLP output
summed_activations_or = mlp_0.value.abs().sum(dim=1) # Sum activation magnitudes over all tokens for each neuron
top_activations_indices_or = summed_activations_or.topk(20).indices # Get indices of top 20

compounded_orig = []
for i in top_activations_indices_or[0]:
    compounded_orig.append(mlp_0.value[:,:,i.item()].cpu()[0])

compounded_orig = torch.stack(compounded_orig, dim=0)
[ ]:
from circuitsvis.tokens import colored_tokens_multi

tokens = tokenizer.encode(prompt)
str_tokens = [tokenizer.decode(t) for t in tokens]

# Visualize original activations for top 20 most prominent neurons
colored_tokens_multi(str_tokens, compounded_orig.T)