vLLM Support
In [ ]:
# # vLLM Support
# ## Summary
# [vLLM](https://github.com/vllm-project/vllm) is a popular library used for fast inference. By leveraging PagedAttention, dynamic batching, and Hugging Face model integration, vLLM makes inference more efficient and scalable for real-world applications.
#
# As of version 0.6, NNsight supports tracing and intervening on models run with vLLM.
# ```python
# # instantiating vllm model
# from nnsight.modeling.vllm import VLLM
#
# vllm = VLLM("model_ID")
# ```
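The dynamic batching mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy, not vLLM's actual scheduler (the names `run_continuous_batching` and `max_batch_size` are ours): requests join and leave the running batch between decode steps, so short requests don't hold long ones back.

```python
from collections import deque

def run_continuous_batching(requests, max_batch_size):
    """requests: list of (request_id, num_tokens_to_generate)."""
    pending = deque(requests)
    active = {}        # request_id -> tokens still to generate
    step_batches = []  # which requests ran at each decode step
    while pending or active:
        # Admit new requests whenever a batch slot is free.
        while pending and len(active) < max_batch_size:
            rid, n = pending.popleft()
            active[rid] = n
        step_batches.append(sorted(active))
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:  # finished requests free their slot
                del active[rid]
    return step_batches

batches = run_continuous_batching([("A", 1), ("B", 3), ("C", 2)], max_batch_size=2)
print(batches)  # [['A', 'B'], ['B', 'C'], ['B', 'C']]
```

Note how request C is admitted the moment A finishes, instead of waiting for the whole batch to complete.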
# ## When to Use
# `vLLM` is useful for performance speed-ups, particularly for experiments with multiple batches or generations.
#
# A few considerations when choosing to use `vLLM` for your experiments:
#
# - `NNsight` supports `vLLM` text-generation models. You can find a list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models/#text-generation). Our support currently does not extend to multimodal or image generation models.
# - Be aware that `vLLM` results may also differ from the base `Transformer` model results, even for the same experiment.
# - Note that `vLLM` models do not support gradients; if your research requires gradient methods, use `LanguageModel` instead.
#
# <details>
# <summary>
# More info:
# </summary>
#
# `vLLM` speeds up inference through its paged attention mechanism. As a consequence, gradient access and backward passes are not supported for vLLM models, and calling gradient operations through the `nnsight` `vLLM` wrapper will raise an error.
# </details>
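For intuition, the paged KV-cache idea behind PagedAttention can be sketched in plain Python. This toy `BlockTable` is our own illustration, not vLLM's implementation: a sequence's KV cache lives in fixed-size blocks, and a per-sequence table maps logical token positions to physical blocks, so memory is allocated on demand rather than contiguously.

```python
BLOCK_SIZE = 4  # tokens per physical cache block (toy value)

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of physical block ids
        self.table = []                # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop(0))
        self.num_tokens += 1

    def physical_slot(self, position):
        # Translate a logical token position into (block id, offset in block).
        return self.table[position // BLOCK_SIZE], position % BLOCK_SIZE

seq = BlockTable(free_blocks=[7, 3, 9])
for _ in range(6):  # cache 6 tokens -> needs two blocks
    seq.append_token()
print(seq.table)             # [7, 3]
print(seq.physical_slot(5))  # (3, 1)
```

Because blocks are allocated lazily and looked up indirectly at attention time, the forward pass is heavily optimized for inference, which is also why backward passes are off the table.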
#
#
#
# ## How to Use
# ### Setup
# You will need `nnsight >= 0.6`, `vllm == 0.15.1`, and `triton == 3.5.0` to run vLLM with NNsight.
In [ ]:
from IPython.display import clear_output
from pprint import pprint
get_ipython().run_line_magic('pip', 'install -U nnsight triton==3.5.0 vllm==0.15.1')
clear_output()
# ### Instantiating vLLM Models
# Next, let's load in our NNsight vLLM model (list of vLLM-supported models & their IDs [here](https://docs.vllm.ai/en/latest/models/supported_models/#text-generation)).
#
# For this exercise, we will use `meta-llama/Llama-3.1-8B`. Note that Meta gates access to Llama models on Hugging Face, so you will need an `HF_TOKEN` with approved access, or you can substitute an ungated model.
#
# vLLM models require a supported GPU/backend to run.
In [ ]:
from nnsight.modeling.vllm import VLLM
# vLLM supports explicit parallelism
vllm = VLLM("meta-llama/Llama-3.1-8B", dispatch=True, tensor_parallel_size=1, gpu_memory_utilization=0.8)
clear_output()
print(vllm)
# We now have a vLLM model that runs with `nnsight`.
# ### Interventions on vLLM models
# You can access and intervene on model internals in `vLLM` models just like you do for `LanguageModel` models through `nnsight`'s `get` and `set` operations.
#
# ⚠️ **Note**: As mentioned earlier, because of differences in `vLLM` inference settings and other implementation details, results may differ from `Transformers`, even for the same intervention!
#
# Let's load up a `LanguageModel` instance of the same `vLLM` model so we can compare the two. Here, we're loading in `Llama-3.1-8B` and making an intervention on identified antonym neurons, which should change the output to the antonym of the expected output.
In [ ]:
# Use the HuggingFace transformers backend for comparison
from nnsight import LanguageModel
neurons = [394, 5490, 8929]
prompt = "The truth is the"
# Use CUDA_VISIBLE_DEVICES in your env, not tensor_parallel_size
lm = LanguageModel("meta-llama/Llama-3.1-8B", dispatch=True, device_map="auto")
mlp = lm.model.layers[16].mlp.down_proj
with lm.trace(prompt):
    mlp.input[:, -1, neurons] = 10  # batch dimension
    out = lm.output.save()
    last = out["logits"][:, -1].argmax()  # dict of tensors
    prediction = lm.tokenizer.decode(last).save()
print(f"Prediction with transformers: '{prediction}'")
# Great, the antonym neurons appeared to do their job.
#
# Now, let's intervene on the same neurons for the `vLLM` model and see how the result changes.
In [ ]:
neurons = [394, 5490, 8929]
prompt = "The truth is the"
mlp = vllm.model.layers[16].mlp.down_proj
with vllm.trace(prompt):
    mlp.input = mlp.input.clone()
    mlp.input[-1, neurons] = 10  # no batch dimension
    out = vllm.output.save()
    last = out[:, -1].argmax()  # returns a tensor
    prediction = vllm.tokenizer.decode(last).save()
print(f"Prediction with vLLM: '{prediction}'")
# As expected, the results were different, indicating that these models are not interchangeable. Keep these differences in mind when working with `vLLM` vs `Transformers` models and making comparisons between the two.
# ### Batching Multiple Prompts
#
# With `LanguageModel`, you can pass multiple prompts to a single invoke:
# ```python
# with lm.trace(["prompt A", "prompt B"]):
# ...
# ```
#
# **With vLLM, each invoke must contain exactly one prompt.** This is because vLLM treats each prompt as an independent request with its own scheduling, sampling parameters, and finish condition. Under the hood, each invoke maps 1:1 to a vLLM request.
#
# To batch multiple prompts, use a **loop of invokes** inside a single trace. vLLM's engine automatically batches the requests together for efficient GPU execution:
In [ ]:
prompts = [
    "The Eiffel Tower is in the city of",
    "Madison Square Garden is in the city of",
    "The Colosseum is in the city of",
]
with vllm.trace(temperature=0.0, top_p=1) as tracer:
    predictions = list().save()
    for prompt in prompts:
        with tracer.invoke(prompt):
            token_id = vllm.logits.output.argmax(dim=-1)
            predictions.append(vllm.tokenizer.decode(token_id))
for prompt, pred in zip(prompts, predictions):
    print(f"{prompt}{pred}")
# Each invoke runs its own intervention code, but vLLM batches the underlying GPU computation across all prompts in the trace.
#
# #### Collecting results across invokes with shared state
#
# You can define variables at the trace scope and reference them inside multiple invokes. This is useful for collecting results from each prompt into a shared structure:
In [ ]:
prompts = [
    "The Eiffel Tower is in the city of",
    "Madison Square Garden is in the city of",
    "The Colosseum is in the city of",
]
num_tokens = 5
with vllm.trace(temperature=0.0, top_p=1, max_tokens=num_tokens) as tracer:
    # Shared list defined at trace scope; each invoke appends to it
    all_tokens = [list() for _ in range(len(prompts))].save()
    for i, prompt in enumerate(prompts):
        with tracer.invoke(prompt):
            # tracer.all() applies to every generation step
            with tracer.all():
                all_tokens[i].append(vllm.samples.output.item())
for i, prompt in enumerate(prompts):
    generated = vllm.tokenizer.decode(all_tokens[i])
    print(f"{prompt}{generated}")
# **Key points:**
# - **One prompt per invoke** — use a loop of `tracer.invoke()` calls, not a list of prompts
# - **Shared state** — variables defined at trace scope (like `all_tokens` above) are shared across all invokes; each invoke can read and mutate them
# - **`.save()` on shared variables** — call `.save()` on any trace-scope variable you want to access after the trace exits
# - **Per-invoke sampling params** — you can pass different sampling kwargs to each invoke (e.g., `tracer.invoke(prompt, temperature=0.8)`)
# ### Sampled Token Traceability
# `vLLM` provides functionality to configure how each sequence samples its next token. Here's an example of how you can trace token sampling operations of models with the `nnsight` vLLM wrapper.
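For intuition about the `temperature` and `top_p` knobs used in this tutorial, here is a toy pure-Python sampler. It is our own illustrative sketch, not vLLM's implementation (the function name `sample_next_token` is hypothetical): temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the smallest set of tokens whose cumulative probability reaches `top_p` before sampling.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    rng = rng or random.Random(0)
    if temperature == 0.0:  # greedy decoding: just take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the most probable tokens until their
    # cumulative probability reaches top_p, then sample from that set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1]
print(sample_next_token(logits, temperature=0.0))  # greedy -> 0
```

With `temperature=0.0` (as in the batching examples above) the sample always equals the logits' argmax; with a higher temperature and `top_p < 1`, the sampled token can differ from the argmax, which is exactly why `vllm.samples` and `vllm.logits` are traced separately below.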
In [ ]:
with vllm.trace("Madison Square Garden is located in the city of", temperature=0.8, top_p=0.95, max_tokens=3) as tracer:
    samples = list().save()
    logits = list().save()
    with tracer.iter[:3]:
        logits.append(vllm.logits.output)
        samples.append(vllm.samples.output)
pprint(samples)
pprint(logits)  # with these sampling parameters, sampled tokens need not equal the logits' argmax
# ### Other features
#
# #### Intervening on generated token iterations with .all() and .iter[]
# NNsight supports iterating over generation steps via `tracer.all()` and `tracer.iter[]`.
In [ ]:
with vllm.trace("Hello World!", max_tokens=10) as tracer:
    outputs = list().save()
    # will iterate over all 10 generated tokens
    with tracer.all():
        out = vllm.output[:, -1]
        outputs.append(out)
print(len(outputs))
print("".join(vllm.tokenizer.decode(output.argmax()) for output in outputs))
In [ ]:
prompt = 'The Eiffel Tower is in the city of'
mlp = vllm.model.layers[16].mlp.down_proj
n_new_tokens = 50
with vllm.trace(prompt, max_tokens=n_new_tokens) as tracer:
    hidden_states = list().save()  # initialize & .save() the list
    # Use tracer.iter[] to apply the intervention at specific generation steps
    with tracer.iter[2:5]:
        # Apply intervention: zero out the last token position
        mlp.input = mlp.input.clone()
        mlp.input[-1] = 0
        # Append hidden state post-intervention
        hidden_states.append(mlp.input)  # no need to call .save() again
print("Hidden state length:", len(hidden_states))
pprint(hidden_states)