vLLM Support
In [ ]:
# # vLLM Support
# ## Summary
# [vLLM](https://github.com/vllm-project/vllm) is a popular library used for fast inference. By leveraging PagedAttention, dynamic batching, and Hugging Face model integration, vLLM makes inference more efficient and scalable for real-world applications.
#
# As of version 0.6, NNsight supports tracing and intervening on models run with vLLM.
# ```python
# # instantiating vllm model
# from nnsight.modeling.vllm import VLLM
#
# vllm = VLLM("model_ID")
# ```
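The dynamic batching mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy, not vLLM's actual scheduler (the names `run_continuous_batching` and `max_batch_size` are ours): requests join and leave the running batch between decode steps, so short requests don't hold long ones back.

```python
from collections import deque

def run_continuous_batching(requests, max_batch_size):
    """requests: list of (request_id, num_tokens_to_generate)."""
    pending = deque(requests)
    active = {}        # request_id -> tokens still to generate
    step_batches = []  # which requests ran at each decode step
    while pending or active:
        # Admit new requests whenever a batch slot is free.
        while pending and len(active) < max_batch_size:
            rid, n = pending.popleft()
            active[rid] = n
        step_batches.append(sorted(active))
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:  # finished requests free their slot
                del active[rid]
    return step_batches

batches = run_continuous_batching([("A", 1), ("B", 3), ("C", 2)], max_batch_size=2)
print(batches)  # [['A', 'B'], ['B', 'C'], ['B', 'C']]
```

Note how request C is admitted the moment A finishes, instead of waiting for the whole batch to complete.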
# ## When to Use
# `vLLM` is useful for performance speed-ups, particularly for experiments with multiple batches or generations.
#
# A few considerations when choosing to use `vLLM` for your experiments:
#
# - `NNsight` supports `vLLM` text-generation models. You can find a list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models/#text-generation). Our support currently does not extend to multimodal or image generation models.
# - Be aware that `vLLM` results may also differ from the base `Transformer` model results, even for the same experiment.
# - Note that `vLLM` models do not support gradients; if your research requires gradient methods, use `LanguageModel` instead.
#
# <details>
# <summary>
# More info:
# </summary>
#
# `vLLM` speeds up inference through its paged attention mechanism. As a consequence, gradient access and backward passes are not supported for vLLM models, and calling gradient operations through the `nnsight` `vLLM` wrapper will raise an error.
# </details>
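For intuition, the paged KV-cache idea behind PagedAttention can be sketched in plain Python. This toy `BlockTable` is our own illustration, not vLLM's implementation: a sequence's KV cache lives in fixed-size blocks, and a per-sequence table maps logical token positions to physical blocks, so memory is allocated on demand rather than contiguously.

```python
BLOCK_SIZE = 4  # tokens per physical cache block (toy value)

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of physical block ids
        self.table = []                # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop(0))
        self.num_tokens += 1

    def physical_slot(self, position):
        # Translate a logical token position into (block id, offset in block).
        return self.table[position // BLOCK_SIZE], position % BLOCK_SIZE

seq = BlockTable(free_blocks=[7, 3, 9])
for _ in range(6):  # cache 6 tokens -> needs two blocks
    seq.append_token()
print(seq.table)             # [7, 3]
print(seq.physical_slot(5))  # (3, 1)
```

Because blocks are allocated lazily and looked up indirectly at attention time, the forward pass is heavily optimized for inference, which is also why backward passes are off the table.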
#
#
#
# ## How to Use
# ### Setup
# You will need `nnsight >= 0.6`, `vllm == 0.15.1`, and `triton == 3.5.0` to run vLLM with NNsight.
In [ ]:
from IPython.display import clear_output
from pprint import pprint
get_ipython().run_line_magic('pip', 'install -U nnsight triton==3.5.0 vllm==0.15.1')
clear_output()
# ### Instantiating vLLM Models
# Next, let's load in our NNsight vLLM model (list of vLLM-supported models & their IDs [here](https://docs.vllm.ai/en/latest/models/supported_models/#text-generation)).
#
# For this exercise, we will use `meta-llama/Llama-3.1-8B`. Note that Meta gates access to Llama models on Hugging Face, so you will need an `HF_TOKEN` with approved access, or you can substitute an ungated model.
#
# vLLM models require a supported GPU/backend to run.
In [ ]:
from nnsight.modeling.vllm import VLLM
# vLLM supports explicit parallelism
vllm = VLLM("meta-llama/Llama-3.1-8B", dispatch=True, tensor_parallel_size=1, gpu_memory_utilization=0.8)
clear_output()
print(vllm)
# We now have a vLLM model that runs with `nnsight`.
# ### Interventions on vLLM models
# You can access and intervene on model internals in `vLLM` models just like you do for `LanguageModel` models through `nnsight`'s `get` and `set` operations.
#
# ⚠️ **Note**: As mentioned earlier, because of differences in `vLLM` inference settings and other implementation details, results may differ from `Transformers`, even for the same intervention!
#
# Let's load up a `LanguageModel` instance of the same `vLLM` model so we can compare the two. Here, we're loading in `Llama-3.1-8B` and making an intervention on identified antonym neurons, which should change the output to the antonym of the expected output.
In [ ]:
# Use the HuggingFace transformers backend for comparison
from nnsight import LanguageModel
neurons = [394, 5490, 8929]
prompt = "The truth is the"
# Use CUDA_VISIBLE_DEVICES in your env, not tensor_parallel_size
lm = LanguageModel("meta-llama/Llama-3.1-8B", dispatch=True, device_map="auto")
mlp = lm.model.layers[16].mlp.down_proj
with lm.trace(prompt):
    mlp.input[:, -1, neurons] = 10  # batch dimension
    out = lm.output.save()
    last = out["logits"][:, -1].argmax()  # dict of tensors
    prediction = lm.tokenizer.decode(last).save()
print(f"Prediction with transformers: '{prediction}'")
# Great, the antonym neurons appeared to do their job.
#
# Now, let's intervene on the same neurons for the `vLLM` model and see how the result changes.
In [ ]:
neurons = [394, 5490, 8929]
prompt = "The truth is the"
mlp = vllm.model.layers[16].mlp.down_proj
with vllm.trace(prompt):
    mlp.input = mlp.input.clone()
    mlp.input[-1, neurons] = 10  # no batch dimension
    out = vllm.output.save()
    last = out[:, -1].argmax()  # returns a tensor
    prediction = vllm.tokenizer.decode(last).save()
print(f"Prediction with vLLM: '{prediction}'")
# As expected, the results were different, indicating that these models are not interchangeable. Keep these differences in mind when working with `vLLM` vs `Transformers` models and making comparisons between the two.
# ### Batching Multiple Prompts
#
# With `LanguageModel`, you can pass multiple prompts to a single invoke:
# ```python
# with lm.trace(["prompt A", "prompt B"]):
# ...
# ```
#
# **With vLLM, each invoke must contain exactly one prompt.** This is because vLLM treats each prompt as an independent request with its own scheduling, sampling parameters, and finish condition. Under the hood, each invoke maps 1:1 to a vLLM request.
#
# To batch multiple prompts, use a **loop of invokes** inside a single trace. vLLM's engine automatically batches the requests together for efficient GPU execution:
In [ ]:
prompts = [
    "The Eiffel Tower is in the city of",
    "Madison Square Garden is in the city of",
    "The Colosseum is in the city of",
]
with vllm.trace(temperature=0.0, top_p=1) as tracer:
    predictions = list().save()
    for prompt in prompts:
        with tracer.invoke(prompt):
            token_id = vllm.logits.output.argmax(dim=-1)
            predictions.append(vllm.tokenizer.decode(token_id))
for prompt, pred in zip(prompts, predictions):
    print(f"{prompt}{pred}")
# Each invoke runs its own intervention code, but vLLM batches the underlying GPU computation across all prompts in the trace.
#
# #### Collecting results across invokes with shared state
#
# You can define variables at the trace scope and reference them inside multiple invokes. This is useful for collecting results from each prompt into a shared structure:
In [ ]:
prompts = [
    "The Eiffel Tower is in the city of",
    "Madison Square Garden is in the city of",
    "The Colosseum is in the city of",
]
num_tokens = 5
with vllm.trace(temperature=0.0, top_p=1, max_tokens=num_tokens) as tracer:
    # Shared list defined at trace scope; each invoke appends to it
    all_tokens = [list() for _ in range(len(prompts))].save()
    for i, prompt in enumerate(prompts):
        with tracer.invoke(prompt):
            # tracer.all() applies to every generation step
            with tracer.all():
                all_tokens[i].append(vllm.samples.output.item())
for i, prompt in enumerate(prompts):
    generated = vllm.tokenizer.decode(all_tokens[i])
    print(f"{prompt}{generated}")
# **Key points:**
# - **One prompt per invoke** — use a loop of `tracer.invoke()` calls, not a list of prompts
# - **Shared state** — variables defined at trace scope (like `all_tokens` above) are shared across all invokes; each invoke can read and mutate them
# - **`.save()` on shared variables** — call `.save()` on any trace-scope variable you want to access after the trace exits
# - **Per-invoke sampling params** — you can pass different sampling kwargs to each invoke (e.g., `tracer.invoke(prompt, temperature=0.8)`)
# ### Sampled Token Traceability
# `vLLM` provides functionality to configure how each sequence samples its next token. Here's an example of how you can trace token sampling operations of models with the `nnsight` vLLM wrapper.
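For intuition about the `temperature` and `top_p` knobs used in this tutorial, here is a toy pure-Python sampler. It is our own illustrative sketch, not vLLM's implementation (the function name `sample_next_token` is hypothetical): temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the smallest set of tokens whose cumulative probability reaches `top_p` before sampling.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    rng = rng or random.Random(0)
    if temperature == 0.0:  # greedy decoding: just take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the most probable tokens until their
    # cumulative probability reaches top_p, then sample from that set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1]
print(sample_next_token(logits, temperature=0.0))  # greedy -> 0
```

With `temperature=0.0` (as in the batching examples above) the sample always equals the logits' argmax; with a higher temperature and `top_p < 1`, the sampled token can differ from the argmax, which is exactly why `vllm.samples` and `vllm.logits` are traced separately below.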
In [ ]:
with vllm.trace("Madison Square Garden is located in the city of", temperature=0.8, top_p=0.95, max_tokens=3) as tracer:
    samples = list().save()
    logits = list().save()
    with tracer.iter[:3]:
        logits.append(vllm.logits.output)
        samples.append(vllm.samples.output)
pprint(samples)
pprint(logits)  # with these sampling parameters, sampled tokens need not equal the logits' argmax
# ### Other features
#
# #### Intervening on generated token iterations with .all() and .iter[]
# NNsight supports iterating over generation steps via `tracer.all()` and `tracer.iter[]`.
In [ ]:
with vllm.trace("Hello World!", max_tokens=10) as tracer:
    outputs = list().save()
    # will iterate over all 10 generated tokens
    with tracer.all():
        out = vllm.output[:, -1]
        outputs.append(out)
print(len(outputs))
print("".join(vllm.tokenizer.decode(output.argmax()) for output in outputs))
In [ ]:
prompt = 'The Eiffel Tower is in the city of'
mlp = vllm.model.layers[16].mlp.down_proj
n_new_tokens = 50
with vllm.trace(prompt, max_tokens=n_new_tokens) as tracer:
    hidden_states = list().save()  # initialize & .save() the list
    # Use tracer.iter[] to apply the intervention at specific generation steps
    with tracer.iter[2:5]:
        # Apply intervention: zero out the last token position
        mlp.input = mlp.input.clone()
        mlp.input[-1] = 0
        # Append hidden state post-intervention
        hidden_states.append(mlp.input)  # no need to call .save() again
print("Hidden state length:", len(hidden_states))
pprint(hidden_states)