VLLM

VLLM(*args: Any, **kwargs: Any)

Bases: RemoteableMixin

NNsight wrapper to conduct interventions on a vLLM inference engine.

Attributes:

- vllm_entrypoint (vllm.LLM): vLLM language model.
- tokenizer (vllm.transformers_utils.tokenizer.AnyTokenizer): tokenizer.
- logits (eproperty): logit tensor.
- samples (eproperty): sampled token ids.

```python
from nnsight.models.VLLM import VLLM
from vllm import SamplingParams

model = VLLM("gpt2")

prompt = ["The Eiffel Tower is in the city of"]

with model.trace(prompt, temperature=0.0, top_p=0.95, stop=['.']) as tracer:
    model.transformer.h[8].output[-1][:] = 0

    output = model.output.save()

print(model.tokenizer.decode(output.value.argmax(dim=-1)[-1]))
```

logits instance-attribute

logits: eproperty

samples instance-attribute

samples: eproperty

vllm_entrypoint instance-attribute

vllm_entrypoint: LLM = None

tokenizer instance-attribute

tokenizer: AnyTokenizer = None

__call__

__call__(prompts: List[str], params: List[NNsightSamplingParams], lora_requests: List[Any], **kwargs) -> Any

Execute synchronous vLLM generation with NNsight interventions.

Each intervention mediator maps to exactly one prompt/params pair (a strict 1:1 correspondence).
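The 1:1 correspondence can be pictured as zipping the request lists together. This is an illustrative sketch only, not nnsight's actual implementation, and `pair_requests` is a hypothetical helper name:

```python
# Hypothetical sketch of the 1:1 prompt/params pairing described above.
# `pair_requests` is an invented name, not part of nnsight or vllm.
def pair_requests(prompts, params):
    if len(prompts) != len(params):
        raise ValueError("each prompt needs exactly one SamplingParams")
    # zip(..., strict=True) (Python 3.10+) would enforce this as well
    return list(zip(prompts, params))
```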

trace

trace(*inputs, **kwargs)

generate

generate(*inputs, **kwargs)

Alias for `trace` to match the `LanguageModel` API.

vLLM tracing is inherently multi-token (driven by `max_tokens`), so there is no separate generate-vs-forward distinction as there is for HuggingFace causal LMs. `max_new_tokens` is accepted for cross-API portability and rewritten to `max_tokens`.
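The keyword rewrite can be sketched as below. This is a hedged illustration of the behavior just described, not nnsight's actual code, and `normalize_sampling_kwargs` is a hypothetical name:

```python
# Hedged sketch of the max_new_tokens -> max_tokens rewrite described
# above; the real keyword handling inside nnsight may differ.
def normalize_sampling_kwargs(kwargs: dict) -> dict:
    kwargs = dict(kwargs)  # do not mutate the caller's dict
    if "max_new_tokens" in kwargs:
        # vLLM's SamplingParams expects `max_tokens`; keep an explicit
        # `max_tokens` if the caller already passed one
        kwargs.setdefault("max_tokens", kwargs.pop("max_new_tokens"))
    return kwargs
```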

interleave

interleave(fn: Callable, *args: Any, **kwargs: Any) -> None

Execute the traced function with vLLM, dispatching the engine if needed.
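"Dispatching the engine if needed" follows the common lazy-initialization pattern: the heavy engine object is built on first use and reused afterward. A minimal sketch under that assumption (all names here are hypothetical, not nnsight internals):

```python
# Minimal lazy-dispatch sketch: the engine is constructed only the
# first time it is needed. `make_engine` is a hypothetical factory.
class LazyDispatcher:
    def __init__(self, make_engine):
        self._make_engine = make_engine
        self.engine = None

    def interleave(self, fn, *args, **kwargs):
        if self.engine is None:          # dispatch the engine if needed
            self.engine = self._make_engine()
        return fn(self.engine, *args, **kwargs)
```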

__getstate__

__getstate__()

__setstate__

__setstate__(state)