vllm

VLLM
Bases: RemoteableMixin
NNsight wrapper to conduct interventions on a vLLM inference engine.

Attributes:

- vllm_entrypoint (vllm.LLM): vLLM language model.
- tokenizer (vllm.transformers_utils.tokenizer.AnyTokenizer): tokenizer.
- logits (eproperty): logit tensor.
- samples (eproperty): sampled token ids.
Example:

```python
from nnsight.models.VLLM import VLLM
from vllm import SamplingParams

model = VLLM("gpt2")

prompt = ["The Eiffel Tower is in the city of"]

with model.trace(prompt, temperature=0.0, top_p=0.95, stop=['.']) as tracer:
    model.transformer.h[8].output[-1][:] = 0
    output = model.output.save()

print(model.tokenizer.decode(output.value.argmax(dim=-1)[-1]))
```
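Because the class exposes `logits` and `samples` as traceable attributes, intermediate generation state can be captured inside the same trace. A minimal sketch, assuming the two attributes behave like standard NNsight envoys whose `.output` can be saved (the sampling kwargs here are illustrative):

```python
# Capture the exposed logits and sampled token ids during generation.
# Assumes `model.logits` / `model.samples` follow the usual envoy pattern.
with model.trace(prompt, temperature=0.0, top_p=1.0, max_tokens=3) as tracer:
    logits = model.logits.output.save()    # logit tensor per generated token
    samples = model.samples.output.save()  # sampled token ids

print(samples.value)
```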
__call__
```python
__call__(prompts: List[str], params: List[NNsightSamplingParams], lora_requests: List[Any], **kwargs) -> Any
```
Execute synchronous vLLM generation with NNsight interventions.
Each mediator maps to exactly one prompt/param (1:1).
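A hedged sketch of that pairing, assuming `NNsightSamplingParams` extends vLLM's `SamplingParams` (so plain `SamplingParams` fields illustrate the shape of each entry):

```python
from vllm import SamplingParams

# Sketch of the 1:1 pairing __call__ expects: params[i] governs prompts[i].
prompts = [
    "The Eiffel Tower is in the city of",
    "The Colosseum is in the city of",
]
params = [
    SamplingParams(temperature=0.0, max_tokens=5),
    SamplingParams(top_p=0.95, max_tokens=5),
]
assert len(prompts) == len(params)  # exactly one sampling config per prompt
```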
generate
Alias for `trace` to match the `LanguageModel` API.
vLLM tracing is inherently multi-token (driven by `max_tokens`), so there is no separate generate vs. forward distinction like there is for HuggingFace causal LMs. `max_new_tokens` is accepted for cross-API portability and rewritten to `max_tokens`.
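To illustrate, the two blocks below make equivalent requests; a minimal sketch grounded in the alias behavior described above (the prompt text and token count are arbitrary):

```python
# `generate` aliases `trace`, and max_new_tokens is rewritten to vLLM's
# max_tokens, so both blocks request an 8-token continuation.
with model.generate("The Eiffel Tower is in the city of", max_new_tokens=8):
    out_a = model.output.save()

with model.trace("The Eiffel Tower is in the city of", max_tokens=8):
    out_b = model.output.save()
```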
interleave
Execute the traced function with vLLM, dispatching the engine if needed.