vllm

VLLM
Bases: RemoteableMixin
NNsight wrapper to conduct interventions on a vLLM inference engine.

Attributes:

- vllm_entrypoint (vllm.LLM): vLLM language model.
- tokenizer (vllm.transformers_utils.tokenizer.AnyTokenizer): tokenizer.
- logits (eproperty): logit tensor.
- samples (eproperty): sampled token ids.
Example:

```python
from nnsight.models.VLLM import VLLM
from vllm import SamplingParams

model = VLLM("gpt2")

prompt = ["The Eiffel Tower is in the city of"]

with model.trace(prompt, temperature=0.0, top_p=0.95, stop=['.']) as tracer:
    model.transformer.h[8].output[-1][:] = 0
    output = model.output.save()

print(model.tokenizer.decode(output.value.argmax(dim=-1)[-1]))
```
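Because the class exposes `logits` and `samples` as traceable attributes, intermediate generation state can be captured inside the same trace. A minimal sketch, assuming the two attributes behave like standard NNsight envoys whose `.output` can be saved (the sampling kwargs here are illustrative):

```python
# Capture the exposed logits and sampled token ids during generation.
# Assumes `model.logits` / `model.samples` follow the usual envoy pattern.
with model.trace(prompt, temperature=0.0, top_p=1.0, max_tokens=3) as tracer:
    logits = model.logits.output.save()    # logit tensor per generated token
    samples = model.samples.output.save()  # sampled token ids

print(samples.value)
```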
__call__
```python
__call__(prompts: List[str], params: List[NNsightSamplingParams], lora_requests: List[Any], **kwargs) -> Any
```
Execute synchronous vLLM generation with NNsight interventions.
Each mediator maps to exactly one prompt/param (1:1).
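A hedged sketch of that pairing, assuming `NNsightSamplingParams` extends vLLM's `SamplingParams` (so plain `SamplingParams` fields illustrate the shape of each entry):

```python
from vllm import SamplingParams

# Sketch of the 1:1 pairing __call__ expects: params[i] governs prompts[i].
prompts = [
    "The Eiffel Tower is in the city of",
    "The Colosseum is in the city of",
]
params = [
    SamplingParams(temperature=0.0, max_tokens=5),
    SamplingParams(top_p=0.95, max_tokens=5),
]
assert len(prompts) == len(params)  # exactly one sampling config per prompt
```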
generate
Alias for `trace` to match the `LanguageModel` API.
vLLM tracing is inherently multi-token (driven by `max_tokens`), so there is no separate generate vs. forward distinction like there is for HuggingFace causal LMs. `max_new_tokens` is accepted for cross-API portability and rewritten to `max_tokens`.
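To illustrate, the two blocks below make equivalent requests; a minimal sketch grounded in the alias behavior described above (the prompt text and token count are arbitrary):

```python
# `generate` aliases `trace`, and max_new_tokens is rewritten to vLLM's
# max_tokens, so both blocks request an 8-token continuation.
with model.generate("The Eiffel Tower is in the city of", max_new_tokens=8):
    out_a = model.output.save()

with model.trace("The Eiffel Tower is in the city of", max_tokens=8):
    out_b = model.output.save()
```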
interleave
Execute the traced function with vLLM, dispatching the engine if needed.