vLLM Support#

vLLM is a popular library used for fast inference. By leveraging PagedAttention, continuous batching, and Hugging Face model integration, vLLM makes inference more efficient and scalable for real-world applications.

Starting with version 0.4, NNsight includes support for internal investigations of vLLM models.

Setup#

You will need nnsight 0.4, vllm (this tutorial uses 0.6.4.post1), and triton 3.1.0 installed to use vLLM with NNsight.

[1]:
from IPython.display import clear_output
try:
    import google.colab
    is_colab = True
except ImportError:
    is_colab = False

if is_colab:
    !pip install -U nnsight
clear_output()
[2]:
# install vllm
!pip install vllm==0.6.4.post1

# install triton 3.1.0
!pip install triton==3.1.0

clear_output()

Next, let’s load in our NNsight-supported vLLM model. You can find the list of vLLM-supported models at https://docs.vllm.ai/en/latest/models/supported_models.html. For this exercise, we will use GPT-2.

Please note that vLLM models require a GPU to run.
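
If you are unsure whether your environment has a GPU, a quick check before loading the model can save time. This is a minimal sketch using PyTorch (which NNsight and vLLM already depend on):

import torch

# vLLM models require a CUDA-capable GPU; verify one is visible before loading.
assert torch.cuda.is_available(), "A GPU is required to run vLLM models."
print("Using GPU:", torch.cuda.get_device_name(0))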

[1]:
from IPython.display import clear_output
from nnsight.modeling.vllm import VLLM

# NNsight's VLLM wrapper currently supports device="cuda" and device="auto"
vllm = VLLM("gpt2", device="auto", dispatch=True) # See supported models: https://docs.vllm.ai/en/latest/models/supported_models.html

clear_output()
print(vllm)
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): VocabParallelEmbedding(num_embeddings=50304, embedding_dim=768, org_vocab_size=50257, num_embeddings_padded=50304, tp_size=1)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): QKVParallelLinear(in_features=768, output_features=2304, bias=True, tp_size=1, gather_output=False)
          (c_proj): RowParallelLinear(input_features=768, output_features=768, bias=True, tp_size=1, reduce_results=True)
          (attn): Attention(head_size=64, num_heads=12, num_kv_heads=12, scale=0.125, backend=XFormersImpl)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): ColumnParallelLinear(in_features=768, output_features=3072, bias=True, tp_size=1, gather_output=False)
          (c_proj): RowParallelLinear(input_features=3072, output_features=768, bias=True, tp_size=1, reduce_results=True)
          (act): NewGELU()
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): VocabParallelEmbedding(num_embeddings=50304, embedding_dim=768, org_vocab_size=50257, num_embeddings_padded=50304, tp_size=1)
  (logits_processor): LogitsProcessor(vocab_size=50257, forg_vocab_size=50257, scale=1.0, logits_as_input=False)
  (sampler): Sampler()
  (logits): WrapperModule()
  (samples): WrapperModule()
)

Interventions on vLLM models#

We now have a vLLM model that runs with NNsight. Let’s try applying some interventions to it.

Note that vLLM accepts sampling parameters such as temperature and top_p. These parameters can be passed to the .trace() or .invoke() contexts. For deterministic, greedy decoding (the default model behavior), set temperature = 0 and top_p = 1. For more information about sampling parameters, see the vLLM documentation.
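
As a minimal sketch (assuming invoke-level parameters are accepted with the same names as in .trace(), per the note above), you can also set sampling parameters for an individual prompt inside .invoke():

# Sketch: per-prompt sampling parameters passed to .invoke()
# (parameter names assumed to mirror those accepted by .trace()).
with vllm.trace(max_tokens=1) as tracer:
    with tracer.invoke("The Eiffel Tower is located in the city of",
                       temperature=0.0, top_p=1.0):
        greedy_logits = vllm.logits.output.save()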

[2]:
with vllm.trace(temperature=0.0, top_p=1.0, max_tokens=1) as tracer:
  with tracer.invoke("The Eiffel Tower is located in the city of"):
    clean_logits = vllm.logits.output.save()

  with tracer.invoke("The Eiffel Tower is located in the city of"):
    vllm.transformer.h[-2].mlp.output[:] = 0
    corrupted_logits = vllm.logits.output.save()
Processed prompts: 100%|██████████| 2/2 [00:00<00:00, 52.35it/s, est. speed input: 577.49 toks/s, output: 52.49 toks/s]
[3]:
print("\nCLEAN - The Eiffel Tower is located in the city of", vllm.tokenizer.decode(clean_logits.argmax(dim=-1)))
print("\nCORRUPTED - The Eiffel Tower is located in the city of", vllm.tokenizer.decode(corrupted_logits.argmax(dim=-1)))

CLEAN - The Eiffel Tower is located in the city of  Paris

CORRUPTED - The Eiffel Tower is located in the city of  London

We’ve successfully performed an intervention on our vLLM model!
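
To look beyond the single argmax token, you can compare the top few predictions of the clean and corrupted runs. A small sketch using torch.topk, assuming the saved logits have shape [1, vocab_size] as above:

import torch

# Compare the top-5 next-token predictions of the clean and corrupted runs.
for name, logits in [("CLEAN", clean_logits), ("CORRUPTED", corrupted_logits)]:
    top = torch.topk(logits[0], k=5)
    tokens = [vllm.tokenizer.decode(idx) for idx in top.indices.tolist()]
    print(f"{name} top-5:", tokens)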

Sampled Token Traceability#

vLLM provides functionality to configure how each sequence samples its next token. Here’s an example of how you can trace token sampling operations with the NNsight VLLM wrapper.

[4]:
import nnsight
with vllm.trace("Madison Square Garden is located in the city of", temperature=0.8, top_p=0.95, max_tokens=3) as tracer:
    samples = nnsight.list().save()
    logits = nnsight.list().save()

    for ii in range(3):
        samples.append(vllm.samples.output)
        vllm.samples.next()
        logits.append(vllm.logits.output)
        vllm.logits.next()

print("Samples: ", samples)
print("Logits: ", logits) # different than samples with current sampling parameters
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.98it/s, est. speed input: 36.59 toks/s, output: 12.20 toks/s]
Samples:  [tensor([16940]), tensor([319]), tensor([262])]
Logits:  [tensor([[-109.0625, -107.9375, -111.6875,  ..., -115.3750, -116.5625,
         -108.8750]], device='cuda:0', dtype=torch.float16), tensor([[ -88.1250,  -89.4375,  -93.4375,  ..., -101.5625,  -98.7500,
          -90.2500]], device='cuda:0', dtype=torch.float16), tensor([[-90.1875, -89.0000, -92.6250,  ..., -96.7500, -95.2500, -88.8125]],
       device='cuda:0', dtype=torch.float16)]
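
Since each entry in samples is a tensor of sampled token ids, you can decode the generated continuation directly. A small sketch, assuming the samples list collected above:

import torch

# Each element of `samples` is a one-token tensor; concatenate and decode them.
generated = vllm.tokenizer.decode(torch.cat(samples).tolist())
print("Generated continuation:", generated)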

Note: gradients are not supported with vLLM

vLLM speeds up inference through its paged attention mechanism and only supports inference. This means that gradients and backward passes are not available for vLLM models, and calling gradient operations with the NNsight vLLM wrapper will throw an error.
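
As an illustration only (the exact exception type and message depend on your nnsight and vLLM versions), a gradient operation inside a trace is expected to fail:

# Illustration: gradient operations are not supported on vLLM models.
try:
    with vllm.trace("The Eiffel Tower is located in the city of",
                    temperature=0.0, top_p=1.0, max_tokens=1):
        hidden = vllm.transformer.h[-2].output.save()
        hidden.sum().backward()  # backward passes rely on autograd, which vLLM does not build
except Exception as err:
    print("Gradient operation failed as expected:", err)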

Known Issues#

  • The vllm.LLM engine performs max_tokens + 1 forward passes, which can lead to undesired behavior if you are running interventions on all iterations of multi-token generation. A workaround sketch follows the example below.

Example:

with vllm_gpt2("Hello World!", max_tokens=10):
    logits = nnsight.list().save()
    with vllm_gpt2.logits.all():
        logits.append(vllm_gpt2.logits.output)

print(len(logits))

>>> 11 # expected: 10
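
One way to avoid the extra iteration is to step through generation explicitly with .next(), as in the sampling example above, collecting values for exactly max_tokens iterations (or simply truncating the list afterwards). A sketch, reusing the hypothetical vllm_gpt2 model from the example:

import nnsight

# Workaround sketch: collect logits for exactly `max_tokens` iterations
# by advancing the generation loop explicitly with .next().
max_tokens = 10
with vllm_gpt2.trace("Hello World!", max_tokens=max_tokens):
    logits = nnsight.list().save()
    for _ in range(max_tokens):
        logits.append(vllm_gpt2.logits.output)
        vllm_gpt2.logits.next()

print(len(logits))  # expected: 10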
