Remote Execution¶
NDIF is the backend service for nnsight that lets you run interventions on large models without a local GPU. Your code is packaged, sent to NDIF's servers, executed on the model, and only the .save()d results are returned over the internet.
You can access models as large as 405B parameters — the same nnsight code you write locally works remotely by adding remote=True.
Setup¶
import nnsight
from nnsight import LanguageModel, CONFIG
Getting an NDIF Account¶
To use NDIF, create a free account at login.ndif.us to get your API key. There are three ways to set it:
Option 1: Config (persistent)
from nnsight import CONFIG
CONFIG.set_default_api_key("your-api-key") # Saved to disk
Option 2: Environment variable
export NDIF_API_KEY="your-api-key"
Option 3: Google Colab secret
Add a secret named NDIF_API_KEY in Colab's Secrets panel — nnsight picks it up automatically.
Checking Available Models¶
Use nnsight.ndif_status() to see which models are currently deployed, or visit the status page.
print(nnsight.ndif_status())
You can also check if a specific model is available:
print(f"GPT-2 running: {nnsight.is_model_running('openai-community/gpt2')}")
GPT-2 running: True
Hotswapping pilot program
NDIF is piloting a hotswapping feature: request any supported model and it will be auto-deployed if there is available capacity. This is coming soon to all users. If you'd like early access, ask on the NDIF Discord.
Comparing Environments¶
Before running remotely, you can compare your local Python environment with NDIF's to catch potential version mismatches:
from nnsight import ndif
ndif.compare()
This displays a table highlighting package discrepancies, with critical packages (nnsight, transformers, torch) flagged prominently. Note that you no longer need the same Python version as the server — nnsight's serialization format supports Python 3.9+ regardless of server version.
Loading a Remote Model¶
When you instantiate a model for remote use, no weights are downloaded — just a lightweight skeleton on the meta device.
model = LanguageModel("openai-community/gpt2")
print(f"Device: {model.device}")
Device: meta
No GPU memory used
The model is loaded on the meta device, which means no actual parameters are allocated. This is just a skeleton that lets you define interventions. For large models like Llama-3.1-405B, this takes seconds instead of the minutes needed to download and load weights locally.
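The meta device is a standard PyTorch feature, so its behavior can be sketched independently of nnsight:

```python
import torch

# A meta tensor carries shape and dtype but allocates no storage.
w = torch.empty(4096, 4096, device="meta")
print(w.shape, w.device)

# Shape propagation still works, so a model skeleton can be traced
# without ever materializing its weights.
x = torch.empty(1, 4096, device="meta")
y = x @ w
print(y.shape)
```

This is why instantiating even a 405B-parameter model for remote use is nearly instantaneous: only the module structure exists locally.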
Basic Remote Tracing¶
Add remote=True to any .trace() call to execute on NDIF. Your interventions are packaged, sent to the server, and only the .save()d results come back.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    logit = model.lm_head.output[0, -1].argmax(dim=-1).save()

print(f"Prediction: {model.tokenizer.decode(logit)}")
Prediction: Paris
The status logs show the lifecycle of your request:
| Status | Meaning |
|---|---|
| RECEIVED | Request validated with your API key |
| QUEUED | Waiting in the model's queue |
| DISPATCHED | Sent to the model deployment |
| RUNNING | Your interventions are executing |
| LOG | Print statements from your code |
| COMPLETED | Results ready for download |
You can disable these logs:
CONFIG.APP.REMOTE_LOGGING = False
Remote Generation¶
Multi-token generation works the same way:
with model.generate("The Eiffel Tower is in the city of", max_new_tokens=3, remote=True) as tracer:
    tokens = list().save()
    for step in tracer.iter[:]:
        tokens.append(model.lm_head.output[0, -1].argmax(dim=-1))

for i, t in enumerate(tokens):
    print(f"Step {i}: {model.tokenizer.decode(t)}")
Step 0: Paris
Step 1: ,
Step 2: and
Saving Results¶
Only .save()d values are transferred back from the server. This is important: unsaved values exist only on the remote machine and are discarded after the request completes.
Minimize what you save
Every .save() call transfers data over the internet. For large tensors, use .detach().cpu() before saving to minimize download size. Only save what you actually need — argmax indices instead of full logit tensors, specific positions instead of full sequences.
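To make the size difference concrete, here is a back-of-the-envelope comparison in plain PyTorch (GPT-2-sized shapes assumed for illustration):

```python
import torch

# Hypothetical float32 logits: batch 1, 10 tokens, GPT-2's 50257-token vocab.
logits = torch.randn(1, 10, 50257)

# Saving the full tensor would ship every element over the network...
full_bytes = logits.numel() * logits.element_size()

# ...while the argmax index at the final position is a single int64.
pred = logits[0, -1].detach().cpu().argmax(dim=-1)
pred_bytes = pred.numel() * pred.element_size()

print(f"Full logits: {full_bytes:,} bytes vs. argmax index: {pred_bytes} bytes")
```

Even for this small model, saving the prediction instead of the logits cuts the download by five orders of magnitude.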
Sessions¶
A session bundles multiple traces into a single request. This means one queue wait, and values from earlier traces are available in later ones without needing .save().
with model.session(remote=True):
    # First trace: capture hidden states from the Eiffel Tower prompt
    with model.trace("The Eiffel Tower is in the city of"):
        hs = model.transformer.h[-1].output[0][:, -1, :]  # No .save() needed within a session

    # Second trace: clean baseline for the Colosseum
    with model.trace("The Colosseum is in the city of"):
        clean = model.lm_head.output[0, -1].argmax(dim=-1).save()

    # Third trace: patch the Eiffel Tower hidden states into the Colosseum prompt
    with model.trace("The Colosseum is in the city of"):
        model.transformer.h[-1].output[0][:, -1, :] = hs
        patched = model.lm_head.output[0, -1].argmax(dim=-1).save()

print(f"Clean: {model.tokenizer.decode(clean)}")
print(f"Patched: {model.tokenizer.decode(patched)}")
Clean: P
Patched: Paris
When to use sessions
- Activation patching — capture from one prompt, patch into another, all in one request
- Comparing interventions — run clean and modified traces back-to-back with shared state
- Multi-step experiments — chain traces that build on each other without round-trips
Sessions only require remote=True on the outer model.session() call.
Remote Gradients¶
Gradients are disabled on NDIF by default for performance. To enable them, set requires_grad = True on the tensor at the earliest point you need gradients.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    hs = model.transformer.h[5].output[0]
    hs.requires_grad = True
    logits = model.lm_head.output
    with logits.sum().backward():
        grad = hs.grad.save()

print(f"Gradient shape: {grad.shape}")
print(f"Gradient norm: {grad.norm():.4f}")
Gradient shape: torch.Size([1, 10, 768])
Gradient norm: 954368.0000
Gradient scope
Setting requires_grad = True enables gradient tracking from that point forward — only layers after it will have gradients. Set it as early as you need gradients, but no earlier, since gradient computation uses more memory on the server.
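The same scoping rule can be seen in plain PyTorch, outside any trace:

```python
import torch

a = torch.randn(3)            # upstream: not tracked
b = a.detach() * 2            # still upstream of the tracked point
b.requires_grad = True        # gradient tracking starts here
loss = (b ** 2).sum()
loss.backward()

print(a.grad)  # None: nothing before the tracked point gets a gradient
print(b.grad)  # 2 * b: gradients flow from the tracked point forward
```

In a remote trace, the tensor you flag plays the role of `b`: everything before it stays gradient-free, keeping server memory usage down.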
Whitelisted Packages¶
Your intervention code runs on NDIF's servers, so only approved packages are available. The current whitelist includes:
builtins, math, typing, collections, time, torch, numpy, einops, sympy, nnterp
Standard Python operations and PyTorch code work as expected inside traces.
Registering Custom Code¶
If your experiment uses functions or classes from your own local modules, register them so they're serialized and sent with your request:
from nnsight import ndif
import mymodule
ndif.register(mymodule)
model = LanguageModel("meta-llama/Llama-3.1-70B")
with model.trace("Hello world", remote=True):
result = mymodule.my_analysis(model).save()
Code defined directly in your script or working directory is auto-registered — ndif.register() is only needed for pip-installed local packages.
How registration works
nnsight serializes registered functions and classes by value (their source code) rather than by reference. This means they're rebuilt on the server, even if the package isn't installed there. This also means you no longer need the same Python version as the server.
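A rough illustration of by-value serialization (not NDIF's actual implementation — the function name is taken from the example above): the function's source text travels with the request and is re-executed on the other side.

```python
# Client side: the function is captured as source text...
src = "def my_analysis(x):\n    return x * 2\n"

# Server side: ...and rebuilt by executing that source,
# with no matching package installed there.
namespace = {}
exec(src, namespace)
print(namespace["my_analysis"](21))  # 42
```

Because only source text crosses the wire, the rebuilt function is independent of the client's module layout and Python version.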
Print Statements¶
print() inside a remote trace appears as LOG status messages — useful for debugging without saving intermediate values.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    logits = model.lm_head.output
    pred = logits[0, -1].argmax(dim=-1)
    print(f"Predicted token id: {pred}")
    decoded = model.tokenizer.decode(pred).save()

print(f"Result: {decoded}")
Result: Paris
Usage Limits¶
Each request has a one-hour time limit; if your intervention exceeds it, the request is terminated.
We're still figuring out the best way to configure the system for fairness across all users. If you run into limits or have feedback, let us know on the NDIF Discord.