Remote Execution¶
NDIF is the backend service for nnsight that lets you run interventions on large models without a local GPU. Your code is packaged, sent to NDIF's servers, executed on the model, and only the .save()d results are returned over the internet.
You can access models as large as 405B parameters — the same nnsight code you write locally works remotely by adding remote=True.
Setup¶
import nnsight
from nnsight import LanguageModel, CONFIG
Getting an NDIF Account¶
To use NDIF, create a free account at login.ndif.us to get your API key. There are three ways to set it:
Option 1: Config (persistent)
from nnsight import CONFIG
CONFIG.set_default_api_key("your-api-key") # Saved to disk
Option 2: Environment variable
export NDIF_API_KEY="your-api-key"
Option 3: Google Colab secret
Add a secret named NDIF_API_KEY in Colab's Secrets panel — nnsight picks it up automatically.
Checking Available Models¶
Use nnsight.ndif_status() to see which models are currently deployed, or visit the status page.
print(nnsight.ndif_status())
You can also check if a specific model is available:
print(f"GPT-2 running: {nnsight.is_model_running('openai-community/gpt2')}")
GPT-2 running: True
Hotswapping pilot program
NDIF is piloting a hotswapping feature: request any supported model and it will be auto-deployed if there is available capacity. This is coming soon to all users. If you'd like early access, ask on the NDIF Discord.
Comparing Environments¶
Before running remotely, you can compare your local Python environment with NDIF's to catch potential version mismatches:
from nnsight import ndif
ndif.compare()
This displays a table highlighting package discrepancies, with critical packages (nnsight, transformers, torch) flagged prominently. Note that you no longer need the same Python version as the server — nnsight's serialization format supports Python 3.9+ regardless of server version.
Loading a Remote Model¶
When you instantiate a model for remote use, no weights are downloaded — just a lightweight skeleton on the meta device.
model = LanguageModel("openai-community/gpt2")
print(f"Device: {model.device}")
Device: meta
No GPU memory used
The model is loaded on the meta device, which means no actual parameters are allocated. This is just a skeleton that lets you define interventions. For large models like Llama-3.1-405B, this takes seconds instead of the minutes needed to download and load weights locally.
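The meta device is a standard PyTorch feature, so its behavior can be sketched independently of nnsight:

```python
import torch

# A meta tensor carries shape and dtype but allocates no storage.
w = torch.empty(4096, 4096, device="meta")
print(w.shape, w.device)

# Shape propagation still works, so a model skeleton can be traced
# without ever materializing its weights.
x = torch.empty(1, 4096, device="meta")
y = x @ w
print(y.shape)
```

This is why instantiating even a 405B-parameter model for remote use is nearly instantaneous: only the module structure exists locally.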
Basic Remote Tracing¶
Add remote=True to any .trace() call to execute on NDIF. Your interventions are packaged, sent to the server, and only the .save()d results come back.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    logit = model.lm_head.output[0, -1].argmax(dim=-1).save()

print(f"Prediction: {model.tokenizer.decode(logit)}")
Prediction: Paris
The status logs show the lifecycle of your request:
| Status | Meaning |
|---|---|
| RECEIVED | Request validated with your API key |
| QUEUED | Waiting in the model's queue |
| DISPATCHED | Sent to the model deployment |
| RUNNING | Your interventions are executing |
| LOG | Print statements from your code |
| COMPLETED | Results ready for download |
You can disable these logs:
CONFIG.APP.REMOTE_LOGGING = False
Remote Generation¶
Multi-token generation works the same way:
with model.generate("The Eiffel Tower is in the city of", max_new_tokens=3, remote=True) as tracer:
    tokens = list().save()
    for step in tracer.iter[:]:
        tokens.append(model.lm_head.output[0, -1].argmax(dim=-1))

for i, t in enumerate(tokens):
    print(f"Step {i}: {model.tokenizer.decode(t)}")
Step 0: Paris
Step 1: ,
Step 2: and
Saving Results¶
Only .save()d values are transferred back from the server. This is important: unsaved values exist only on the remote machine and are discarded after the request completes.
Minimize what you save
Every .save() call transfers data over the internet. For large tensors, use .detach().cpu() before saving to minimize download size. Only save what you actually need — argmax indices instead of full logit tensors, specific positions instead of full sequences.
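To make the size difference concrete, here is a back-of-the-envelope comparison in plain PyTorch (GPT-2-sized shapes assumed for illustration):

```python
import torch

# Hypothetical float32 logits: batch 1, 10 tokens, GPT-2's 50257-token vocab.
logits = torch.randn(1, 10, 50257)

# Saving the full tensor would ship every element over the network...
full_bytes = logits.numel() * logits.element_size()

# ...while the argmax index at the final position is a single int64.
pred = logits[0, -1].detach().cpu().argmax(dim=-1)
pred_bytes = pred.numel() * pred.element_size()

print(f"Full logits: {full_bytes:,} bytes vs. argmax index: {pred_bytes} bytes")
```

Even for this small model, saving the prediction instead of the logits cuts the download by five orders of magnitude.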
Sessions¶
A session bundles multiple traces into a single request. This means one queue wait, and values from earlier traces are available in later ones without needing .save().
with model.session(remote=True):
    # First trace: capture hidden states from the Eiffel Tower prompt
    with model.trace("The Eiffel Tower is in the city of"):
        hs = model.transformer.h[-1].output[0][:, -1, :]  # No .save() needed within a session

    # Second trace: clean baseline for the Colosseum
    with model.trace("The Colosseum is in the city of"):
        clean = model.lm_head.output[0, -1].argmax(dim=-1).save()

    # Third trace: patch the Eiffel Tower hidden states into the Colosseum prompt
    with model.trace("The Colosseum is in the city of"):
        model.transformer.h[-1].output[0][:, -1, :] = hs
        patched = model.lm_head.output[0, -1].argmax(dim=-1).save()

print(f"Clean: {model.tokenizer.decode(clean)}")
print(f"Patched: {model.tokenizer.decode(patched)}")
Clean: P
Patched: Paris
When to use sessions
- Activation patching — capture from one prompt, patch into another, all in one request
- Comparing interventions — run clean and modified traces back-to-back with shared state
- Multi-step experiments — chain traces that build on each other without round-trips
Sessions only require remote=True on the outer model.session() call.
Remote Gradients¶
Gradients are disabled on NDIF by default for performance. To enable them, set requires_grad = True on the tensor at the earliest point you need gradients.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    hs = model.transformer.h[5].output[0]
    hs.requires_grad = True
    logits = model.lm_head.output
    with logits.sum().backward():
        grad = hs.grad.save()

print(f"Gradient shape: {grad.shape}")
print(f"Gradient norm: {grad.norm():.4f}")
Gradient shape: torch.Size([1, 10, 768])
Gradient norm: 954368.0000
Gradient scope
Setting requires_grad = True enables gradient tracking from that point forward — only layers after it will have gradients. Set it as early as you need gradients, but no earlier, since gradient computation uses more memory on the server.
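The same scoping rule can be seen in plain PyTorch, outside any trace:

```python
import torch

a = torch.randn(3)            # upstream: not tracked
b = a.detach() * 2            # still upstream of the tracked point
b.requires_grad = True        # gradient tracking starts here
loss = (b ** 2).sum()
loss.backward()

print(a.grad)  # None: nothing before the tracked point gets a gradient
print(b.grad)  # 2 * b: gradients flow from the tracked point forward
```

In a remote trace, the tensor you flag plays the role of `b`: everything before it stays gradient-free, keeping server memory usage down.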
Whitelisted Packages¶
Your intervention code runs on NDIF's servers, so only approved packages are available. The current whitelist includes:
builtins, math, typing, collections, time, torch, numpy, einops, sympy, nnterp
Standard Python operations and PyTorch code work as expected inside traces.
Registering Custom Code¶
If your experiment uses functions or classes from your own local modules, register them so they're serialized and sent with your request:
from nnsight import ndif
import mymodule
ndif.register(mymodule)
model = LanguageModel("meta-llama/Llama-3.1-70B")
with model.trace("Hello world", remote=True):
result = mymodule.my_analysis(model).save()
Code defined directly in your script or working directory is auto-registered — ndif.register() is only needed for pip-installed local packages.
How registration works
nnsight serializes registered functions and classes by value (their source code) rather than by reference. This means they're rebuilt on the server, even if the package isn't installed there. This also means you no longer need the same Python version as the server.
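A rough illustration of by-value serialization (not NDIF's actual implementation — the function name is taken from the example above): the function's source text travels with the request and is re-executed on the other side.

```python
# Client side: the function is captured as source text...
src = "def my_analysis(x):\n    return x * 2\n"

# Server side: ...and rebuilt by executing that source,
# with no matching package installed there.
namespace = {}
exec(src, namespace)
print(namespace["my_analysis"](21))  # 42
```

Because only source text crosses the wire, the rebuilt function is independent of the client's module layout and Python version.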
Print Statements¶
print() inside a remote trace appears as LOG status messages — useful for debugging without saving intermediate values.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    logits = model.lm_head.output
    pred = logits[0, -1].argmax(dim=-1)
    print(f"Predicted token id: {pred}")
    decoded = model.tokenizer.decode(pred).save()

print(f"Result: {decoded}")
Result: Paris
Usage Limits¶
Each request has a one-hour time limit; if your intervention exceeds it, the request is terminated.
We're still figuring out the best way to configure the system for fairness across all users. If you run into limits or have feedback, let us know on the NDIF Discord.