Remote Execution#
NDIF#
NDIF is the complementary service to nnsight that allows users to perform interventions without loading models locally or requiring a GPU. Through NDIF, users get remote access to a variety of models across a wide range of sizes (up to 400+ billion parameters!).
You can visit https://nnsight.net/status/ to check which models are currently being served, or use the nnsight API to get the current status of the backend.
[1]:
import nnsight
nnsight.ndif_status()
[1]:
NDIF Service: Up 🟢
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model Class   ┃ Repo ID                            ┃ Revision ┃ Type       ┃ Status       ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ LanguageModel │ meta-llama/Llama-3.1-8B-Instruct   │ main     │ Pilot-Only │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-3.1-405B          │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-3.1-70B-Instruct  │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-3.3-70B-Instruct  │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-3.1-70B           │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ openai-community/gpt2              │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ allenai/Olmo-3-1025-7B             │ main     │ Pilot-Only │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-3.1-8B            │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-2-7b-hf           │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ EleutherAI/gpt-j-6b                │ main     │ Dedicated  │ RUNNING      │
│ LanguageModel │ meta-llama/Llama-3.1-405B-Instruct │ main     │ Scheduled  │ NOT DEPLOYED │
└───────────────┴────────────────────────────────────┴──────────┴────────────┴──────────────┘
This status result shows that NDIF is currently up and running with multiple models ready for use:
Dedicated models indicate deployments that are meant to be permanently served.
Scheduled models are rotating models that will be served following a deployment schedule, which can be found at NDIF schedule.
Pilot-Only models are reserved for our Hot-Swapping program participants and require additional permissions.
You can also check the availability of an individual model:
[2]:
nnsight.is_model_running("meta-llama/Llama-3.1-8B")
[2]:
True
API Key#
To access remote functionality on NDIF, you need to claim an API key at https://login.ndif.us.
Once you have set up your account, paste your API key into the code below to save it for all future use.
[8]:
from nnsight import CONFIG
CONFIG.set_default_api_key("<your api key>")
(Note: Make sure your HF_TOKEN is set in your environment. This step is required for both local and remote execution. More info at: https://huggingface.co/docs/huggingface_hub/en/quick-start)
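For example, one minimal way to expose the token to your process (a sketch; the placeholder value is yours to fill in, and huggingface-cli login works just as well):
import os
# Set the Hugging Face token for this process; required for gated models such as Llama.
os.environ["HF_TOKEN"] = "<your huggingface token>"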
Compatibility Requirements#
NDIF requests require the client to use Python == 3.12.*.
As of 11/11/25, nnsight >= 0.5.13 is the minimum version supported for remote execution.
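A quick client-side sanity check, as a minimal sketch (importlib.metadata is just one way to read the installed version):
import sys
from importlib.metadata import version

# Verify the interpreter and nnsight versions meet the requirements above.
assert sys.version_info[:2] == (3, 12), f"Python 3.12 required, found {sys.version.split()[0]}"
print("nnsight version:", version("nnsight"))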
Remote Model Access with NDIF#
We saw in the previous section that meta-llama/Llama-3.1-8B is deployed and running on NDIF, so let’s try it out.
First, let’s instantiate the model. Don’t worry! This does not actually load any weights; it only loads a skeleton of the model on the meta device. This lets you define your interventions on the model’s components before they are packaged and sent to the remote backend for execution.
[5]:
from nnsight import LanguageModel
model = LanguageModel("meta-llama/Llama-3.1-8B")
print("Device:", model.device)
print(model)
Device: meta
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((4096,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=128256, bias=False)
(generator): Generator(
(streamer): Streamer()
)
)
That was quick! Locally, downloading the weights and loading the model onto a GPU would normally take around 10 minutes, assuming you have enough resources in the first place.
Now, let’s run our remote execution with nnsight. This is as simple as passing remote=True to your tracing context:
[7]:
assert nnsight.is_model_running("meta-llama/Llama-3.1-8B"), "Model is not running"
with model.trace("The Eiffel Tower is located in the city of", remote=True):
logit = model.lm_head.output[0][-1].argmax(dim=-1).save()
print("Prediction:", model.tokenizer.decode(logit))
[2025-11-12 17:30:09] [c83736b6-8975-4d38-9dd7-ee50f618bd0c] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:30:09] [c83736b6-8975-4d38-9dd7-ee50f618bd0c] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:30:09] [c83736b6-8975-4d38-9dd7-ee50f618bd0c] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:30:09] [c83736b6-8975-4d38-9dd7-ee50f618bd0c] RUNNING : Your job has started running.
[2025-11-12 17:30:10] [c83736b6-8975-4d38-9dd7-ee50f618bd0c] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.64k/1.64k [00:00<00:00, 13.3MB/s]
Prediction: Paris
Your request was executed remotely on the NDIF backend!
The logs indicate the stages that your remote request goes through.
RECEIVED: your request has been validated against an authorized API key and a supported python and nnsight version.
QUEUED: each model deployment maintains an independent queue that processes requests sequentially (FIFO). This status reports your request’s position in the queue until it is next in line to be executed.
DISPATCHED: the request is being forwarded to the model deployment.
RUNNING: your request is being interleaved with the model execution.
COMPLETED: the request finished executing and the saved results are available to download.
You can disable these logs by doing:
from nnsight import CONFIG
CONFIG.APP.REMOTE_LOGGING = False
Now letβs discover more remote functionalities and best practices.
Remote Model Parameters#
To access a specific revision of a model deployed on NDIF, pass the revision name when instantiating the model, just as you would with the transformers library.
[8]:
model = LanguageModel("meta-llama/Llama-3.1-8B", revision="main")
You can also use the nnsight renaming feature with remote models, the same way it works locally:
[9]:
model = LanguageModel("meta-llama/Llama-3.1-8B", rename={"lm_head": "unembed"})
with model.trace("The Eiffel Tower is located in the city of", remote=True):
logit = model.unembed.output[0][-1].argmax(dim=-1).save() # access lm_head as unembed
print("Prediction:", model.tokenizer.decode(logit))
[2025-11-12 17:30:21] [c0542812-dcef-4e1c-ac1e-57554795896f] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:30:21] [c0542812-dcef-4e1c-ac1e-57554795896f] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:30:21] [c0542812-dcef-4e1c-ac1e-57554795896f] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:30:21] [c0542812-dcef-4e1c-ac1e-57554795896f] RUNNING : Your job has started running.
[2025-11-12 17:30:22] [c0542812-dcef-4e1c-ac1e-57554795896f] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.64k/1.64k [00:00<00:00, 13.1MB/s]
Prediction: Paris
Saving Results#
As with local nnsight execution, calling .save() on desired objects is required to access them outside of the remote tracing context.
The distinction is that during remote execution, calling .save() is essential for fetching all desired results, including updated states.
Letβs take a look at this use case:
[12]:
logits_l = list()
with model.generate("Madison Square Garden is located in the city of", min_new_tokens=5, max_new_tokens=5, remote=True) as tracer:
with tracer.all():
logits_l.append(model.lm_head.output[0].save())
print(f"List length is {len(logits_l)}")
assert len(logits_l) == 5, f"List length is {len(logits_l)}"
[2025-11-12 17:30:52] [95e28585-e9f1-4852-ac5b-517056c58898] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:30:52] [95e28585-e9f1-4852-ac5b-517056c58898] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:30:52] [95e28585-e9f1-4852-ac5b-517056c58898] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:30:53] [95e28585-e9f1-4852-ac5b-517056c58898] RUNNING : Your job has started running.
[2025-11-12 17:30:53] [95e28585-e9f1-4852-ac5b-517056c58898] LOG : List length is 5
[2025-11-12 17:30:53] [95e28585-e9f1-4852-ac5b-517056c58898] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.32k/1.32k [00:00<00:00, 10.4MB/s]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[12], line 8
4 logits_l.append(model.lm_head.output[0].save())
6 print(f"List length is {len(logits_l)}")
----> 8 assert len(logits_l) == 5, f"List length is {len(logits_l)}"
AssertionError: List length is 0
Although the list is populated correctly during the interleaved execution on NDIF, the state of the local logits_l list is not updated after completion.
To achieve the desired results, we recommend instantiating objects to be mutated directly in the tracing context and saving them so they can be accessed later on.
[13]:
with model.generate("Madison Square Garden is located in the city of", min_new_tokens=5, max_new_tokens=5, remote=True) as tracer:
logits_l = list().save()
with tracer.all():
logits_l.append(model.lm_head.output[0].save())
assert len(logits_l) == 5, f"List length is {len(logits_l)}"
[2025-11-12 17:31:04] [1ad5e082-5a3a-4a5f-964d-9da42340ec03] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:31:04] [1ad5e082-5a3a-4a5f-964d-9da42340ec03] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:31:04] [1ad5e082-5a3a-4a5f-964d-9da42340ec03] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:31:04] [1ad5e082-5a3a-4a5f-964d-9da42340ec03] RUNNING : Your job has started running.
[2025-11-12 17:31:05] [1ad5e082-5a3a-4a5f-964d-9da42340ec03] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.29M/1.29M [00:01<00:00, 1.09MB/s]
Saving Tensors#
When saving torch.Tensor objects, the best practice is to move them to CPU memory (and detach them if applicable) before saving, to minimize download size and latency.
[22]:
with model.trace("The Eiffel Tower is located in the city of", remote=True):
logit = model.lm_head.output.detach().cpu().save()
[2025-11-12 17:59:49] [dde5c906-fb30-45e8-bdf0-369a28e657ee] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:59:49] [dde5c906-fb30-45e8-bdf0-369a28e657ee] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:59:49] [dde5c906-fb30-45e8-bdf0-369a28e657ee] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:59:50] [dde5c906-fb30-45e8-bdf0-369a28e657ee] RUNNING : Your job has started running.
[2025-11-12 17:59:50] [dde5c906-fb30-45e8-bdf0-369a28e657ee] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 3.08M/3.08M [00:09<00:00, 330kB/s]
Sessions#
Sometimes your experiment implements an intervention scheme across multiple forward passes. Having to download results from one trace just to use them in another remote trace call would be slow and inefficient. Another bottleneck can come from high server usage, which causes your subsequent requests to be queued with long wait times in between.
[17]:
with model.trace("Megan Rapinoe plays the sport of", remote=True):
hs = model.model.layers[5].output[:, 5, :].save()
with model.trace("Shaquille O'Neal plays the sport of", remote=True):
out_clean = model.lm_head.output[0][-1].argmax(dim=-1).save()
with model.trace("Shaquille O'Neal plays the sport of", remote=True):
model.model.layers[5].output[:, 6, :] = hs # patching
out_patched = model.lm_head.output[0][-1].argmax(dim=-1).save()
print("Clean Prediction:", model.tokenizer.decode(out_clean))
print("Corrupted Prediction:", model.tokenizer.decode(out_patched))
[2025-11-12 17:48:56] [498ff3ef-1457-4777-b398-f909e06a6f53] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:48:56] [498ff3ef-1457-4777-b398-f909e06a6f53] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:48:56] [498ff3ef-1457-4777-b398-f909e06a6f53] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:48:57] [498ff3ef-1457-4777-b398-f909e06a6f53] RUNNING : Your job has started running.
[2025-11-12 17:48:57] [498ff3ef-1457-4777-b398-f909e06a6f53] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 83.6k/83.6k [00:00<00:00, 1.13MB/s]
[2025-11-12 17:48:57] [4c285298-bac7-430c-ab38-b3e130a9c9b6] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:48:57] [4c285298-bac7-430c-ab38-b3e130a9c9b6] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:48:58] [4c285298-bac7-430c-ab38-b3e130a9c9b6] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:48:58] [4c285298-bac7-430c-ab38-b3e130a9c9b6] RUNNING : Your job has started running.
[2025-11-12 17:48:58] [4c285298-bac7-430c-ab38-b3e130a9c9b6] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.64k/1.64k [00:00<00:00, 13.3MB/s]
[2025-11-12 17:48:59] [e52f9955-2563-4d87-b981-ae607c429ece] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:48:59] [e52f9955-2563-4d87-b981-ae607c429ece] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:48:59] [e52f9955-2563-4d87-b981-ae607c429ece] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:48:59] [e52f9955-2563-4d87-b981-ae607c429ece] RUNNING : Your job has started running.
[2025-11-12 17:49:00] [e52f9955-2563-4d87-b981-ae607c429ece] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.64k/1.64k [00:00<00:00, 8.71MB/s]
Clean Prediction: basketball
Corrupted Prediction: soccer
To improve this experience, the session API offers an elegant solution in the form of a context manager in which you can define multiple traces; each trace represents a single forward pass.
[16]:
with model.session(remote=True):
with model.trace("Megan Rapinoe plays the sport of"):
hs = model.model.layers[5].output[:, 5, :] # no need to save
# you can add code here
with model.trace() as tracer:
with tracer.invoke("Shaquille O'Neal plays the sport of"):
out_clean = model.lm_head.output[0][-1].argmax(dim=-1).save()
with tracer.invoke("Shaquille O'Neal plays the sport of"):
model.model.layers[5].output[:, 6, :] = hs # patching
out_patched = model.lm_head.output[0][-1].argmax(dim=-1).save()
# you can add code here
print("Clean Prediction:", model.tokenizer.decode(out_clean))
print("Corrupted Prediction:", model.tokenizer.decode(out_patched))
[2025-11-12 17:31:57] [41436a2b-3e8f-45a7-8a02-7af02a35fe0b] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 17:31:57] [41436a2b-3e8f-45a7-8a02-7af02a35fe0b] QUEUED : Moved to position 1 in Queue.
[2025-11-12 17:31:57] [41436a2b-3e8f-45a7-8a02-7af02a35fe0b] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 17:31:58] [41436a2b-3e8f-45a7-8a02-7af02a35fe0b] RUNNING : Your job has started running.
[2025-11-12 17:31:59] [41436a2b-3e8f-45a7-8a02-7af02a35fe0b] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 1.89k/1.89k [00:00<00:00, 17.1MB/s]
Clean Prediction: basketball
Corrupted Prediction: soccer
This functionality guarantees that your forward passes are executed without any wait time or interruption from other user requests. In addition, results from earlier traces can be referenced by later ones within the session context without having to save them; a session counts as a single remote request.
Finally, you can also define arbitrary interventions or code between and after traces, which can be handy for processing results from the model and saving only the final relevant results to the client.
Note that you only need to pass remote=True to the outer session context to execute the whole request remotely.
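As a sketch of that last point (reusing the prompts and patching setup above), you can post-process results after the traces and save only a small final value:
with model.session(remote=True):
    with model.trace("Megan Rapinoe plays the sport of"):
        hs = model.model.layers[5].output[:, 5, :]   # no .save() needed inside a session

    with model.trace("Shaquille O'Neal plays the sport of"):
        clean_logits = model.lm_head.output[0][-1]

    with model.trace("Shaquille O'Neal plays the sport of"):
        model.model.layers[5].output[:, 6, :] = hs   # patching, as above
        patched_logits = model.lm_head.output[0][-1]

    # arbitrary code after the traces still runs remotely within the same request,
    # so only this small scalar needs to be downloaded
    logit_shift = (patched_logits - clean_logits).abs().max().cpu().save()

print("Max logit shift:", logit_shift)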
Accessing Gradients Remotely#
Gradients are disabled on the remote model deployments by default. To enable gradients and the associated functionality, set requires_grad = True at the earliest point of interest in the model’s computation graph.
[ ]:
with model.trace("The Eiffel Tower is located in the city of", remote=True):
hs_3 = model.model.layers[3].output
hs_3.requires_grad = True
hs_5 = model.model.layers[5].output
logits = model.output.logits
# .backward() context
with logits.sum().backward():
logits_grad = logits.grad.save()
hs_5_grad = hs_5.grad.save()
hs_3_grad = hs_3.grad.save()
[2025-11-12 18:07:14] [30d78466-428c-4beb-96ea-33de0747c1cb] RECEIVED : Your job has been received and is waiting to be queued.
[2025-11-12 18:07:14] [30d78466-428c-4beb-96ea-33de0747c1cb] QUEUED : Moved to position 1 in Queue.
[2025-11-12 18:07:14] [30d78466-428c-4beb-96ea-33de0747c1cb] DISPATCHED : Your job has been sent to the model deployment.
[2025-11-12 18:07:14] [30d78466-428c-4beb-96ea-33de0747c1cb] RUNNING : Your job has started running.
[2025-11-12 18:07:15] [30d78466-428c-4beb-96ea-33de0747c1cb] COMPLETED : Your job has been completed.
Downloading result: 100%|██████████| 199k/199k [00:00<00:00, 431kB/s]
Calling Python Modules#
nnsight >= 0.5.* is built on a paradigm that achieves deferred execution of arbitrary Python code, interleaved with the execution of the model being traced.
For remote execution on the NDIF backend, this makes a secure, protected code-execution environment crucial. Part of our security protocol is maintaining a whitelist of approved Python modules and packages; these packages can be referenced and called in your tracing interventions.
Whitelisted modules as of 10.28.25:
builtins, torch, collections, time, numpy, sympy, nnterp, math, einops, typing
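As a small illustration (a sketch assuming the meta-llama/Llama-3.1-8B deployment above), calls into whitelisted modules such as torch are deferred and executed as part of the remote trace:
import torch

with model.trace("The Eiffel Tower is located in the city of", remote=True):
    hs = model.model.layers[5].output          # layer-5 hidden states
    norms = torch.linalg.norm(hs, dim=-1)      # per-token L2 norms, computed remotely
    norms = norms.detach().cpu().save()        # download only the small result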
Referencing Local Modules#
If you structure your experiment into multiple files and modules, you need to register them so they can be pickled along with your trace code and your interventions can be fully executed remotely.
import cloudpickle
import your_module  # the local module containing your helper code
cloudpickle.register_pickle_by_value(your_module)
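For instance, a hypothetical local module can then be used inside a remote trace once it is registered (my_helpers and top_token below are placeholder names, not part of nnsight):
import cloudpickle
import my_helpers  # hypothetical local module, e.g. my_helpers.py next to this script

cloudpickle.register_pickle_by_value(my_helpers)

with model.trace("The Eiffel Tower is located in the city of", remote=True):
    # my_helpers.top_token is a placeholder helper, e.g. returning logits.argmax(dim=-1)
    tok = my_helpers.top_token(model.lm_head.output[0][-1]).save()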
Limitations#
Our system is in active development, so please be prepared for system outages, updates, and wait times. NDIF is running on the NCSA Delta cluster, so our services will be down during any of their planned and unplanned outages.
We currently have job length timeouts in place to ensure equitable model access between users.
Maximum Job Run Time: 1 hour
Jobs violating these parameters will be automatically denied or aborted. Please plan your experiments accordingly. You can also reach out to our team at info@ndif.us if you have a special research case and would like to request any changes!