Remote Execution#

Summary#

NNsight is integrated with NDIF's remote backend, allowing you to run experiments on large models without needing to host them yourself! All you need to do is register for an API key (https://login.ndif.us) and confirm that the HuggingFace model you want to use is hosted on NDIF (check out the live status of our remote models: https://nnsight.net/status/).

Accessing models remotely in NDIF is as simple as setting remote=True when you enter the tracing context. You can keep the same code you use in local experiments and run it remotely instead!

from nnsight import LanguageModel
# We'll never actually load the parameters so no need to specify a device_map.
model = LanguageModel("Model_from_HuggingFace")

# All we need to do to run on NDIF instead of locally is pass remote=True.
with model.trace("The Eiffel Tower is in the city of", remote=True):
    hidden_states = model.model.layers[-1].output.save()

When to Use#

NDIF hosts a selection of models for remote access using NNsight. If you are low on compute or would like to scale your experiment to large models like Llama-405B, you can run experiments on NDIF remote models!

How to Use#

1. Configure NDIF API Key#

To access remote models, NDIF requires an API key. To get one, simply sign up at https://login.ndif.us. Because NDIF's hardware resources are funded by the US National Science Foundation (NSF), NDIF usage is restricted to US residents and their collaborators. If you have questions about your eligibility, please email info@ndif.us.

Once you have a valid NDIF API key, you can then configure NNsight as follows:

[ ]:
from nnsight import CONFIG

CONFIG.set_default_api_key("YOUR_API_KEY")

If you are running NNsight locally, this saves the API key as the default in a config file alongside your nnsight installation, meaning that this command only needs to be run once.

Configuring your NNsight API key in Google Colab

If you are running in Google Colab, you will need to configure your workspace every time, so we recommend adding your API key as a Colab Secret. Check out this article on configuring secrets.

Once you’ve saved your API key as NDIF_API in Colab Secrets, you can add this code to your notebooks for remote NDIF access on Colab.

# Include your NDIF API key in Colab Secrets as NDIF_API.
from google.colab import userdata
from nnsight import CONFIG

NDIF_API = userdata.get('NDIF_API')
CONFIG.set_default_api_key(NDIF_API)
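
If you are not working in Colab, a similar pattern keeps the key out of your source code by reading it from an environment variable. A minimal sketch (the variable name NDIF_API_KEY is our choice, not an NDIF convention):

import os

from nnsight import CONFIG

# Read the key from an environment variable so it never
# appears in the notebook or script itself.
CONFIG.set_default_api_key(os.environ["NDIF_API_KEY"])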

2. Access NDIF-hosted Models Remotely#

Now that our API key is configured, let's run a small experiment on Llama-3.1-70B remotely!

[ ]:
import os

# Llama 3.1 70B is a gated model; you need access to it via your Hugging Face token.
os.environ['HF_TOKEN'] = "YOUR_HUGGING_FACE_TOKEN"
[ ]:
from nnsight import LanguageModel
# We'll never actually load the parameters so no need to specify a device_map.
llama = LanguageModel("meta-llama/Meta-Llama-3.1-70B")

# All we need to do to run on NDIF instead of locally is pass remote=True.
with llama.trace("The Eiffel Tower is in the city of", remote=True) as tracer:

    hidden_states = llama.model.layers[-1].output.save()

    output = llama.output.save()

print(hidden_states)

print(output["logits"])
2024-08-30 07:11:21,150 MainProcess nnsight_remote INFO     36ff46f0-d81a-4586-b7e7-eaf6f97d6c0b - RECEIVED: Your job has been received and is waiting approval.
2024-08-30 07:11:21,184 MainProcess nnsight_remote INFO     36ff46f0-d81a-4586-b7e7-eaf6f97d6c0b - APPROVED: Your job was approved and is waiting to be run.
2024-08-30 07:11:21,206 MainProcess nnsight_remote INFO     36ff46f0-d81a-4586-b7e7-eaf6f97d6c0b - RUNNING: Your job has started running.
2024-08-30 07:11:21,398 MainProcess nnsight_remote INFO     36ff46f0-d81a-4586-b7e7-eaf6f97d6c0b - COMPLETED: Your job has been completed.
Downloading result:   0%|          | 0.00/9.48M [00:00<?, ?B/s]
Downloading result: 100%|██████████| 9.48M/9.48M [00:02<00:00, 3.21MB/s]
(tensor([[[ 5.4688, -4.9062,  2.2344,  ..., -3.6875,  0.9609,  1.2578],
         [ 1.5469, -0.6172, -1.4531,  ..., -1.1562, -0.1406, -2.1250],
         [ 1.7812, -1.8906, -1.1875,  ...,  0.1680,  0.9609,  0.5625],
         ...,
         [ 0.9453, -0.3711,  1.3516,  ...,  1.3828, -0.7969, -1.9297],
         [-0.8906,  0.3672,  0.2617,  ...,  2.4688, -0.4414, -0.6758],
         [-1.6094,  1.0938,  1.7031,  ...,  1.8672, -1.1328, -0.5000]]],
       dtype=torch.bfloat16), DynamicCache())
tensor([[[ 6.3750,  8.6250, 13.0000,  ..., -4.1562, -4.1562, -4.1562],
         [-2.8594, -2.2344, -3.0938,  ..., -8.6250, -8.6250, -8.6250],
         [ 8.9375,  3.5938,  4.5000,  ..., -3.9375, -3.9375, -3.9375],
         ...,
         [ 3.5781,  3.4531,  0.0796,  ..., -6.5625, -6.5625, -6.5625],
         [10.8750,  6.4062,  4.9375,  ..., -4.0000, -4.0000, -3.9844],
         [ 7.2500,  6.1562,  3.5156,  ..., -4.7188, -4.7188, -4.7188]]])

It really is as simple as remote=True! All of the techniques available in NNsight locally work just the same when running remotely.
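
For example, an intervention written exactly as it would be locally runs remotely unchanged. A minimal sketch (the choice of layer and the zero-ablation are ours, purely for illustration):

# Zero-ablate one MLP output during a remote forward pass.
with llama.trace("The Eiffel Tower is in the city of", remote=True):
    # Overwrite the MLP output of layer 40 with zeros (an arbitrary choice).
    llama.model.layers[40].mlp.output[:] = 0

    ablated = llama.output.save()

print(ablated["logits"].argmax(dim=-1))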

3. Streamlining Remote Experiments with Sessions#

NDIF uses a queue to handle concurrent requests from multiple users. To optimize the execution of our experiments, we can use the session context to package multiple interventions together as a single request to the server.

This offers the following benefits:

  1. All interventions within a session are executed one after another, without additional waiting in the queue.

  2. All intermediate outputs of each intervention are stored on the server and can be accessed by other interventions in the same session, without moving data back and forth between NDIF and your local machine.

Let’s take a look:

[ ]:
with llama.session(remote=True) as session:

    with llama.trace("The Eiffel Tower is in the city of") as t1:
        # Capture the hidden state from layer 79 at the last token.
        hs_79 = llama.model.layers[79].output[0][:, -1, :]  # no .save() needed within a session
        t1_tokens_out = llama.lm_head.output.argmax(dim=-1).save()

    with llama.trace("Buckingham Palace is in the city of") as t2:
        # Inject the layer-79 hidden state from t1 into layer 1 of this trace.
        llama.model.layers[1].output[0][:, -1, :] = hs_79[:]
        t2_tokens_out = llama.lm_head.output.argmax(dim=-1).save()

print("\nT1 - Original Prediction: ", llama.tokenizer.decode(t1_tokens_out[0][-1]))
print("T2 - Modified Prediction: ", llama.tokenizer.decode(t2_tokens_out[0][-1]))

In the example above, we capture the hidden state of a late layer (layer 79) in the first trace and use it to overwrite the output of an early layer in the second trace. Since we are using a session, we don't have to save the hidden state from Tracer 1 to reference it in Tracer 2.

It is important to note that all traces defined within the session context are executed sequentially, strictly following their order of definition (i.e., t2 is executed after t1).

The session context also provides ways to log values and to terminate execution early.

[ ]:
import nnsight

with llama.session(remote=True) as session:

    nnsight.log("-- Early Stop --")
    # Terminate execution here; nothing after this call runs.
    nnsight.stop()
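
Because traces in a session run strictly in their order of definition, nnsight.log can also be used to confirm that ordering. A minimal sketch:

import nnsight

with llama.session(remote=True) as session:

    with llama.trace("The Eiffel Tower is in the city of"):
        nnsight.log("t1 ran first")

    with llama.trace("Buckingham Palace is in the city of"):
        nnsight.log("t2 ran second")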

In addition to the benefits mentioned above, the session context enables experiments that are not possible with other nnsight tools. Since every trace runs on its own model, a single session can apply interventions across different models: for example, we can swap activations between the vanilla and instruct versions of a Llama model and compare the outputs. Sessions can also be used to run experiments entirely locally!
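
A sketch of that cross-model idea, assuming both the base and instruct variants are hosted on NDIF (check the status page) and that their traces can share one session as described above:

from nnsight import LanguageModel

base = LanguageModel("meta-llama/Meta-Llama-3.1-70B")
instruct = LanguageModel("meta-llama/Meta-Llama-3.1-70B-Instruct")

with base.session(remote=True) as session:

    # Capture a hidden state from the base model...
    with base.trace("The Eiffel Tower is in the city of"):
        base_hs = base.model.layers[40].output[0][:, -1, :]  # no .save() needed

    # ...and inject it into the instruct model within the same session.
    with instruct.trace("The Eiffel Tower is in the city of"):
        instruct.model.layers[40].output[0][:, -1, :] = base_hs[:]
        swapped_out = instruct.lm_head.output.argmax(dim=-1).save()

print(instruct.tokenizer.decode(swapped_out[0][-1]))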

4. Remote Model Considerations & System Limits#

To view currently hosted models, please visit https://nnsight.net/status/. All models except meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct are currently available for public access. If you are interested in running an experiment on Llama-3.1-405B, please reach out to us at info@ndif.us.

Our system is under active development, so please be prepared for system outages, updates, and wait times. NDIF runs on NCSA Delta, so our services will be down during any of its planned or unplanned outages.

We currently have a job run-time limit in place to ensure equitable model access between users.

  • Maximum Job Run Time: 1 hour

Jobs exceeding this limit will be automatically denied or aborted, so please plan your experiments accordingly. You can also reach out to our team at info@ndif.us if you have a special research case and would like to request changes!
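
If an experiment risks exceeding the limit, one simple pattern is to split it into several smaller traces, each submitted (and timed) as its own request. A minimal sketch, reusing the llama model from above:

prompts = [
    "The Eiffel Tower is in the city of",
    "Buckingham Palace is in the city of",
]

hidden_states = []

# Each trace is its own job, so no single request approaches the limit.
for prompt in prompts:
    with llama.trace(prompt, remote=True):
        hidden_states.append(llama.model.layers[-1].output[0][:, -1, :].save())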