# GPUModelRunner
## NNsightGPUModelRunner

Bases: `GPUModelRunner`
### NNsightRequestHelper

Helper class for batching requests in the GPUModelRunner.
| Attribute | Description |
|---|---|
| `ids_to_batch_group` | Dictionary mapping request IDs to their assigned batch group indices. |
| `interleaver_to_ids` | Dictionary mapping interleavers to sets of request IDs. |
| `flat_batch_groups` | Dictionary mapping interleavers to their flattened batch groups. |
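The three tracking attributes above can be pictured as plain Python data structures. The following is an illustrative sketch only: `Interleaver` objects are stood in for by strings, and all keys and values shown are hypothetical examples, not the library's actual state.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in: the real keys are Interleaver objects, not strings.
InterleaverKey = str

# Request IDs mapped to the index of the batch group they were assigned to.
ids_to_batch_group: Dict[str, int] = {"req-0": 0, "req-1": 0, "req-2": 1}

# Each interleaver tracks the set of request IDs it is responsible for.
interleaver_to_ids: Dict[InterleaverKey, Set[str]] = {
    "interleaver-a": {"req-0", "req-1", "req-2"},
}

# Flattened batch groups: (start_position, size) token ranges per interleaver.
flat_batch_groups: Dict[InterleaverKey, List[Tuple[int, int]]] = {
    "interleaver-a": [(0, 5), (5, 3)],
}
```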
| Method | Description |
|---|---|
| `process_new_reqs` | Process new requests and compute the flat batch groups. |
| `process_finished_req` | Process a finished request by updating batch groups and cleaning up mappings. |
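The cleanup that `process_finished_req` is described as doing — updating batch groups and cleaning up mappings — can be sketched as follows. This is a minimal illustration under assumed semantics, not the library's implementation; the function name, signature, and use of strings in place of `Interleaver` objects are all hypothetical.

```python
from typing import Dict, Set

def cleanup_finished_request(
    req_id: str,
    interleaver: str,
    ids_to_batch_group: Dict[str, int],
    interleaver_to_ids: Dict[str, Set[str]],
) -> None:
    """Hypothetical sketch: drop a finished request from both tracking
    dictionaries, removing the interleaver entry entirely once no
    remaining requests reference it."""
    ids_to_batch_group.pop(req_id, None)
    ids = interleaver_to_ids.get(interleaver)
    if ids is not None:
        ids.discard(req_id)
        if not ids:
            # No live requests left for this interleaver; forget it.
            del interleaver_to_ids[interleaver]
```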
#### process_new_reqs

`process_new_reqs(new_reqs: List[NewRequestData], model: VLLM) -> None`
Process new requests and organize them into batch groups for execution.
This method handles the batching logic for new requests, organizing them into appropriate batch groups based on their interleaver's batching strategy.
| Parameter | Type | Description |
|---|---|---|
| `new_reqs` | `List[NewRequestData]` | List of new request data objects to process. Each request contains sampling parameters with an associated interleaver that defines the batching behavior. |
**Notes:**

- Resets the `flat_batch_groups` dictionary at the start.
- For interleavers that require batching, requests are assigned to batch groups.
- Batch groups are tuples of `(start_position, size)` indicating token ranges.
- Updates internal tracking dictionaries for the request-to-batch-group mapping.
- Advances to the next batch group when the current group's capacity is exceeded.
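The grouping behavior described in the notes above can be sketched in a few lines. This is a simplified model under stated assumptions — a fixed per-group token capacity, integer request IDs, and hypothetical names throughout — not the library's actual batching code.

```python
from typing import Dict, List, Tuple

def assign_batch_groups(
    request_sizes: List[int], group_capacity: int
) -> Tuple[Dict[int, int], List[Tuple[int, int]]]:
    """Hypothetical sketch: assign each request to a batch group, where a
    group is a (start_position, size) tuple over the flattened token range.
    Advance to a new group once adding the next request would exceed the
    current group's capacity."""
    ids_to_group: Dict[int, int] = {}
    groups: List[Tuple[int, int]] = []
    start, used, group_idx = 0, 0, 0
    for req_id, size in enumerate(request_sizes):
        if used > 0 and used + size > group_capacity:
            # Close the current group and open the next one.
            groups.append((start, used))
            start += used
            used = 0
            group_idx += 1
        ids_to_group[req_id] = group_idx
        used += size
    if used:
        groups.append((start, used))
    return ids_to_group, groups
```

For example, three requests of 2, 3, and 4 tokens with a capacity of 5 yield two groups: the first two requests share group 0, and the third starts group 1.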
### execute_model

`execute_model(scheduler_output: SchedulerOutput, intermediate_tensors: Optional[IntermediateTensors] = None)`