Profiling & Performance¶

Enable per-node runtime profiling to identify bottlenecks in your pipeline.

Overview¶

CUVIS.AI includes opt-in, manual profiling that wraps each node.forward() call with high-resolution timers (time.perf_counter_ns()). Profiling is configured on the pipeline object and works transparently with both Predictor (inference) and GradientTrainer (training), since both call pipeline.forward() internally.

Key characteristics:

Zero overhead when disabled (single boolean check per node)
Cumulative — stats accumulate across all pipeline.forward() calls until explicit reset
Per-node, per-stage — timings are keyed by (execution_stage, node_name)
Constant memory — online Welford mean/std + P² approximate median, no sample history stored
Thread-safe — safe for concurrent gRPC requests on the same session

Quick Start¶

from cuvis_ai_core.training.predictor import Predictor

# 1. Enable profiling on the pipeline
pipeline.set_profiling(enabled=True, skip_first_n=3)

# 2. Run your workload through Predictor or GradientTrainer
predictor = Predictor(pipeline=pipeline, datamodule=datamodule)
predictor.predict(max_batches=350)

# 3. Print the formatted summary
print(pipeline.format_profiling_summary(total_frames=350))

Example output:

Profiling Summary (350 frames, skip_first_n=3)
Node                                     Stage        Count   Mean(ms)    Std(ms)    Min(ms)    Max(ms) Median(ms)   Total(s)
-----------------------------------------------------------------------------------------------------------------------------
sam3_tracker                             inference      347     895.08     145.43     487.49    1747.95     891.04    310.591
tracking_coco_json                       inference      347      11.60       2.47       7.63      20.38      11.18      4.025
overlay                                  inference      347      10.23       2.19       6.27      25.60       9.68      3.551
to_video                                 inference      347       7.76       2.14       5.82      43.55       7.67      2.694
video_frame                              inference      347       0.01       0.00       0.01       0.03       0.01      0.005
-----------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                                320.867
Average per-frame pipeline time: 924.69 ms (1.1 FPS)

Profiling During Inference (Predictor)¶

from cuvis_ai_core.training.predictor import Predictor

# Enable profiling before creating the Predictor
pipeline.set_profiling(
    enabled=True,
    synchronize_cuda=(device == "cuda"),
    skip_first_n=3,
)

predictor = Predictor(pipeline=pipeline, datamodule=datamodule)
predictor.predict(max_batches=350)

# Retrieve and display results
print(pipeline.format_profiling_summary(total_frames=350))

Predictor calls pipeline.forward(context=Context(stage=INFERENCE)) for each batch, so all node timings accumulate under the "inference" stage.

Use Predictor, not raw pipeline.forward()

Predictor handles batch iteration, device transfer, node reset/close, and progress bars. Running profiling through Predictor gives you realistic end-to-end timing that includes proper warm-up and teardown behavior.

Profiling During Training (GradientTrainer)¶

from cuvis_ai_core.training.trainers import GradientTrainer
from cuvis_ai_schemas.enums import ExecutionStage

# Enable profiling before training
pipeline.set_profiling(enabled=True, skip_first_n=5)

trainer = GradientTrainer(
    pipeline=pipeline,
    datamodule=datamodule,
    loss_nodes=[loss_node],
)
trainer.fit()

# View training stage timings
print(pipeline.format_profiling_summary(stage=ExecutionStage.TRAIN))

# View validation stage timings
print(pipeline.format_profiling_summary(stage=ExecutionStage.VAL))

# View all stages combined
print(pipeline.format_profiling_summary())

GradientTrainer calls pipeline.forward() with TRAIN, VAL, or TEST execution stages depending on the training phase. Stats are accumulated per (stage, node_name) pair, so you can filter by stage to compare training vs validation performance.

skip_first_n applies per accumulator key

Each (stage, node_name) pair has its own skip counter. If you set skip_first_n=5, the first 5 training forward passes and the first 5 validation forward passes are each skipped independently.

API Reference¶

`pipeline.set_profiling()`¶

pipeline.set_profiling(
    enabled: bool,
    *,
    synchronize_cuda: bool = False,
    reset: bool = False,
    skip_first_n: int = 0,
)

Configure profiling with full-replace semantics — every call fully specifies the configuration. Omitted keyword arguments receive their defaults.

Parameter	Type	Default	Description
`enabled`	`bool`	—	Activate or deactivate profiling
`synchronize_cuda`	`bool`	`False`	Call `torch.cuda.synchronize` before/after each `node.forward()` for accurate GPU wall-clock timing
`reset`	`bool`	`False`	Discard all previously accumulated statistics
`skip_first_n`	`int`	`0`	Number of initial samples per node to discard (warm-up skip). Must be >= 0

`pipeline.get_profiling_summary()`¶

pipeline.get_profiling_summary(
    stage: ExecutionStage | None = None,
) -> list[NodeProfilingStats]

Return accumulated profiling stats as a list of frozen NodeProfilingStats dataclasses. Pass stage to filter by execution stage, or None for all stages.

`pipeline.format_profiling_summary()`¶

pipeline.format_profiling_summary(
    stage: ExecutionStage | None = None,
    *,
    total_frames: int | None = None,
) -> str

Convenience method that calls get_profiling_summary() and formats the result as a text table ready for logging or printing.

`pipeline.reset_profiling()`¶

Clear all accumulated profiling statistics.

`pipeline.profiling_enabled`¶

Read-only property returning whether profiling is currently active.

`NodeProfilingStats` dataclass¶

Each entry in the profiling summary contains:

Field	Type	Description
`node_name`	`str`	Unique node name within the pipeline
`stage`	`str`	Execution stage (e.g. `"inference"`, `"train"`)
`count`	`int`	Number of recorded samples (after skip)
`mean_ms`	`float`	Mean execution time in milliseconds
`median_ms`	`float`	Approximate median (P² estimator)
`std_ms`	`float`	Population standard deviation
`min_ms`	`float`	Minimum execution time
`max_ms`	`float`	Maximum execution time
`total_ms`	`float`	Total accumulated time
`last_ms`	`float`	Most recent sample

Understanding the Output¶

Column	Meaning
Node	The `node.name` — unique within the pipeline, assigned by `CuvisPipeline`
Stage	Execution stage (`inference`, `train`, `val`, `test`)
Count	Number of `node.forward()` calls recorded (after `skip_first_n`)
Mean(ms)	Average execution time per call
Std(ms)	Population standard deviation across all calls
Min/Max(ms)	Fastest and slowest individual calls
Median(ms)	Approximate median via P² estimator (constant memory)
Total(s)	Cumulative wall-clock time for this node (in seconds)

The TOTAL row sums all nodes' total times. The FPS line divides total pipeline time by the first node's count to estimate per-frame throughput.

gRPC Profiling¶

Profiling can also be controlled remotely via gRPC. See the gRPC API Reference for details on:

SetProfiling — enable, disable, or reconfigure profiling on a session
GetProfilingSummary — retrieve per-node profiling statistics

# Example: gRPC client enabling profiling
stub.SetProfiling(
    cuvis_ai_pb2.SetProfilingRequest(
        session_id=session_id,
        enabled=True,
        synchronize_cuda=True,
        skip_first_n=3,
    )
)

# Run inference...

# Retrieve profiling summary
response = stub.GetProfilingSummary(
    cuvis_ai_pb2.GetProfilingSummaryRequest(
        session_id=session_id,
        stage=cuvis_ai_pb2.EXECUTION_STAGE_INFERENCE,
    )
)
for stat in response.node_stats:
    print(f"{stat.node_name}: {stat.mean_ms:.2f} ms ({stat.count} calls)")

Tips¶

Use Predictor / GradientTrainer for best estimates

Always run profiling through the standard orchestrators rather than calling pipeline.forward() directly. They handle device transfer, batch iteration, and node lifecycle correctly, giving you realistic timing.

Warm-up skip for CUDA pipelines

Use skip_first_n=3 or higher for CUDA pipelines. The first few forward passes include JIT compilation and CUDA kernel caching, which inflate timings significantly.

CUDA synchronization trade-off

synchronize_cuda=True gives accurate GPU wall-clock times by forcing torch.cuda.synchronize() before and after each node. This adds overhead and disables CUDA kernel pipelining — use it for profiling, not production.

Cumulative stats and reset

Stats accumulate across all forward calls (including multiple predict() runs) until you explicitly call pipeline.reset_profiling() or set_profiling(reset=True). This is useful for aggregating across a full dataset.

Stage filtering

Use the stage parameter to compare performance across execution stages:

train_summary = pipeline.format_profiling_summary(stage=ExecutionStage.TRAIN)
val_summary = pipeline.format_profiling_summary(stage=ExecutionStage.VAL)

Profiling & Performance¶

Overview¶

Quick Start¶

Profiling During Inference (Predictor)¶

Profiling During Training (GradientTrainer)¶

API Reference¶

pipeline.set_profiling()¶

pipeline.get_profiling_summary()¶

pipeline.format_profiling_summary()¶

pipeline.reset_profiling()¶

pipeline.profiling_enabled¶

NodeProfilingStats dataclass¶