feat(graph): T99.1.2 mark Gemma4PLECombinedProducer non-capturable #92

Merged
dndungu merged 1 commit into main from e99-t99.1.2-gemma4-ple-noncapturable on Apr 16, 2026
@dndungu dndungu commented Apr 16, 2026

Summary

  • Add Gemma4PLECombinedProducer to nonCapturableOps in graph/cuda_graph.go.
  • Unblocks CUDA graph capture for zerfoo's gemma4e (edge) architecture; today
    that path runs with ZERFOO_DISABLE_CUDA_GRAPH=1 because the producer's
    CPU embedding gather plus MulScalar on the fresh CPUStorage tensor
    triggers a synchronous H2D cudaMemcpy that CUDA rejects inside a capturing
    stream.
  • Companion zerfoo change (E99 T99.1.2) pre-slices the producer's full-width
    outputs into stable per-layer GPU buffers so pleSliceNode stays fully
    capturable; that PR will follow.
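
As a rough illustration of the first bullet, the change amounts to adding one entry to a set of op types. This is a minimal sketch, not the contents of graph/cuda_graph.go: the map type and the `isNonCapturable` helper are assumptions made for the example, and only the `nonCapturableOps` name and the `Gemma4PLECombinedProducer` entry come from this PR.

```go
package main

import "fmt"

// nonCapturableOps lists op types that must run outside the capturing stream.
// Assumption: the real set in graph/cuda_graph.go may use a different type
// and contains other entries that are not shown here.
var nonCapturableOps = map[string]bool{
	"Gemma4PLECombinedProducer": true, // the entry this PR adds
}

// isNonCapturable is a hypothetical helper for this sketch: it reports
// whether an op type has to run eagerly, before capture begins.
func isNonCapturable(opType string) bool {
	return nonCapturableOps[opType]
}

func main() {
	// "MatMul" is just an illustrative op type that is not in the set.
	for _, op := range []string{"Gemma4PLECombinedProducer", "MatMul"} {
		fmt.Printf("%s non-capturable: %v\n", op, isNonCapturable(op))
	}
}
```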

Rationale

The producer runs once per forward pass before the transformer loop, so placing
it in pre-capture leaves the layer-body capture region intact. Alternative
approaches (pure GPU gather kernel; node-level CanCapture interface) were
rejected for scope -- see zerfoo/docs/adr/088-gemma4-ple-cuda-graph-capture.md.
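
The placement argument can be sketched with a toy partition of the op sequence. The policy below (everything up to and including the last non-capturable op runs eagerly, the rest is captured) is a simplifying assumption for illustration, not the scheduler's actual logic; `splitPreCapture`, `LayerBody0`, and `LayerBody1` are hypothetical names.

```go
package main

import "fmt"

// splitPreCapture divides an ordered op list into an eager pre-capture prefix
// and a capture region. Simplifying assumption: every op up to and including
// the last non-capturable op runs eagerly; everything after it is captured.
func splitPreCapture(ops []string, nonCapturable map[string]bool) (pre, captured []string) {
	start := 0
	for i, op := range ops {
		if nonCapturable[op] {
			start = i + 1
		}
	}
	return ops[:start], ops[start:]
}

func main() {
	// The producer sits before the transformer loop, so marking it
	// non-capturable moves only the leading op out of the capture region.
	ops := []string{"Gemma4PLECombinedProducer", "pleSliceNode", "LayerBody0", "LayerBody1"}
	pre, captured := splitPreCapture(ops, map[string]bool{"Gemma4PLECombinedProducer": true})
	fmt.Println("pre-capture:", pre)
	fmt.Println("captured:   ", captured)
}
```

Because the producer is the first op in this toy sequence, the layer bodies all stay in the captured slice, which is the property the rationale relies on.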

Test plan

  • go test ./graph/ -- pass
  • go build ./... -- pass
  • Cross-repo verify on DGX via Spark after the zerfoo companion lands
    (tracked as E99 T99.1.3 in zerfoo/docs/plan.md).

Refs: zerfoo E99 (T99.1.1 / T99.1.2), ADR-088.

Gemma4PLECombinedProducer performs a CPU-side gather over the shared
PLE embedding table and then calls MulScalar on the freshly-allocated
CPUStorage tensor. Inside a CUDA graph capture stream this triggers a
synchronous H2D cudaMemcpy that CUDA rejects with "operation would make
the legacy stream depend on a capturing blocking stream".

Add the op to nonCapturableOps so the producer runs in pre-capture on
every forward pass, outside the capturing stream. The producer runs once
per forward pass before the transformer loop, so this placement keeps
the layer-body capture region intact.

Companion change in zerfoo/inference/gemma4_edge_ple_nodes.go
(E99 T99.1.2) pre-slices the producer's outputs into stable GPU
buffers so pleSliceNode stays fully capturable.
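
The pre-slicing idea can be sketched on the host with plain slices. This is not the code in gemma4_edge_ple_nodes.go: `layerBuffers`, `newLayerBuffers`, and `fill` are hypothetical names, host `[]float32` slices stand in for GPU buffers, and the full-width output is assumed to be a flat concatenation of equal-width per-layer chunks. The point it shows is that the buffers are allocated once and overwritten in place, so their addresses stay the same across forward passes, which a replayed CUDA graph needs because it reuses the pointers recorded at capture time.

```go
package main

import "fmt"

// layerBuffers holds one fixed buffer per layer. Host slices stand in for
// GPU buffers in this sketch.
type layerBuffers struct {
	width int
	bufs  [][]float32
}

// newLayerBuffers allocates every per-layer buffer exactly once.
func newLayerBuffers(numLayers, width int) *layerBuffers {
	bufs := make([][]float32, numLayers)
	for i := range bufs {
		bufs[i] = make([]float32, width)
	}
	return &layerBuffers{width: width, bufs: bufs}
}

// fill copies each layer's chunk of the full-width output into its existing
// buffer. It never reallocates, so buffer addresses are stable between calls.
func (lb *layerBuffers) fill(full []float32) error {
	want := len(lb.bufs) * lb.width
	if len(full) != want {
		return fmt.Errorf("got %d values, want %d", len(full), want)
	}
	for i, buf := range lb.bufs {
		copy(buf, full[i*lb.width:(i+1)*lb.width])
	}
	return nil
}

func main() {
	lb := newLayerBuffers(2, 3)
	if err := lb.fill([]float32{1, 2, 3, 4, 5, 6}); err != nil {
		fmt.Println("fill failed:", err)
		return
	}
	fmt.Println(lb.bufs)
}
```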

Decision recorded in zerfoo/docs/adr/088-gemma4-ple-cuda-graph-capture.md.
@dndungu dndungu merged commit 6c855a9 into main Apr 16, 2026
1 check passed
@dndungu dndungu deleted the e99-t99.1.2-gemma4-ple-noncapturable branch April 16, 2026 02:36