Restore mux scaling for SCF bandwidth and latency calculations #562

Open
charles-typ wants to merge 1 commit into facebookresearch:v2-beta from charles-typ:export-D99517233-to-v2-beta

@charles-typ
Contributor

Summary:
D92222329 removed the mux (percentage-running) scaling from the SCF memory
bandwidth and latency calculations in `generate_arm_perf_report.py`, with
the reasoning that perf stat auto-scales counter values.

While perf stat does auto-scale `counter_value`, `counter_runtime` still
reflects only the actual time the PMU was active (not the full measurement
interval). When SCF events are multiplexed (e.g., 33% mux on Grace), this
causes:
- `scf_cycles / counter_runtime` to be 3x the actual SCF frequency
- Bandwidth to be reported 3x too high
- Latency to be reported 3x too low (unrealistically fast)

Raw perf data from Grace benchmark runs confirms SCF events run at 33% mux:
```
5.004597489,163520049,,nvidia_scf_pmu_0/cmem_rd_access/,1670782496,33.00,,
```
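The sample line above can be read field by field. The sketch below is a minimal illustration, not the parser used in `generate_arm_perf_report.py`; it assumes the documented `perf stat -x, -I` CSV layout (timestamp, counter value, unit, event name, counter run time in ns, percentage of time the counter was running). The `PerfSample` class and `parse_perf_csv_line` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class PerfSample:
    """One interval row from `perf stat -x, -I` (hypothetical container)."""
    timestamp_s: float        # seconds since the start of the run
    counter_value: int        # already auto-scaled by perf stat
    unit: str                 # empty for most raw PMU events
    event: str                # e.g. nvidia_scf_pmu_0/cmem_rd_access/
    counter_runtime_ns: int   # time the counter was actually scheduled
    pct_running: float        # mux percentage (100.0 means no multiplexing)


def parse_perf_csv_line(line: str) -> PerfSample:
    """Split one CSV row into named fields; trailing metric columns are ignored."""
    fields = line.strip().split(",")
    return PerfSample(
        timestamp_s=float(fields[0]),
        counter_value=int(fields[1]),
        unit=fields[2],
        event=fields[3],
        counter_runtime_ns=int(fields[4]),
        pct_running=float(fields[5]),
    )


sample = parse_perf_csv_line(
    "5.004597489,163520049,,nvidia_scf_pmu_0/cmem_rd_access/,1670782496,33.00,,"
)
print(sample.event, sample.counter_runtime_ns, sample.pct_running)
```

Here `counter_runtime_ns` is about 1.67 s while the row's timestamp is about 5.0 s, which is consistent with the 33% figure if this row covers a single interval of roughly 5 s (an assumption, since the interval length is not stated in the PR).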

This restores the mux correction originally added in D71513380: scale
`counter_runtime` by `100 / mux` to recover the full interval duration
before computing derived metrics.

Affects: `nvidia_scf_mem_read_bw_MBps`, `nvidia_scf_mem_write_bw_MBps`,
`nvidia_scf_mem_latency_ns`.

Differential Revision: D99517233

@meta-cla meta-cla bot added the CLA Signed label Apr 4, 2026

meta-codesync bot commented Apr 4, 2026

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99517233.

meta-codesync bot pushed a commit that referenced this pull request Apr 8, 2026
Summary:
Pull Request resolved: #562

D92222329 removed the mux (percentage-running) scaling from the SCF memory
bandwidth and latency calculations in `generate_arm_perf_report.py`, with
the reasoning that perf stat auto-scales counter values.

While perf stat does auto-scale `counter_value`, `counter_runtime` still
reflects only the actual time the PMU was active (not the full measurement
interval). When SCF events are multiplexed (e.g., 33% mux on Grace), this
causes:
- `scf_cycles / counter_runtime` to be 3x the actual SCF frequency
- Bandwidth to be reported 3x too high
- Latency to be reported 3x too low (unrealistically fast)

Raw perf data from Grace benchmark runs confirms SCF events run at 33% mux:
```
5.004597489,163520049,,nvidia_scf_pmu_0/cmem_rd_access/,1670782496,33.00,,
```

This restores the mux correction originally added in D71513380: scale
`counter_runtime` by `100 / mux` to recover the full interval duration
before computing derived metrics.

Affects: `nvidia_scf_mem_read_bw_MBps`, `nvidia_scf_mem_write_bw_MBps`,
`nvidia_scf_mem_latency_ns`.

Reviewed By: b3nj1

Differential Revision: D99517233

fbshipit-source-id: 6fbd9469770ad240313b0b98ad65ab2d3ef2ed40

Labels

CLA Signed, fb-exported, meta-exported