Skip to content

Add a cancellation state to the state machine and remove syncTraceActive_#1355

Open
ryanzhang22 wants to merge 3 commits intopytorch:mainfrom
ryanzhang22:export-D100019410
Open

Add a cancellation state to the state machine and remove syncTraceActive_#1355
ryanzhang22 wants to merge 3 commits intopytorch:mainfrom
ryanzhang22:export-D100019410

Conversation

@ryanzhang22
Copy link
Copy Markdown
Contributor

Summary:
Previously, we added syncTraceActive_ so that async doesn't step into its state machine loop while sync tracing is active, as both sync and async will then simultaneously drive the profiler and cause undefined behavior.

In the previous PRs, we added isAsyncActive() to check if async owns the profiler. We still needed syncTraceActive_ at that point because isAsyncActive() is reset back to False only at the end of cancel(), so we can theoretically step into the loop in the middle of cancellation. Consider the following scenario:

  1. We get a preemption request, and the main async thread starts running cancel().
  2. The profiler thread checks isAsyncActive() to decide whether to step into performRunLoopStep.
  3. If cancel() is midway through running, isAsyncActive() still returns true -- so we step into the loop, potentially causing bad behavior.

We introduce a Cancelling state that we set at the start of cancel(). If we do end up stepping into the loop, we see this state and immediately break.

Differential Revision: D100019410

…h#1269)

Summary:

Part 1 of a refactor to improve safety and readability around sync/async profiling.

This is a straightforward (no-op) split that creates Sync/AsyncActivityProfilerHandler, which contains the methods which belonged to the ActivityProfilerController but are only used for one type of tracing. The original methods in the controller become thin wrappers that forward the request to the appropriate handler.

Differential Revision: D92550422
Summary:

Part 2 of a refactor to improve safety and readability around sync/async profiling.

Sync profiling currently only uses the profiler state machine to check whether `isActive()` is true. However that doesn't tell us if sync or async is the one that actually owns the profiler at that point, which can lead to edge cases when we make decisions about whether or not to proceed based on the profiler state. See the example below.

Since the state machine behavior is only relevant for async tracing, we move the related functions from the `GenericActivityProfiler` into the Async handler. The `GenericActivityProfiler` just becomes a set of functions used to drive profiling, without needing to worry about transitions. Each handler owns their own `isActive()` state, and `ActivityProfilerController` makes decisions about how to handle scheduling between sync and async. In particular, we add a `cancel()` function both handlers so the controller can make sure both handlers are in a good state before preempting. For the async trace, `cancel()` will also clear the async config and move the state back to `WaitForRequest`.

This change makes ownership explicit, which should hopefully make reasoning about preemptions from the controller perspective easier.

Differential Revision: D93475197
…ive_

Summary:
Previously, we added `syncTraceActive_` so that async doesn't step into its state machine loop while sync tracing is active, as both sync and async will then simultaneously drive the profiler and cause undefined behavior.

In the previous PRs, we added `isAsyncActive()` to check if async owns the profiler. We still needed `syncTraceActive_` at that point because `isAsyncActive()` is reset back to `False` only at the end of `cancel()`, so we can theoretically step into the loop in the middle of cancellation. Consider the following scenario:
1. We get a preemption request, and the main async thread starts running `cancel()`.
2. The profiler thread checks `isAsyncActive()` to decide whether to step into `performRunLoopStep`.
3. If `cancel()` is midway through running, `isAsyncActive()` still returns true -- so we step into the loop, potentially causing bad behavior.

We introduce a `Cancelling` state that we set at the start of `cancel()`. If we do end up stepping into the loop, we see this state and immediately break.

Differential Revision: D100019410
@meta-cla meta-cla bot added the cla signed label Apr 9, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Apr 9, 2026

@ryanzhang22 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100019410.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant