Handle orphan processes when running as PID 1#502

Merged
k82cn merged 1 commit into
xflops:mainfrom
k82cn:flm_501
Jun 15, 2026

Conversation

@k82cn

@k82cn k82cn commented Jun 15, 2026

When flame-executor-manager runs as the container entrypoint (PID 1), orphaned child processes are now reaped to prevent zombie accumulation.

Changes

Install dumb-init in the executor-manager container image and use it as
the entrypoint wrapper. This properly handles PID 1 responsibilities:

  • Reaps orphaned zombie processes
  • Forwards signals to child processes

Fixes #501

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an orphan process reaper module (reaper.rs) in executor_manager to clean up zombie processes when running as PID 1 in a container. It adds the process feature to the nix dependency and spawns a background task that reaps zombies on SIGCHLD signals and periodic intervals. Feedback highlights a critical correctness risk where calling waitpid(-1) globally can intercept and steal exit statuses of child processes spawned by other parts of the application, potentially causing ECHILD errors. It is recommended to use a container init system (like tini) instead, or track spawned PIDs. Additionally, it is suggested to configure the periodic interval to skip missed ticks to avoid burst behaviors.

Comment thread executor_manager/src/reaper.rs Outdated
Comment on lines +30 to +31
loop {
match waitpid(Pid::from_raw(-1), Some(WaitPidFlag::WNOHANG)) {

Priority: high

Calling waitpid with Pid::from_raw(-1) (which reaps any child process) in a global loop poses a significant correctness risk for the application.

The Problem

If flame-executor-manager or any of its dependencies (such as Tokio's process module or standard library commands) spawns child processes (e.g., shims, executors, or sidecars) and expects to wait on them to retrieve their exit status, this global reaper loop will intercept and reap those processes first. When the spawning code subsequently attempts to wait on its specific child PID, it will receive an ECHILD error, leading to unexpected failures or lost exit statuses.
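The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, not code from this PR: it declares `waitpid` by hand so it needs no external crate (the `nix` call in `reaper.rs` wraps the same libc function), and the function name `wait_after_global_reap` is hypothetical. It assumes a Unix host with `sh` on the `PATH`.

```rust
use std::io;
use std::process::{Command, ExitStatus};

// Hand-written declaration of the libc function, so the sketch has no
// crate dependencies. `pid_t` and `int` are both i32 on Linux and macOS.
unsafe extern "C" {
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

/// Spawns a short-lived child, lets a stand-in "global reaper" collect it
/// with waitpid(-1), then returns what the owner gets from `Child::wait`.
pub fn wait_after_global_reap() -> io::Result<ExitStatus> {
    let mut child = Command::new("sh").args(["-c", "exit 3"]).spawn()?;
    let child_pid = child.id() as i32;

    // Stand-in for the reaper loop: block until our child has been reaped
    // through the "any child" form of waitpid.
    let mut status: i32 = 0;
    loop {
        let reaped = unsafe { waitpid(-1, &mut status, 0) };
        if reaped == child_pid {
            break;
        }
        if reaped == -1 {
            let err = io::Error::last_os_error();
            if err.kind() == io::ErrorKind::Interrupted {
                continue; // retry on EINTR
            }
            return Err(err);
        }
    }

    // The exit status was consumed above, so the owner's wait now fails
    // with ECHILD instead of reporting exit code 3.
    child.wait()
}
```

Calling `wait_after_global_reap()` is expected to return an `Err` whose OS error is `ECHILD`, which is the lost exit status the review comment warns about.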

Recommended Solutions

  1. Use a Container Init System (Best Practice): Instead of implementing PID 1 reaping logic inside the application, run the container with a lightweight init system like tini or dumb-init as the entrypoint. These tools are specifically designed to reap orphans safely at the container level without interfering with the application's internal process management.
  2. Track Spawned PIDs: If you must reap inside the application, you need to maintain a registry of PIDs spawned by the application. When waitpid(-1) reaps a process, check if its PID is in the registry. If it is, you must store its exit status in a shared map so that the code waiting for it can retrieve it from the map instead of calling waitpid directly. However, this is complex and requires wrapping all process spawning in the application.
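For comparison, the second option (tracking spawned PIDs) might look roughly like the sketch below. Everything here is illustrative: `Registry`, `track`, `reap_all`, `exit_code`, and `run_tracked` are hypothetical names, the status decoding assumes the Linux wait-status layout, and a real integration would also need a lock around the registry and a way to wake waiters.

```rust
use std::collections::{HashMap, HashSet};
use std::io;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Hand-written libc declaration so the sketch has no crate dependencies.
unsafe extern "C" {
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

const WNOHANG: i32 = 1; // value on Linux and macOS

/// Hypothetical registry: PIDs the application spawned, plus the raw wait
/// statuses the reaper has collected on their behalf.
#[derive(Default)]
pub struct Registry {
    tracked: HashSet<i32>,
    exited: HashMap<i32, i32>,
}

impl Registry {
    pub fn track(&mut self, pid: i32) {
        self.tracked.insert(pid);
    }

    /// Reaps every child that has already exited, without blocking.
    /// Statuses of tracked PIDs are stored; orphans are simply discarded.
    pub fn reap_all(&mut self) {
        loop {
            let mut status = 0;
            let pid = unsafe { waitpid(-1, &mut status, WNOHANG) };
            if pid <= 0 {
                break; // 0: nothing has exited yet; -1: no children or error
            }
            if self.tracked.remove(&pid) {
                self.exited.insert(pid, status);
            }
        }
    }

    /// Exit code of a tracked child, if the reaper saw it exit normally.
    pub fn exit_code(&self, pid: i32) -> Option<i32> {
        let status = *self.exited.get(&pid)?;
        // WIFEXITED / WEXITSTATUS as laid out on Linux.
        if status & 0x7f == 0 {
            Some((status >> 8) & 0xff)
        } else {
            None
        }
    }
}

/// Spawns `sh -c "exit <code>"`, registers it, and polls the reaper until
/// the status shows up (or roughly two seconds pass).
pub fn run_tracked(code: i32) -> io::Result<Option<i32>> {
    let mut registry = Registry::default();
    let child = Command::new("sh")
        .arg("-c")
        .arg(format!("exit {code}"))
        .spawn()?;
    let pid = child.id() as i32;
    registry.track(pid);

    for _ in 0..200 {
        registry.reap_all();
        if registry.exited.contains_key(&pid) {
            break;
        }
        sleep(Duration::from_millis(10));
    }
    Ok(registry.exit_code(pid))
}
```

Note that the owner never calls `Child::wait` here; it reads the status from the registry instead. Every spawn site in the application would need the same treatment, which is the complexity the comment points out and part of why an init wrapper is the simpler fix.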

Comment thread executor_manager/src/reaper.rs Outdated
}
};

let mut reap_interval = interval(Duration::from_secs(5));

Priority: medium

By default, Tokio's interval uses MissedTickBehavior::Burst, which can cause multiple rapid ticks to fire if the event loop is delayed or busy. For a periodic background fallback like zombie reaping, it is safer to use MissedTickBehavior::Skip to avoid unnecessary bursts of reaping attempts.

Suggested change
- let mut reap_interval = interval(Duration::from_secs(5));
+ let mut reap_interval = interval(Duration::from_secs(5));
+ reap_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

@k82cn k82cn force-pushed the flm_501 branch 2 times, most recently from 44f2710 to 0fa53ca on June 15, 2026 12:53
@k82cn k82cn merged commit 2a372d1 into xflops:main Jun 15, 2026
8 checks passed
@k82cn k82cn deleted the flm_501 branch June 15, 2026 13:56

Development

Successfully merging this pull request may close these issues.

Host shim should handle orphan process
