fix: 3 proven race conditions in process lifecycle management #620

@rocketman-code

Problem

The process lifecycle code in main.rs (run_child, lines 848-1040) uses 4 global atomics shared across 3 threads and a signal handler. Three race conditions have been proven deterministically with forced interleavings.

Proven Races

Q1: SIGKILL sent to recycled PID (kills innocent process)

The escalation thread reads CHILD_PID, then sends SIGKILL. Between the read and the kill, the child exits and its PID is recycled to a new process. The SIGKILL hits the wrong process.

Proven by forcing PID recycling via /proc/sys/kernel/ns_last_pid. The victim process (sleep 600) was killed by SIGKILL intended for the original child.
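One possible direction for Q1, sketched with std only and not taken from main.rs: stop passing a raw PID between threads and route both reaping and killing through a lock around the `Child` handle. A PID cannot be recycled until the parent has reaped the child, so if `try_wait()` returns `None` while the lock is held, the PID still belongs to our child (running or zombie) and the signal cannot land on a stranger. `GuardedChild`, `poll`, and `kill_if_running` are hypothetical names, `sleep 30` is a stand-in child, and the SIGTERM step is omitted because std has no API for it.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical wrapper: the child can only be reaped or signalled
/// while holding the lock, so "check alive" and "kill" are one atomic step.
pub struct GuardedChild {
    inner: Mutex<Child>,
}

impl GuardedChild {
    pub fn spawn(cmd: &mut Command) -> io::Result<Self> {
        Ok(Self { inner: Mutex::new(cmd.spawn()?) })
    }

    /// Non-blocking reap. Returns Some(status) once the child has been collected.
    pub fn poll(&self) -> io::Result<Option<ExitStatus>> {
        self.inner.lock().unwrap().try_wait()
    }

    /// Sends SIGKILL only if the child has not been reaped yet.
    /// Returns true if a signal was actually sent.
    pub fn kill_if_running(&self) -> io::Result<bool> {
        let mut child = self.inner.lock().unwrap();
        if child.try_wait()?.is_some() {
            // Already reaped: the PID may have been reused, so do nothing.
            return Ok(false);
        }
        // Not reaped, and nobody else can reap while we hold the lock.
        child.kill()?;
        Ok(true)
    }
}

fn main() -> io::Result<()> {
    let child = Arc::new(GuardedChild::spawn(Command::new("sleep").arg("30"))?);

    // Escalation thread: after a short grace period, kill through the guard.
    let escalation = {
        let child = Arc::clone(&child);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(100));
            child.kill_if_running()
        })
    };

    // Main thread: poll instead of a blocking wait() so the lock is
    // released between checks and the escalation thread can get in.
    let status = loop {
        if let Some(status) = child.poll()? {
            break status;
        }
        thread::sleep(Duration::from_millis(10));
    };

    let killed = escalation.join().expect("escalation thread panicked")?;
    println!("child exited with {status}; SIGKILL sent: {killed}");
    Ok(())
}
```

The cost is a polling loop in place of a blocking `child.wait()`. This also assumes nothing else in the process reaps children (for example a SIGCHLD handler calling `waitpid(-1, ...)`).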

Q3: FORCE_KILLED flag set after main thread reads it

The main thread reads FORCE_KILLED as false and classifies the stop reason as Duration. The escalation thread then sets FORCE_KILLED to true. The "program did not respond to SIGTERM" warning is not printed even though SIGKILL was sent.

Proven with a barrier between the main thread's flag read and the escalation thread's flag write. One run, deterministic.
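A sketch of one way to close Q3, assuming the escalation logic can be restructured: instead of publishing the result through a shared FORCE_KILLED flag, the escalation thread returns it from its closure and the main thread joins before classifying the stop reason. `escalate`, `run`, the `ForceKilled` variant, and the sleep standing in for `child.wait()` are all hypothetical; only the `Duration` stop reason is named in this issue.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum StopReason {
    Duration,
    ForceKilled,
}

/// Escalation thread body. Waits up to `kill_timeout` for the "child exited"
/// message; if it does not arrive, runs `force_kill` and reports that it did.
fn escalate(exited: Receiver<()>, kill_timeout: Duration, force_kill: impl FnOnce()) -> bool {
    match exited.recv_timeout(kill_timeout) {
        // Child exited in time (or the main thread went away): no SIGKILL.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => false,
        Err(RecvTimeoutError::Timeout) => {
            force_kill();
            true
        }
    }
}

/// Simulated run: the sleep stands in for child.wait().
fn run(child_exits_after: Duration, kill_timeout: Duration) -> StopReason {
    let (exited_tx, exited_rx) = mpsc::channel();
    let escalation = thread::spawn(move || {
        escalate(exited_rx, kill_timeout, || { /* SIGKILL would be sent here */ })
    });

    thread::sleep(child_exits_after);
    let _ = exited_tx.send(());

    // Joining first means the escalation thread's decision is final
    // before the main thread reads it; there is no flag to read too early.
    let force_killed = escalation.join().expect("escalation thread panicked");
    if force_killed {
        StopReason::ForceKilled
    } else {
        StopReason::Duration
    }
}

fn main() {
    println!("{:?}", run(Duration::from_millis(10), Duration::from_millis(500)));
    println!("{:?}", run(Duration::from_millis(500), Duration::from_millis(10)));
}
```

With the result carried by `join()`, the warning can be printed whenever `force_killed` is true, and the flag disappears as a global.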

Q4: SIGINT arrives before CHILD_PID is stored (parent hangs)

SIGINT arrives between signal handler installation (line 940) and CHILD_PID store (line 948). The handler sees PID 0, skips the kill. The child never receives SIGTERM. With kill_timeout == 0, the parent hangs forever on child.wait(). A second Ctrl-C kills the parent (SA_RESETHAND), orphaning the child.

Proven with a barrier between handler installation and spawn. One run, deterministic. The comment at line 921 says "no Ctrl-C gap can orphan the child" but the proof shows the gap exists.
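One way the Q4 gap could be closed, shown as a simulation with no real signals: the handler always records that SIGINT arrived, and the code that stores the child PID checks that record immediately afterwards and replays the SIGTERM. `SignalState`, `on_sigint`, and `publish_child` are hypothetical names; a real handler would call `kill(2)` itself, which this sketch models by returning the PID to signal.

```rust
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};

/// Hypothetical state shared between the signal handler and the main thread.
/// Only atomics are touched, which keeps the handler side async-signal-safe.
pub struct SignalState {
    child_pid: AtomicI32,
    sigint_pending: AtomicBool,
}

impl SignalState {
    pub const fn new() -> Self {
        Self {
            child_pid: AtomicI32::new(0),
            sigint_pending: AtomicBool::new(false),
        }
    }

    /// Handler side: remember the SIGINT first, then look for a child.
    /// Returns the PID to SIGTERM, or None if no child is published yet.
    pub fn on_sigint(&self) -> Option<i32> {
        self.sigint_pending.store(true, Ordering::SeqCst);
        match self.child_pid.load(Ordering::SeqCst) {
            0 => None,
            pid => Some(pid),
        }
    }

    /// Main-thread side, called right after spawn: publish the PID, then
    /// check whether a SIGINT arrived before the PID was visible.
    /// Returns the PID to SIGTERM if an early SIGINT must be replayed.
    pub fn publish_child(&self, pid: i32) -> Option<i32> {
        self.child_pid.store(pid, Ordering::SeqCst);
        self.sigint_pending.load(Ordering::SeqCst).then_some(pid)
    }
}

fn main() {
    // A real handler cannot capture state, so this would live in a static.
    static STATE: SignalState = SignalState::new();

    // SIGINT arrives in the gap between handler installation and the PID store.
    println!("early SIGINT, handler targets: {:?}", STATE.on_sigint());
    // The PID store notices the pending SIGINT and replays it.
    println!("after spawn, replay targets: {:?}", STATE.publish_child(4242));
}
```

Each side writes its own variable and then reads the other's, all with `SeqCst`, so at least one of them observes the other and the SIGTERM is never dropped. If the two overlap, both may report the PID and the child receives SIGTERM twice, which is assumed to be acceptable here.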

Root Cause

Four global atomics coordinate three threads and a signal handler. This is shared mutable state in concurrent code, which violates Principle 1 of the project's architecture (philosophy.md: "Kill all globals").

Context

Found during CLI interaction contract enumeration when investigating the untested SIGTERM timeout warning messages (CI14/CI17). The warning messages are symptoms. The races are the disease.
