hiffy sync changes lead to false negatives #614

@mkeeter

Description

#598 changed the requirement for hiffy to be considered up: we must see HIFFY_READY == 1 and have spotted the task in a syscall.

```rust
fn has_task_started(
    hubris: &HubrisArchive,
    core: &mut dyn Core,
    task_index: u32,
) -> Result<bool> {
    let task_t = hubris.lookup_struct_byname("Task")?;
    let (base, _) = hubris.task_table(core)?;
    let addr = base + (task_index * task_t.size as u32);
    let mut buffer = vec![0; task_t.size];
    core.read_8(addr, &mut buffer)?;
    let task: humility_doppel::Task =
        reflect::load(hubris, &buffer, task_t, 0)?;

    // How can we tell that a task has finished pre-main initialization?
    // Let me count the ways...
    match task.state {
        TaskState::Healthy(ss) => match ss {
            // Tasks don't issue IPCs during pre-main, so if the task is in
            // SEND, REPLY, or RECV, it has by definition gotten into main.
            SchedState::InSend(_)
            | SchedState::InReply(_)
            | SchedState::InRecv(_) => Ok(true),

            // A task may be stopped or "runnable"/"running" before main, so
            // from this information alone, we can't tell one way or the
            // other.
            SchedState::Stopped | SchedState::Runnable => Ok(false),
        },

        // A badly configured task could hit a fault before main by
        // initializing memory badly, or the supervisor could deliver a fault
        // before the task has a chance to run. Either way, this state
        // interferes with our ability to reliably detect startup.
        TaskState::Faulted { .. } => Ok(false),
    }
}
```

The goal of that PR was to fix cases where hiffy had not yet started, but `HIFFY_READY` was 1 (either spuriously or due to previous RAM contents). This could occur on images which run high-priority, CPU-heavy workloads at startup, since those tasks may starve hiffy of runtime. Waiting to see hiffy in a syscall gives us confidence that the task has actually booted; otherwise, Humility prints `HIF execution facility unavailable` and returns an error.
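For reference, the new liveness condition amounts to a two-part predicate. This is a purely illustrative sketch; the function and parameter names are hypothetical, not Humility's actual API:

```rust
/// Illustrative sketch of the two-part check #598 introduced: hiffy counts
/// as up only if HIFFY_READY reads as 1 *and* the task has been observed in
/// a syscall. Names here are invented for illustration.
fn hiffy_is_up(hiffy_ready: u8, task_started: bool) -> bool {
    // Requiring both conditions guards against a stale or spurious
    // HIFFY_READY value left over in RAM from a previous image.
    hiffy_ready == 1 && task_started
}

fn main() {
    // Stale RAM: HIFFY_READY reads 1, but the task was never seen running.
    assert!(!hiffy_is_up(1, false));
    // Both conditions hold: hiffy is considered up.
    assert!(hiffy_is_up(1, true));
    println!("ok");
}
```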

Unfortunately, we're now seeing failures at manufacturing time due to this change ([1], [2]).

[1] looks like a false positive due to this new check:

  • The manufacturing software calls SpRot.status() repeatedly and waits for it to return successfully. The mfg software ignores the `HIF execution facility unavailable` error at this point, because it's waiting for hiffy to start up.
  • The SP is doing CPU-intensive work but occasionally lets hiffy run, so this call eventually returns successfully, even though the SP is still starving hiffy. At this point, the mfg software expects hiffy calls to succeed.
  • The mfg software tries to make another hiffy call. The CPU-intensive work is still running, so this call fails. The mfg software is not expecting this and bails out with an error.
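The sequence above can be modeled as a client that treats its first successful call as proof that hiffy will stay available. This is a minimal simulation of that failure mode; the scheduler model and helper names are invented for the sketch and are not taken from the mfg software or Humility:

```rust
// Hypothetical model: while CPU-intensive startup work runs, the scheduler
// only lets hiffy execute on every 5th tick (simulated starvation).
fn hiffy_available(tick: u64) -> bool {
    tick % 5 == 0
}

/// Naive strategy: poll until the first success, then assume readiness.
fn wait_for_first_success(start: u64) -> u64 {
    let mut tick = start;
    while !hiffy_available(tick) {
        tick += 1;
    }
    tick
}

fn main() {
    let ready_tick = wait_for_first_success(1);
    // The very next call can still fail: one success does not mean hiffy
    // has stopped being starved.
    assert!(!hiffy_available(ready_tick + 1));
    // Retrying every call, rather than only the first, rides out
    // intermittent starvation.
    let retried = wait_for_first_success(ready_tick + 1);
    assert!(hiffy_available(retried));
    println!("first success at tick {ready_tick}, retry succeeded at tick {retried}");
}
```

The point of the sketch is that a single success is not a reliable liveness signal while other tasks can still starve hiffy; a client must either retry every call or wait for the starvation to end.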

In [2], it's more ambiguous who's at fault: the manufacturing software doesn't explicitly wait for hiffy to come up, so it could be a genuine case of other tasks preventing hiffy from starting up – or another false positive.

cc @Aaron-Hartwig @cbiffle
