#598 changed the requirement for `hiffy` to be considered up: we must see `HIFFY_READY == 1` and have spotted the task in a syscall.

humility/humility-hiffy/src/lib.rs, lines 1284 to 1318 in c556213:

    fn has_task_started(
        hubris: &HubrisArchive,
        core: &mut dyn Core,
        task_index: u32,
    ) -> Result<bool> {
        let task_t = hubris.lookup_struct_byname("Task")?;

        let (base, _) = hubris.task_table(core)?;

        let addr = base + (task_index * task_t.size as u32);
        let mut buffer = vec![0; task_t.size];
        core.read_8(addr, &mut buffer)?;

        let task: humility_doppel::Task =
            reflect::load(hubris, &buffer, task_t, 0)?;
        // How can we tell that a task has finished pre-main initialization? Let me
        // count the ways...
        match task.state {
            TaskState::Healthy(ss) => match ss {
                // Tasks don't issue IPCs during pre-main, so if the task is in
                // SEND, REPLY, or RECV, it has by definition gotten into main.
                SchedState::InSend(_)
                | SchedState::InReply(_)
                | SchedState::InRecv(_) => Ok(true),
                // A task may be stopped or "runnable"/"running" before main, so
                // from this information alone, we can't tell one way or the other.
                SchedState::Stopped | SchedState::Runnable => Ok(false),
            },
            // A badly configured task could hit a fault before main by initializing
            // memory badly, or the supervisor could deliver a fault before the task
            // has a chance to run. Either way, this state interferes with our
            // ability to reliably detect startup.
            TaskState::Faulted { .. } => Ok(false),
        }
    }

The goal of that PR was to fix cases where `hiffy` had not yet started but `HIFFY_READY` was 1 (either spuriously or due to previous RAM contents). This could occur on images that run high-priority, CPU-heavy workloads at startup, since those tasks may starve `hiffy` of runtime. Waiting to see `hiffy` in a syscall gives us confidence that the task has actually booted; otherwise, Humility prints `HIF execution facility unavailable` and returns an error.
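As a rough sketch of the rule described above, the two conditions can be read as a single predicate. The function name, its parameters, and the way the inputs are combined are illustrative assumptions rather than Humility's actual code; only the error string is taken from the text.

```rust
/// Error text that Humility prints when the check fails (quoted from the issue).
const UNAVAILABLE: &str = "HIF execution facility unavailable";

/// Hypothetical combination of the two conditions:
/// - `hiffy_ready`: the value of HIFFY_READY read from target memory
/// - `seen_in_syscall`: whether the task was observed in SEND, REPLY, or RECV
fn hiffy_is_up(hiffy_ready: u32, seen_in_syscall: bool) -> Result<(), &'static str> {
    if hiffy_ready == 1 && seen_in_syscall {
        Ok(())
    } else {
        // Covers both "flag not set" and "flag set, but the task has not
        // been seen in a syscall" (for example, stale RAM contents).
        Err(UNAVAILABLE)
    }
}
```

Under this reading, a stale `HIFFY_READY == 1` with a starved task (`hiffy_is_up(1, false)`) is rejected, which is the case the PR set out to catch.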
Unfortunately, we're now seeing failures at manufacturing time due to this change (1, 2).
[1] looks like a false positive due to this new check:
- The manufacturing software calls `SpRot.status()` repeatedly and waits for it to return successfully. The mfg software ignores the `HIF execution facility unavailable` error at this point, because it's waiting for `hiffy` to start up.
- The SP is doing CPU-intensive work, but occasionally lets `hiffy` run, so this call returns successfully, even though the SP is still starving `hiffy`. At this point, mfg software expects `hiffy` calls to succeed.
- Mfg software tries to do another `hiffy` call. The CPU-intensive work is still running, so this call fails. The mfg software is not expecting this, and bails out with an error.
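The sequence above can be replayed with a small simulation. This is only a sketch of the polling pattern as described here, not the real manufacturing software: `hiffy_call`, `first_fatal_call`, and the scripted observations are all hypothetical, and each `bool` stands for "the debugger happened to catch the task in a syscall at the moment of the call".

```rust
/// Hypothetical stand-in for one hiffy call: it succeeds only if the task was
/// caught in a syscall at the moment of the call.
fn hiffy_call(in_syscall: bool) -> Result<(), &'static str> {
    if in_syscall {
        Ok(())
    } else {
        Err("HIF execution facility unavailable")
    }
}

/// Replays scripted observations through the polling pattern. Returns the
/// index of the call that aborts the run, or `None` if no call is fatal.
fn first_fatal_call(observations: &[bool]) -> Option<usize> {
    let mut ready = false;
    for (i, &in_syscall) in observations.iter().enumerate() {
        match hiffy_call(in_syscall) {
            Ok(()) => ready = true,
            // Before the first success, the error is expected and ignored.
            Err(_) if !ready => continue,
            // After the first success, the same error is treated as fatal.
            Err(_) => return Some(i),
        }
    }
    None
}
```

With observations such as `[false, false, true, false]`, the third call marks `hiffy` as ready and the fourth is treated as fatal, even though the task's condition (starved but started) has not changed between the two.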
In [2], it's more ambiguous who's at fault: the manufacturing software doesn't explicitly wait for `hiffy` to come up, so it could be a genuine case of other tasks preventing `hiffy` from starting up, or another false positive.
cc @Aaron-Hartwig @cbiffle