hiffy sync changes lead to false negatives #614

@mkeeter

Description

#598 changed the requirement for hiffy to be considered up: we must see HIFFY_READY == 1 and have spotted the task in a syscall.

```rust
fn has_task_started(
    hubris: &HubrisArchive,
    core: &mut dyn Core,
    task_index: u32,
) -> Result<bool> {
    let task_t = hubris.lookup_struct_byname("Task")?;
    let (base, _) = hubris.task_table(core)?;
    let addr = base + (task_index * task_t.size as u32);
    let mut buffer = vec![0; task_t.size];
    core.read_8(addr, &mut buffer)?;
    let task: humility_doppel::Task =
        reflect::load(hubris, &buffer, task_t, 0)?;

    // How can we tell that a task has finished pre-main initialization?
    // Let me count the ways...
    match task.state {
        TaskState::Healthy(ss) => match ss {
            // Tasks don't issue IPCs during pre-main, so if the task is in
            // SEND, REPLY, or RECV, it has by definition gotten into main.
            SchedState::InSend(_)
            | SchedState::InReply(_)
            | SchedState::InRecv(_) => Ok(true),

            // A task may be stopped or "runnable"/"running" before main, so
            // from this information alone, we can't tell one way or the
            // other.
            SchedState::Stopped | SchedState::Runnable => Ok(false),
        },

        // A badly configured task could hit a fault before main by
        // initializing memory badly, or the supervisor could deliver a fault
        // before the task has a chance to run. Either way, this state
        // interferes with our ability to reliably detect startup.
        TaskState::Faulted { .. } => Ok(false),
    }
}
```

The goal of that PR was to fix cases where hiffy had not yet started, but `HIFFY_READY` was 1 (either spuriously or due to previous RAM contents). This could occur on images which run high-priority, CPU-heavy workloads at startup, since those tasks may starve hiffy of runtime. Waiting to see hiffy in a syscall gives us confidence that the task has actually booted; otherwise, Humility prints `HIF execution facility unavailable` and returns an error.
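For reference, the new liveness condition amounts to a two-part predicate. This is a purely illustrative sketch; the function and parameter names are hypothetical, not Humility's actual API:

```rust
/// Illustrative sketch of the two-part check #598 introduced: hiffy counts
/// as up only if HIFFY_READY reads as 1 *and* the task has been observed in
/// a syscall. Names here are invented for illustration.
fn hiffy_is_up(hiffy_ready: u8, task_started: bool) -> bool {
    // Requiring both conditions guards against a stale or spurious
    // HIFFY_READY value left over in RAM from a previous image.
    hiffy_ready == 1 && task_started
}

fn main() {
    // Stale RAM: HIFFY_READY reads 1, but the task was never seen running.
    assert!(!hiffy_is_up(1, false));
    // Both conditions hold: hiffy is considered up.
    assert!(hiffy_is_up(1, true));
    println!("ok");
}
```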

Unfortunately, we're now seeing failures at manufacturing time due to this change ([1], [2]).

[1] looks like a false positive due to this new check:

  • The manufacturing software calls SpRot.status() repeatedly and waits for it to return successfully. The mfg software ignores the `HIF execution facility unavailable` error at this point, because it's waiting for hiffy to start up.
  • The SP is doing CPU-intensive work but occasionally lets hiffy run, so this call eventually returns successfully, even though the SP is still starving hiffy. At this point, the mfg software expects hiffy calls to succeed.
  • The mfg software tries to make another hiffy call. The CPU-intensive work is still running, so this call fails. The mfg software is not expecting this and bails out with an error.
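The sequence above can be modeled as a client that treats its first successful call as proof that hiffy will stay available. This is a minimal simulation of that failure mode; the scheduler model and helper names are invented for the sketch and are not taken from the mfg software or Humility:

```rust
// Hypothetical model: while CPU-intensive startup work runs, the scheduler
// only lets hiffy execute on every 5th tick (simulated starvation).
fn hiffy_available(tick: u64) -> bool {
    tick % 5 == 0
}

/// Naive strategy: poll until the first success, then assume readiness.
fn wait_for_first_success(start: u64) -> u64 {
    let mut tick = start;
    while !hiffy_available(tick) {
        tick += 1;
    }
    tick
}

fn main() {
    let ready_tick = wait_for_first_success(1);
    // The very next call can still fail: one success does not mean hiffy
    // has stopped being starved.
    assert!(!hiffy_available(ready_tick + 1));
    // Retrying every call, rather than only the first, rides out
    // intermittent starvation.
    let retried = wait_for_first_success(ready_tick + 1);
    assert!(hiffy_available(retried));
    println!("first success at tick {ready_tick}, retry succeeded at tick {retried}");
}
```

The point of the sketch is that a single success is not a reliable liveness signal while other tasks can still starve hiffy; a client must either retry every call or wait for the starvation to end.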

In [2], it's more ambiguous who's at fault: the manufacturing software doesn't explicitly wait for hiffy to come up, so it could be a genuine case of other tasks preventing hiffy from starting up – or another false positive.

cc @Aaron-Hartwig @cbiffle
