Skip to content

mavlink2-bridge: multi-row command batches are silently truncated to row 0 — extra commands dropped on the dora→vehicle path #2075

@phil-opp

Description

@phil-opp

Note: This issue was created by a scheduled automated Claude code-review check. It was found by randomly selecting a component (mavlink2-bridge) and reviewing it for correctness issues. Please treat the severity assessment below as a starting point for a human triage, not a confirmed defect. This is distinct from the earlier automated review #2023 (which covered the missing round-trip tests for the write-path impls and explicitly ruled out the sequence/field-width concerns); this issue is about a concrete decode-path behavior on multi-row input batches.

Summary

The Arrow → MAVLink decode path in libraries/extensions/mavlink2-bridge/src/arrow_convert.rs reads only row 0 of each incoming RecordBatch and never checks the row count. A dora producer that publishes an N-row batch (N > 1) on a command input (e.g. command_long_cmd, set_position_target_local_ned_cmd) will have rows 1..N-1 silently discarded — only the first command is encoded and sent to the vehicle. There is no error, warning, or trace.

Because this is the dora → vehicle path (the side that arms motors and commands position/velocity), silently dropping all-but-the-first command is a safety-relevant silent data loss, not just a cosmetic gap.

Affected code

read_primitive (the single chokepoint every read_* helper funnels through) guards against zero rows but then unconditionally returns value(0):

// libraries/extensions/mavlink2-bridge/src/arrow_convert.rs:86-96
if arr.is_empty() {
    return Err(BridgeError::Mavlink(format!(
        "column '{name}' has zero rows; expected ≥1 row"
    )));
}
if arr.is_null(0) {
    return Err(BridgeError::Mavlink(format!(
        "column '{name}' is null at row 0"
    )));
}
Ok(arr.value(0))   // ← rows 1..N-1 are never read

Note the error message even says "expected ≥1 row" — i.e. the contract was written as "at least one row", but every from_record_batch impl only ever consumes row 0, so the effective contract is "exactly one row" and the surplus is dropped instead of rejected.

The dispatch path has no row-count guard either:

// binaries/mavlink2-bridge-node/src/main.rs:322-331
fn decode_input<T: MavlinkArrow>(data: &ArrayRef, input: &str) -> Result<T> {
    let struct_array = data.as_struct_opt().ok_or_else(|| { /* ... */ })?;
    let batch = RecordBatch::from(struct_array);   // N rows preserved here…
    T::from_record_batch(&batch)                   // …but only row 0 is decoded
        .with_context(|| format!("decoding {input} from incoming Arrow"))
}

A StructArray of length N round-trips into an N-row RecordBatch, so the row count is available at decode time — it is simply ignored.

Why this is a correctness/safety risk

The module's own doc header states the convention is "One row per call." (arrow_convert.rs:4-5), and the negative-test file notes these inputs are public — "any node in the dataflow can publish to them" (tests/arrow_malformed.rs). But "one row per call" is enforced nowhere:

  • A Rust/Python node that batches several setpoints/commands into a single Arrow message (a natural thing to do — e.g. a planner emitting a short trajectory of SET_POSITION_TARGET_LOCAL_NED rows) will have only the first row flown, with the rest silently dropped.
  • The failure is invisible: no Err, no log. The operator believes N commands were sent; one was.

This is the opposite of the existing, deliberate hardening on this path: read_primitive was specifically written to turn malformed input into BridgeError rather than a panic (see the comment block at arrow_convert.rs:57-70). Multi-row input is just another malformed-shape case that should be handled explicitly rather than truncated.

Verification

  • grep -rn "num_rows\|len() != 1" libraries/extensions/mavlink2-bridge/src binaries/mavlink2-bridge-node/srcno row-count guard anywhere.
  • tests/arrow_malformed.rs covers wrong-column-type, null-at-row-0, and zero-row cases for COMMAND_LONG_DATA, but has no multi-row case — so this behavior is untested.
  • The encode side (to_record_batch) always emits exactly one row, so telemetry (vehicle → dora) is unaffected; the bug is purely on inbound commands.

Reproducer sketch

// Build a 2-row COMMAND_LONG batch (two distinct commands) and decode it.
let batch = /* RecordBatch with the 11 COMMAND_LONG columns, each length 2,
               row0 = MAV_CMD_NAV_TAKEOFF, row1 = MAV_CMD_NAV_LAND */;
let decoded = COMMAND_LONG_DATA::from_record_batch(&batch).unwrap();
assert_eq!(decoded.command as u32, 22); // row 0 (TAKEOFF) — row 1 (LAND) silently lost

Suggested fix

Reject (or, if intentional, explicitly iterate) multi-row batches at the decode boundary. The minimal, lowest-risk fix is a single guard in decode_input (one place, covers all six command inputs):

let batch = RecordBatch::from(struct_array);
if batch.num_rows() != 1 {
    bail!(
        "input '{input}' expects exactly one row per message, got {} \
         (multi-row command batches are not supported; send one command per message)",
        batch.num_rows()
    );
}

Alternatively, enforce it inside read_primitive (change the is_empty() check to arr.len() != 1) so the library contract is self-enforcing regardless of caller, and add a rejects_multi_row case to tests/arrow_malformed.rs. If batched commands are a desired feature, the proper fix is to loop over all rows in handle_input and send one MAVLink frame per row — but that is a larger change and should be a deliberate decision.

Things I checked and ruled out (no action needed)

Other observation (lower priority, not the focus of this issue)

Inbound telemetry forwarded via the reader thread can sit in the flume channel for up to POLL_INTERVAL (100 ms) before main drains it, because main only drains the channel at the top of each loop iteration and then blocks in events.recv_timeout(POLL_INTERVAL) with no wake from the reader (main.rs:453-481). This adds up to ~100 ms of latency/jitter to telemetry outputs. Flagging for awareness; a human may judge this acceptable for v1 or worth a separate tracking issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcliCLIdocumentationImprovements or additions to documentationrust

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions