Enhance flux-testkit: rich assertions, builder, fixtures, property-based testing, snapshots#4
Enhance flux-testkit: rich assertions, builder, fixtures, property-based testing, snapshots#4SuperInstance wants to merge 1 commit into
Conversation
| regs, cycles = self.run(bytecode, initial) | ||
| return VMState( | ||
| registers=regs, stack_top=0, pc=0, cycles=cycles, halted=True | ||
| ) |
There was a problem hiding this comment.
🔴 run_with_state hardcodes halted=True and zeroed metadata regardless of actual VM execution outcome
run_with_state at testkit.py:150-152 always constructs VMState with halted=True, stack_top=0, pc=0, and empty opcodes_executed, regardless of what actually happened during VM execution. The underlying run() method tracks halted, sp, pc, and opcodes_executed locally but discards them, returning only (regs, cycles). This means assert_halted(state, True) will always pass and assert_halted(state, False) will always fail — even for programs that never execute a HALT (e.g., hitting the cycle limit). Confirmed by running a 100-NOP program with max_cycles=10: state.halted is True even though the VM timed out.
Prompt for agents
The root cause is that _MiniVM.run() computes halted, sp, pc, and opcodes_executed as local variables but only returns (regs_dict, cycles). run_with_state() then has no way to get the real values and hardcodes them.
To fix this, refactor run() to also return the additional state (halted, pc, sp/stack_top, opcodes_executed), or have run_with_state() implement its own execution loop that captures all state. The simplest approach is to have run() return a VMState directly (or a richer tuple) so that run_with_state() can pass the real values through.
Affected files: testkit.py
Key methods: _MiniVM.run (line 85), _MiniVM.run_with_state (line 147), VMState dataclass (line 58)
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
| else: | ||
| lines.append(f"{i:04d}: {name} R{bc[i+1]}, R{bc[i+2]}, R{bc[i+3]}") | ||
| i += 4 |
There was a problem hiding this comment.
🟡 Disassembler formats conditional branch instructions as 3-register RRR-type, producing incorrect output
The disassemble function (testkit.py:177-179) treats all opcodes >= 0x20 uniformly as Rd, Rs1, Rs2 (3-register) format. However, conditional branch instructions BEQ/BNE/BLT/BGE (0x3C-0x3F) have format [op, Rs, signed_offset, padding] — the second byte is the condition register, the third is a signed offset, and the fourth is padding. The disassembler incorrectly displays the offset as a register number: e.g., BNE R0, R250, R0 instead of BNE R0, -6. Confirmed by disassembling the factorial program. This produces misleading output for human readers and creates incorrect snapshots in snapshot testing.
| else: | |
| lines.append(f"{i:04d}: {name} R{bc[i+1]}, R{bc[i+2]}, R{bc[i+3]}") | |
| i += 4 | |
| if op in CONDITIONAL_OPCODES: | |
| offset = _signed_byte(bc[i+2]) | |
| lines.append(f"{i:04d}: {name} R{bc[i+1]}, {offset}") | |
| else: | |
| lines.append(f"{i:04d}: {name} R{bc[i+1]}, R{bc[i+2]}, R{bc[i+3]}") | |
| i += 4 |
Was this helpful? React with 👍 or 👎 to provide feedback.
Debug
…sed testing, snapshots, 56 tests - Assertion helpers: assert_register_eq/ne/range, assert_flag_zero/negative, assert_halted, assert_true/false, assert_greater/less, assert_in_range, assert_no_error, skip - BytecodeBuilder: fluent API for constructing bytecode (movi, add, sub, mul, div, mod, and, or, xor, seq, slt, movr, beq, bne, incr, decr, push, pop, nop, halt) - FixtureGenerator: pre-built programs (movi, add, factorial, loop, nop_sled, push/pop, edge cases) + seeded random generators - PropertyChecker: for_all_values and for_all_bytecodes with counterexample reporting - SnapshotTester: disassembly snapshot testing with auto-create, match, mismatch, update - Per-opcode pass/fail reporting via opcode tags - Multi-format reports: terminal, markdown, JSON - Disassembler for FLUX bytecode - VMState class with register_eq, flag_zero, flag_negative helpers - 56 tests across 7 test classes
c5ac37a to
9afccdf
Compare
|
Closing: superseded by merged work on main. The changes from this PR have been incorporated through other merged PRs. Thank you for the contribution! 🙏 |
| if t.status == TestStatus.PASSED: | ||
| opcode_results[tag]["passed"] += 1 | ||
| elif t.status == TestStatus.FAILED: | ||
| opcode_results[tag]["failed"] += 1 | ||
| else: | ||
| opcode_results[tag]["error"] += 1 |
There was a problem hiding this comment.
🟡 per_opcode_report counts SKIPPED tests as errors, deflating pass rates
In per_opcode_report at testkit.py:298-303, the else branch catches both ERROR and SKIPPED statuses, incrementing the "error" counter for skipped tests. This inflates the error count and deflates the pass rate. For example, an opcode with 1 pass and 1 skip shows a 50% pass rate instead of 100%.
| if t.status == TestStatus.PASSED: | |
| opcode_results[tag]["passed"] += 1 | |
| elif t.status == TestStatus.FAILED: | |
| opcode_results[tag]["failed"] += 1 | |
| else: | |
| opcode_results[tag]["error"] += 1 | |
| if t.status == TestStatus.PASSED: | |
| opcode_results[tag]["passed"] += 1 | |
| elif t.status == TestStatus.FAILED: | |
| opcode_results[tag]["failed"] += 1 | |
| elif t.status == TestStatus.ERROR: | |
| opcode_results[tag]["error"] += 1 |
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Comprehensive enhancement of the FLUX testkit with 56 tests.
Changes
assert_register_eq,assert_register_ne,assert_register_range,assert_flag_zero,assert_flag_negative,assert_halted,assert_true,assert_false,assert_greater,assert_less,assert_in_range,assert_no_error,skipfor_all_valuesandfor_all_bytecodesmethods, counterexample reporting on failureper_opcode_report()register_eq,flag_zero,flag_negativehelpersTest Results
56 tests passing across 7 test classes:
TestAssertions(15 tests): register, flag, halted, true/false, greater/less, range, no_error, skipTestSuiteRunner(12 tests): basic, fail, add, multi, error, cycles, markdown, JSON, terminal, size, skip, per-opcodeTestFixtureGenerators(12 tests): movi, add, factorial, edge cases, loop, nop, push/pop, randomTestBytecodeBuilder(6 tests): simple, add, loop, mul, sub, lenTestDisassembler(3 tests): simple, add, unknown opcodeTestPropertyBased(4 tests): pass, fail with counterexample, bytecodes, all_passedTestSnapshotTesting(4 tests): create/match, mismatch, update, content