Skip to content

Enhance flux-testkit: rich assertions, builder, fixtures, property-based testing, snapshots#4

Closed
SuperInstance wants to merge 1 commit into
mainfrom
superz/testkit-enhance
Closed

Enhance flux-testkit: rich assertions, builder, fixtures, property-based testing, snapshots#4
SuperInstance wants to merge 1 commit into
mainfrom
superz/testkit-enhance

Conversation

@SuperInstance
Copy link
Copy Markdown
Owner

@SuperInstance SuperInstance commented Apr 13, 2026

Summary

Comprehensive enhancement of the FLUX testkit with 56 tests.

Changes

  • Rich assertion helpers: assert_register_eq, assert_register_ne, assert_register_range, assert_flag_zero, assert_flag_negative, assert_halted, assert_true, assert_false, assert_greater, assert_less, assert_in_range, assert_no_error, skip
  • BytecodeBuilder: Fluent API for constructing FLUX bytecode programs (supports movi, add, sub, mul, div, mod, and, or, xor, seq, slt, movr, beq, bne, incr, decr, push, pop, nop, halt)
  • FixtureGenerator: Pre-built test programs (movi, add, factorial, loop, NOP sled, push/pop, edge cases with max/negative immediates) plus seeded random program generators
  • PropertyChecker: Property-based testing with for_all_values and for_all_bytecodes methods, counterexample reporting on failure
  • SnapshotTester: Snapshot testing for disassembly output with auto-create, match detection, mismatch reporting, and update mode
  • Per-opcode reporting: Tag tests with opcode names, get pass/fail rates per opcode via per_opcode_report()
  • Multi-format reports: Terminal (styled), Markdown, and JSON output formats
  • Disassembler: Human-readable disassembly of FLUX bytecode with PC offsets and register/immediate display
  • VMState: Post-execution state snapshot with register_eq, flag_zero, flag_negative helpers
  • Setup/teardown: Suite-level setup and teardown hooks

Test Results

56 tests passing across 7 test classes:

  • TestAssertions (15 tests): register, flag, halted, true/false, greater/less, range, no_error, skip
  • TestSuiteRunner (12 tests): basic, fail, add, multi, error, cycles, markdown, JSON, terminal, size, skip, per-opcode
  • TestFixtureGenerators (12 tests): movi, add, factorial, edge cases, loop, nop, push/pop, random
  • TestBytecodeBuilder (6 tests): simple, add, loop, mul, sub, len
  • TestDisassembler (3 tests): simple, add, unknown opcode
  • TestPropertyBased (4 tests): pass, fail with counterexample, bytecodes, all_passed
  • TestSnapshotTesting (4 tests): create/match, mismatch, update, content

Staging: Open in Devin

Copy link
Copy Markdown

@beta-devin-ai-integration beta-devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

Staging: Open in Devin

Comment thread testkit.py
Comment on lines +149 to +152
regs, cycles = self.run(bytecode, initial)
return VMState(
registers=regs, stack_top=0, pc=0, cycles=cycles, halted=True
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 run_with_state hardcodes halted=True and zeroed metadata regardless of actual VM execution outcome

run_with_state at testkit.py:150-152 always constructs VMState with halted=True, stack_top=0, pc=0, and empty opcodes_executed, regardless of what actually happened during VM execution. The underlying run() method tracks halted, sp, pc, and opcodes_executed locally but discards them, returning only (regs, cycles). This means assert_halted(state, True) will always pass and assert_halted(state, False) will always fail — even for programs that never execute a HALT (e.g., hitting the cycle limit). Confirmed by running a 100-NOP program with max_cycles=10: state.halted is True even though the VM timed out.

Prompt for agents
The root cause is that _MiniVM.run() computes halted, sp, pc, and opcodes_executed as local variables but only returns (regs_dict, cycles). run_with_state() then has no way to get the real values and hardcodes them.

To fix this, refactor run() to also return the additional state (halted, pc, sp/stack_top, opcodes_executed), or have run_with_state() implement its own execution loop that captures all state. The simplest approach is to have run() return a VMState directly (or a richer tuple) so that run_with_state() can pass the real values through.

Affected files: testkit.py
Key methods: _MiniVM.run (line 85), _MiniVM.run_with_state (line 147), VMState dataclass (line 58)
Staging: Open in Devin

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Comment thread testkit.py
Comment on lines +177 to +179
else:
lines.append(f"{i:04d}: {name} R{bc[i+1]}, R{bc[i+2]}, R{bc[i+3]}")
i += 4
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Disassembler formats conditional branch instructions as 3-register RRR-type, producing incorrect output

The disassemble function (testkit.py:177-179) treats all opcodes >= 0x20 uniformly as Rd, Rs1, Rs2 (3-register) format. However, conditional branch instructions BEQ/BNE/BLT/BGE (0x3C-0x3F) have format [op, Rs, signed_offset, padding] — the second byte is the condition register, the third is a signed offset, and the fourth is padding. The disassembler incorrectly displays the offset as a register number: e.g., BNE R0, R250, R0 instead of BNE R0, -6. Confirmed by disassembling the factorial program. This produces misleading output for human readers and creates incorrect snapshots in snapshot testing.

Suggested change
else:
lines.append(f"{i:04d}: {name} R{bc[i+1]}, R{bc[i+2]}, R{bc[i+3]}")
i += 4
if op in CONDITIONAL_OPCODES:
offset = _signed_byte(bc[i+2])
lines.append(f"{i:04d}: {name} R{bc[i+1]}, {offset}")
else:
lines.append(f"{i:04d}: {name} R{bc[i+1]}, R{bc[i+2]}, R{bc[i+3]}")
i += 4
Staging: Open in Devin

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

…sed testing, snapshots, 56 tests

- Assertion helpers: assert_register_eq/ne/range, assert_flag_zero/negative, assert_halted, assert_true/false, assert_greater/less, assert_in_range, assert_no_error, skip
- BytecodeBuilder: fluent API for constructing bytecode (movi, add, sub, mul, div, mod, and, or, xor, seq, slt, movr, beq, bne, incr, decr, push, pop, nop, halt)
- FixtureGenerator: pre-built programs (movi, add, factorial, loop, nop_sled, push/pop, edge cases) + seeded random generators
- PropertyChecker: for_all_values and for_all_bytecodes with counterexample reporting
- SnapshotTester: disassembly snapshot testing with auto-create, match, mismatch, update
- Per-opcode pass/fail reporting via opcode tags
- Multi-format reports: terminal, markdown, JSON
- Disassembler for FLUX bytecode
- VMState class with register_eq, flag_zero, flag_negative helpers
- 56 tests across 7 test classes
@SuperInstance SuperInstance force-pushed the superz/testkit-enhance branch from c5ac37a to 9afccdf Compare April 18, 2026 18:38
@SuperInstance
Copy link
Copy Markdown
Owner Author

Closing: superseded by merged work on main. The changes from this PR have been incorporated through other merged PRs. Thank you for the contribution! 🙏

Copy link
Copy Markdown

@beta-devin-ai-integration beta-devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 6 additional findings in Devin Review.

Open in Devin Review (Beta)

Comment thread testkit.py
Comment on lines +298 to +303
if t.status == TestStatus.PASSED:
opcode_results[tag]["passed"] += 1
elif t.status == TestStatus.FAILED:
opcode_results[tag]["failed"] += 1
else:
opcode_results[tag]["error"] += 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 per_opcode_report counts SKIPPED tests as errors, deflating pass rates

In per_opcode_report at testkit.py:298-303, the else branch catches both ERROR and SKIPPED statuses, incrementing the "error" counter for skipped tests. This inflates the error count and deflates the pass rate. For example, an opcode with 1 pass and 1 skip shows a 50% pass rate instead of 100%.

Suggested change
if t.status == TestStatus.PASSED:
opcode_results[tag]["passed"] += 1
elif t.status == TestStatus.FAILED:
opcode_results[tag]["failed"] += 1
else:
opcode_results[tag]["error"] += 1
if t.status == TestStatus.PASSED:
opcode_results[tag]["passed"] += 1
elif t.status == TestStatus.FAILED:
opcode_results[tag]["failed"] += 1
elif t.status == TestStatus.ERROR:
opcode_results[tag]["error"] += 1
Open in Devin Review (Beta)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant