test: failing repro for vocab quiz scoring bugs (#189, #191) #195
Closed
davidortinau wants to merge 2 commits into main
Stream B Step 1 (Jayne). Adds 4 integration tests that pin down the expected post-state of VocabularyProgress after well-defined quiz interactions, run against a real EF Core + in-memory SQLite stack via PlanGenerationTestFixture (same pattern as MasteryAlgorithmIntegrationTests).

#189: Attempt counting / accuracy
- Repro189_SingleCorrectRecognitionAttempt_ProducesExpectedPanelState: PASS
- Repro189_SingleCorrectRecognition_LegacyProductionFieldsRemainZero: PASS

Both pass on main, which proves the ProgressService math is correct. Captain's '2 production attempts / 50% accuracy' panel reading therefore points at the UI panel reading legacy/wrong fields or a duplicate-call path, so the fix belongs in Stream A (Kaylee), not the service. Tests stay as regression guards for the service contract.

#191: Latter rounds rapidly empty
- Repro191_NewWord_AllCorrect_DoesNotRotateOutBeforeFifthTurn: FAIL on main
- Repro191_CharacterizeCurrentBehavior_FreshWordRotatesAtTurnN: PASS (snapshot)

Captured failure: a brand-new word receiving 4 all-correct answers (3 MC followed by 1 Text, which is the mode the quiz auto-selects once CurrentStreak >= 3) flips ReadyToRotateOut=True at turn 4. VocabularyQuizItem Tier 2 (mastery >= 0.50 OR streak >= 3, plus only SessionCorrectCount >= 2 and SessionTextCorrect >= 1) is the trigger. This is the over-aggressive rotation #191 describes. Test will pass after Wash tightens the Tier 2 gates.

No production code changes.
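The Tier 2 gate described above can be sketched in a few lines. This is a minimal Python sketch, not the C# in VocabularyQuizItem: `SessionState` and `tier2_ready` are hypothetical names introduced for illustration, and only the thresholds quoted above (mastery >= 0.50 OR streak >= 3, SessionCorrectCount >= 2, SessionTextCorrect >= 1) come from the PR text. Mastery is deliberately left at 0.0 to show that the streak branch alone is enough to open the gate.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    """Hypothetical stand-in for the per-word fields the Tier 2 gate reads."""
    mastery: float
    streak: int
    session_correct: int
    session_text_correct: int


def tier2_ready(s: SessionState) -> bool:
    # Trigger and floor as quoted in the PR text (pre-fix behavior on main).
    trigger = s.mastery >= 0.50 or s.streak >= 3
    floor = s.session_correct >= 2 and s.session_text_correct >= 1
    return trigger and floor


# Fresh word after 3 correct MC answers: the streak opens the trigger,
# but there has been no Text answer yet, so the floor is not met.
after_turn_3 = SessionState(mastery=0.0, streak=3, session_correct=3, session_text_correct=0)

# Turn 4 is the first Text answer (the quiz switches mode once the streak hits 3).
after_turn_4 = SessionState(mastery=0.0, streak=4, session_correct=4, session_text_correct=1)

print(tier2_ready(after_turn_3), tier2_ready(after_turn_4))
```

Because the first Text answer arrives on the turn right after the streak reaches 3, the floor is satisfied as soon as the word first sees Text, which is the turn-4 rotation the failing test captures.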
This was referenced May 3, 2026
davidortinau added a commit that referenced this pull request on May 3, 2026
* test: failing repro for vocab quiz scoring bugs (#189, #191)

* squad(jayne): log Stream B Step 1 outcome (vocab quiz repro #189 #191)

* fix(vocab-quiz): tighten rotation curve for fresh words (#191)

Closes #191.

Fresh words were rotating out of quiz rounds at turn 4 with all-correct answers, yielding only ~3 effective practice repetitions before the word disappeared. Two knobs are tuned to push the earliest legal rotation to turn 5 without regressing already-known words.

Production changes (2 lines):

1. VocabularyProgressService.cs: EFFECTIVE_STREAK_DIVISOR 7.0f -> 12.0f
   Slows the mastery climb so MasteryScore reaches Tier 1 (>= 0.80) on turn 8+ rather than turn 6, and crosses the 0.50 promotion floor on turn 6 rather than turn 4.
2. VocabularyQuizItem.cs: Tier 2 trigger OR -> AND, floor (2,1) -> (4,2)
   - Trigger: mastery >= 0.50 && streak >= 3 (was OR). Closes a corner case where a single Text correct on a fresh word could drop the word into Tier 2 via streak alone.
   - Floor: SessionCorrectCount >= 4 && SessionTextCorrect >= 2 (was >= 2 && >= 1). Requires demonstrably more session evidence before a mid-mastery word is allowed to rotate out.

Simulator: tools/quiz-rotation-sim/sim.py reproduces production math exactly.

Headline (fresh, all-correct):

| Turn | Current (/7, OR/2,1) | Proposed (/12, AND/4,2) |
|------|---------------------|--------------------------|
| 4 | mastery 0.714 -> ROTATES (bug) | mastery 0.417, no |
| 5 | mastery 1.000 | mastery 0.583 -> ROTATES |

Already-known words (mastery >= 0.80, streak >= 8) still rotate at the first qualifying turn (Tier 1 unchanged). Existing user MasteryScore data cannot regress: mastery is monotonic on correct (`max(streakScore, mastery)` in RecordAttemptAsync line 154).

Tests:
- Jayne's Repro191_NewWord_AllCorrect_DoesNotRotateOutBeforeFifthTurn flips FAIL -> PASS (PR #195 verification harness).
- ~10 mastery-math fixtures bumped to track the new divisor (5 MC + 2 Text -> 8 MC + 2 Text for IsKnown demonstrations; divisor literals /7.0f -> /12.0f).
- VocabQuizFilteringTests: Tier 2 floor test renamed and a new test Tier2_TriggerRequiresBothMasteryAndStreak added for the AND change.
- All 520 unit tests pass.

Language-tutor SLA review approved the turn-5 floor (vs turn-6) as the right balance between learner spaced-repetition load and within-session retention demonstration.

Follow-up (separate issue, not in this PR): decouple MasteryScore from SessionRotationReady so session pacing and long-term mastery tracking are independent levers.

Branched off PR #195 (Jayne's repro) so the fix lands together with its verification harness.

* squad(wash): log Stream B Step 3: #191 fix shipped via PR #198

* squad(wash): note PR #198 body cross-link to #197
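The headline numbers in the fix commit can be reproduced with a small stand-alone simulation. This is a rough sketch, not tools/quiz-rotation-sim/sim.py and not the production C#: the function names are hypothetical, Tier 1 is omitted, and the per-mode weights (a correct MC answer adds 1.0 to the effective streak, a correct Text answer adds 2.0) are an assumption inferred from the table values (5/7 = 0.714, 7/7 = 1.000, 5/12 = 0.417, 7/12 = 0.583). The mode rule, the thresholds, and the `max(streakScore, mastery)` monotonicity are taken from the PR text.

```python
def simulate(divisor, require_both, min_correct, min_text, turns=6):
    """Return (turn, mode, mastery, ready) rows for a fresh word answered correctly every turn."""
    streak = 0
    effective = 0.0  # weighted streak that feeds the mastery score
    mastery = 0.0
    correct = 0
    text_correct = 0
    rows = []
    for turn in range(1, turns + 1):
        # Mode rule quoted in the PR: MC until streak >= 3 OR mastery >= 0.5, then Text.
        mode = "Text" if (streak >= 3 or mastery >= 0.5) else "MC"
        streak += 1
        correct += 1
        if mode == "Text":
            text_correct += 1
            effective += 2.0  # ASSUMPTION: Text weight inferred from the headline table
        else:
            effective += 1.0  # ASSUMPTION: MC weight inferred from the headline table
        # Mastery never decreases on a correct answer.
        mastery = max(mastery, min(1.0, effective / divisor))
        if require_both:
            trigger = mastery >= 0.50 and streak >= 3
        else:
            trigger = mastery >= 0.50 or streak >= 3
        ready = trigger and correct >= min_correct and text_correct >= min_text
        rows.append((turn, mode, round(mastery, 3), ready))
    return rows


def first_rotation(rows):
    """First turn on which the Tier 2 gate opens, or None if it never does."""
    return next((turn for turn, _, _, ready in rows if ready), None)


current = simulate(divisor=7.0, require_both=False, min_correct=2, min_text=1)
proposed = simulate(divisor=12.0, require_both=True, min_correct=4, min_text=2)

for label, rows in (("current", current), ("proposed", proposed)):
    print(label, "first rotation at turn", first_rotation(rows))
    for row in rows:
        print("  ", row)
```

Under these assumptions the sketch rotates at turn 4 with the old settings and at turn 5 with the new ones, matching the table. If the real weights differ, the mastery column will differ too, so treat the production simulator as the source of truth.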
davidortinau added a commit that referenced this pull request on May 3, 2026
- PR #196 (Stream A UI fixes): closes #189/#190/#192/#193/#194
- PR #198 (Stream B scoring fix): closes #191
- PR #195 (test-only draft): superseded, closed
- Follow-ups filed: #197 (decouple Mastery from SessionRotation), #199 (test helper DifficultyWeight bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stream B Step 1 of the Vocabulary Quiz bug cluster — failing-first regression tests for #189 and #191. No production code changes. Author: Jayne (Tester).
These are repro/regression tests that lock down the expected behavior so Wash (backend) and Kaylee (UI) have unambiguous targets. Wash's fix should turn the failing test green; Kaylee's UI fix is informed by the diagnostic that the service-side tests already pass.
What was added
- tests/SentenceStudio.UnitTests/Integration/VocabQuizScoringRepro189And191Tests.cs: 4 tests using the existing PlanGenerationTestFixture (real EF Core + in-memory SQLite + DI), modeled on MasteryAlgorithmIntegrationTests.
Results on main (commit 2aab53d)

| Test | Result | Notes |
|------|--------|-------|
| Repro189_SingleCorrectRecognitionAttempt_ProducesExpectedPanelState | PASS | |
| Repro189_SingleCorrectRecognition_LegacyProductionFieldsRemainZero | PASS | ProductionAttempts/ProductionCorrect stay zero for a recognition turn |
| Repro191_NewWord_AllCorrect_DoesNotRotateOutBeforeFifthTurn | FAIL | ReadyToRotateOut=True at turn 4 |
| Repro191_CharacterizeCurrentBehavior_FreshWordRotatesAtTurnN | PASS (snapshot) | |

#189: disambiguated
Two competing hypotheses going in:
- (a) VocabularyProgressService double-increments on a single attempt.
- (b) The UI panel reads legacy/wrong fields, or a duplicate-call path fires.
Both #189 service-side tests PASS on main, so the service math is correct and hypothesis (b) stands. The "2 production attempts / 50% accuracy" panel readout has to come from the UI layer, not the service. The most likely culprits are:
- VocabQuiz.razor reading obsolete legacy fields (e.g., ProductionAttempts directly) instead of the new streak-based fields, or
- RecordPendingAttemptAsync firing twice (it is called from NextItem, OverrideAsCorrect, and one other site, so a duplicate-fire vector is possible).
Both are UI/quiz-page concerns belonging to Kaylee's Stream A. The two passing service tests stay as regression guards so the service contract can't silently regress while the UI fix is in flight.
#191: confirmed
Captured trace from the failing test (one fresh word, all answers correct, mode chosen the same way VocabQuiz.razor chooses it: MC until CurrentStreak >= 3 OR MasteryScore >= 0.5, then Text):
Failure message:
Root cause is in VocabularyQuizItem.ReadyToRotateOut Tier 2 (lines 33–55 of VocabularyQuizItem.cs): once MasteryScore >= 0.50 OR CurrentStreak >= 3, the only additional gates are SessionCorrectCount >= 2 AND SessionTextCorrect >= 1. With the quiz's mode auto-flip kicking Text in at turn 4, those gates are met immediately. That matches Captain's report of 26 fresh words mastered in 58 turns over 8 rounds (~2.2 turns/word).
How Wash should use this
- Tighten the Tier 2 gates in VocabularyQuizItem.ReadyToRotateOut (and/or the mode-flip threshold). Suggested rough targets: more required Text-correct turns, higher mastery floor, or per-word session minimums independent of the global session counters. Don't pick the curve unilaterally; discuss with Captain via decisions.md first.
- Run dotnet test --filter VocabQuizScoringRepro189And191Tests. Both Repro191_* tests will need attention after the fix:
  - Repro191_NewWord_AllCorrect_DoesNotRotateOutBeforeFifthTurn should pass.
  - Repro191_CharacterizeCurrentBehavior_* should be updated to reflect the new first-rotation turn (or removed once the curve is canonical).
How Kaylee should use this
- The Repro189_* tests prove the service is fine. Don't touch VocabularyProgressService.RecordAttemptAsync for #189.
- Audit VocabQuiz.razor (lines ~395–460) for any reads of legacy obsolete fields and replace them with streak-based equivalents.
- Audit RecordPendingAttemptAsync for duplicate-fire (NextItem ~1245, OverrideAsCorrect ~1394, plus the third site near 1490).
Out of scope for this PR
- Production code changes (VocabularyProgressService, VocabularyQuizItem, VocabQuiz.razor, etc.).
- ChooseQuizModeForTurn mirrors the current VocabQuiz.razor rule verbatim; if the rule moves, the helper moves with it.
Verification
Branch: test/vocab-quiz-scoring-repro-189-191, off main (2aab53d). No conflicts with Kaylee's fix/vocab-quiz-ui-cluster-189-194.