Complete scrub_surrogates' lone-surrogate coverage (CR follow-up to #146)#150
Conversation
CR observed (verified empirically): scrub_surrogates raises UnicodeEncodeError on lone HIGH surrogates (U+D800–U+DBFF) because Python's surrogateescape error handler only back-maps the LOW range (U+DC80–U+DCFF) produced when decoding raw bytes. High surrogates can still appear in JSONL via explicit \uD800–\uDBFF escapes from upstream tools — function name implied broader coverage than the implementation delivered. Pre-substitute high surrogates with U+FFFD via re.sub before the surrogateescape encode step. Low-range handling unchanged. Both ranges now produce U+FFFD as the on-disk sentinel. Extended parametrized tests: - test_surrogate_encoding.py renamed test_various_low_surrogates_handled → test_various_lone_surrogates_handled, added U+D800 and U+DBFF. - test_tui_surrogate.py parametrized over both low (U+DCB2) and high (U+D800) surrogates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThe PR extends surrogate handling in the UTF-8 scrubbing helper to cover both low (U+DC80–U+DCFF) and high (U+D800–U+DBFF) lone surrogate code points. A regex pattern pre-substitutes high surrogates before the existing surrogateescape round-trip, and test parametrization expands coverage across both ranges. ChangesExtend Surrogate Handling to High Surrogates
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
CodeRabbit's review on PR #146 flagged that
scrub_surrogatesraisesUnicodeEncodeErroron lone high surrogates (U+D800–U+DBFF) — verified empirically. Python'ssurrogateescapeerror handler only back-maps the low range (U+DC80–U+DCFF) produced when decoding raw bytes; lone high surrogates have no mapping and the encode step raises on them. The function name's promise ("scrub lone surrogates") implied broader coverage than the implementation delivered.This PR pre-substitutes high surrogates with U+FFFD via
re.subbefore thesurrogateescapeencode step. Low-range handling unchanged. Both ranges now produce U+FFFD as the on-disk sentinel.Verification
Pre-fix,
scrub_surrogates(chr(0xD800))raised. Post-fix:Test plan
test/test_surrogate_encoding.py::test_various_lone_surrogates_handled— parametrize extended over the full lone-surrogate range: U+DC80, U+DCB2, U+DCFF (low), U+D800, U+DBFF (high). Renamed fromtest_various_low_surrogates_handledfor accuracy.test/test_tui_surrogate.py::test_session_export_scrubs_lone_surrogate— parametrized over both ranges (low U+DCB2, high U+D800). This also closes CR's parametrize-the-TUI-test nitpick from the same review (one commit addresses both findings).just ciclean — 1721+ collected, ruff + pyright + ty all green.Out-of-scope but related
CR's review mentioned the proposed regex
r"[\ud800-\udbff]"covers exactly the high-surrogate range. The full Unicode surrogate block is U+D800-U+DFFF; the low half (U+DC00-U+DCFF) is whatsurrogateescapealready handles via the encode round-trip, so we only need to pre-substitute the high half. Function docstring updated to spell out the two-mechanism strategy.Summary by CodeRabbit
Bug Fixes
Tests