fix(parser): sanitize lone UTF-16 surrogates before JSONL parsing (fixes #85) #88

Merged: delexw merged 1 commit into main from fix-issue-85 on May 17, 2026

@delexw delexw commented May 16, 2026

Problem

JSONL files written by Claude Code before v2.1.132 may contain lone UTF-16 surrogate code units (e.g. \uD83D without a matching \uDCxx low surrogate) inside tool_result content block strings. This happens when Claude Code's tool-error truncation logic cuts the text between the two halves of an emoji's surrogate pair.

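serde_json rejects lone surrogates in string escapes: a Rust String cannot represent them, and RFC 8259 leaves the handling of unpaired surrogates to the implementation. As a result, parse_entry silently discards any JSONL line that contains one.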
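To illustrate how such a lone surrogate can arise, here is a minimal sketch using only the Rust standard library. It assumes the truncation counts UTF-16 code units without checking pair boundaries; Claude Code's actual truncation logic is not shown in this PR, and both function names are hypothetical.

```rust
/// Hypothetical helper: keep at most `max_units` UTF-16 code units,
/// without checking whether the cut lands inside a surrogate pair.
fn truncate_utf16_units(text: &str, max_units: usize) -> Vec<u16> {
    text.encode_utf16().take(max_units).collect()
}

/// Hypothetical helper: true if the code units contain a surrogate
/// that is not part of a valid high/low pair.
fn has_lone_surrogate(units: &[u16]) -> bool {
    std::char::decode_utf16(units.iter().copied()).any(|r| r.is_err())
}
```

The dog emoji U+1F436 is encoded in UTF-16 as the pair D83D DC36, so truncating "ok 🐶" to four code units keeps D83D and drops DC36. The result can no longer be converted back into a Rust String.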

Fix

Add a sanitize_lone_surrogates() pre-pass that runs on the raw JSONL text before handing it to serde_json::from_str. It scans for \uXXXX escape sequences in the surrogate range (U+D800–U+DFFF) and:

  • Lone high surrogate (\uD8xx–\uDBxx not followed by \uDCxx–\uDFxx): replaced with U+FFFD (�)
  • Lone low surrogate (\uDCxx–\uDFxx not preceded by a valid high surrogate): replaced with U+FFFD (�)
  • Valid surrogate pair (\uD8xx–\uDBxx followed immediately by \uDCxx–\uDFxx): preserved as-is

Allocation is deferred: strings containing no surrogates return Cow::Borrowed with zero copies.
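The pre-pass described above could look roughly like the following sketch. It is not the code merged in this PR: the names hex4_to_u16 and sanitize_lone_surrogates come from the description, but the bodies, the escape_at and push_replacement helpers, and the choice to emit a literal U+FFFD character are assumptions. It also assumes that other backslash escapes (such as `\\`) are skipped as two-byte units, so an escaped backslash followed by `uD83D` is left alone.

```rust
use std::borrow::Cow;

/// Parse exactly four ASCII hex digits into a u16; None if malformed.
fn hex4_to_u16(digits: &[u8]) -> Option<u16> {
    if digits.len() != 4 {
        return None;
    }
    let mut value: u16 = 0;
    for &b in digits {
        value = (value << 4) | (b as char).to_digit(16)? as u16;
    }
    Some(value)
}

/// Hypothetical helper: code unit of a `\uXXXX` escape starting at `i`, if any.
fn escape_at(bytes: &[u8], i: usize) -> Option<u16> {
    if bytes.get(i) == Some(&b'\\') && bytes.get(i + 1) == Some(&b'u') {
        hex4_to_u16(bytes.get(i + 2..i + 6)?)
    } else {
        None
    }
}

/// Hypothetical helper: copy the pending text, then U+FFFD in place of
/// the six-byte escape at `at`. Allocates the buffer on first use.
fn push_replacement(out: &mut Option<String>, input: &str, copied: &mut usize, at: usize) {
    let buf = out.get_or_insert_with(|| String::with_capacity(input.len()));
    buf.push_str(&input[*copied..at]);
    buf.push('\u{FFFD}');
    *copied = at + 6;
}

fn sanitize_lone_surrogates(input: &str) -> Cow<'_, str> {
    let bytes = input.as_bytes();
    let mut out: Option<String> = None; // stays None if nothing is replaced
    let mut copied = 0; // input[..copied] has already been written to `out`
    let mut i = 0;

    while i < bytes.len() {
        if bytes[i] != b'\\' {
            i += 1;
            continue;
        }
        match escape_at(bytes, i) {
            Some(0xD800..=0xDBFF) => {
                // A high surrogate is valid only if a low one follows at once.
                if matches!(escape_at(bytes, i + 6), Some(0xDC00..=0xDFFF)) {
                    i += 12; // keep the whole pair untouched
                } else {
                    push_replacement(&mut out, input, &mut copied, i);
                    i += 6;
                }
            }
            Some(0xDC00..=0xDFFF) => {
                // A low surrogate reached here was not consumed as part of a pair.
                push_replacement(&mut out, input, &mut copied, i);
                i += 6;
            }
            Some(_) => i += 6, // ordinary \uXXXX escape
            None => i += 2,    // other escape such as \\ or \n: skip both bytes
        }
    }

    match out {
        Some(mut buf) => {
            buf.push_str(&input[copied..]);
            Cow::Owned(buf)
        }
        None => Cow::Borrowed(input),
    }
}
```

Keeping the output buffer in an Option means the common case (no lone surrogates) never allocates and returns Cow::Borrowed, which matches the zero-copy fast path the description mentions.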

parse_entry is updated to convert the raw bytes to &str first (failing fast on non-UTF-8, which serde_json::from_slice would also reject) and then apply the sanitizer before deserialization.

Files Changed

  • src-tauri/src/parser/entry.rs — added hex4_to_u16, sanitize_lone_surrogates, updated parse_entry

Tests Added

10 new tests in parser::entry::tests:

| Test | Verifies |
|---|---|
| `sanitize_lone_surrogates_no_surrogates_returns_borrowed` | zero-copy fast path |
| `sanitize_lone_high_surrogate_replaced_with_fffd` | lone \uD83D |
| `sanitize_lone_low_surrogate_replaced_with_fffd` | lone \uDC36 |
| `sanitize_valid_surrogate_pair_unchanged` | valid pair preserved |
| `sanitize_multiple_lone_surrogates_all_replaced` | two lone surrogates, both replaced |
| `sanitize_high_surrogate_at_end_of_string_replaced` | lone surrogate at string boundary |
| `parse_entry_with_lone_high_surrogate_succeeds` | end-to-end: line parses instead of returning None |
| `parse_entry_with_lone_low_surrogate_succeeds` | end-to-end: lone low surrogate |
| `parse_entry_with_valid_surrogate_pair_succeeds` | end-to-end: valid pair still parses |

All 390 Rust tests and 352 frontend tests pass.

Fixes #85

JSONL files written by Claude Code before v2.1.132 may contain lone
UTF-16 surrogate code units (e.g. `\uD83D` without a matching low
surrogate) when the tool-error truncation logic split a multi-byte emoji
at an offset boundary. serde_json rejects lone surrogates per RFC 8259,
causing parse_entry to silently discard those lines.

Add sanitize_lone_surrogates() which scans the raw JSONL string for
`\uXXXX` escape sequences in the surrogate range (U+D800-U+DFFF) and
replaces lone surrogates with `�` before the JSON deserializer sees
them. Valid surrogate pairs (\uD8xx followed immediately by \uDCxx) are
preserved unchanged. Allocation is deferred: strings with no surrogates
return Cow::Borrowed with zero copies.

Update parse_entry to convert the input bytes to str (failing fast on
non-UTF-8) and apply the sanitizer before serde_json::from_str.

Closes #85
@delexw delexw merged commit 833e266 into main May 17, 2026
1 check failed
@delexw delexw deleted the fix-issue-85 branch May 17, 2026 01:21
Development

Successfully merging this pull request may close these issues.

[Compat] Claude Code v2.1.132: Tool error truncation may write lone UTF-16 surrogates into JSONL (emoji splits)