
fix: preserve multi-byte UTF-8 across sidebar-agent stdout chunks#1007

Open
chappse6 wants to merge 2 commits into garrytan:main from chappse6:fix/korean-utf8-streaming

Conversation

@chappse6

Summary

The sidebar-agent streams claude -p stdout back to the Chrome extension. When claude emitted non-ASCII text — Korean, Japanese, Chinese, emoji, etc. — characters near chunk boundaries were mojibake'd in the sidebar. For example, Korean "합니다" would render as "핣니다".

Root cause

browse/src/sidebar-agent.ts decoded each stdout/stderr chunk independently:

proc.stdout.on('data', (data: Buffer) => {
  buffer += data.toString();   // default 'utf8'
  ...
});

A Hangul syllable is 3 bytes in UTF-8; emoji are 4. If an OS pipe flush happens to split a multi-byte sequence across two data events, Buffer.toString('utf8') replaces the trailing partial bytes of chunk N with U+FFFD and the next chunk begins mid-sequence — corrupting both. The symptom is probabilistic, tied to pipe/stream-json flush timing, which is why it showed up for longer responses with dense Hangul.

ASCII-only output is unaffected (1 byte = 1 code point, no boundary can fall inside a character).

Fix

Route stdout/stderr through Node's built-in StringDecoder, which buffers trailing partial code units across chunks and only emits fully decoded text. Flush any remainder on process close.

  • New tiny helper: browse/src/utf8-stream-decoder.ts (thin StringDecoder wrapper, isolated so it's unit-testable)
  • sidebar-agent.ts uses one decoder instance per stream per child process
  • No new dependencies (string_decoder is a Node built-in)

Test plan

  • bun test browse/test/utf8-stream-decoder.test.ts — 7 pass, 424 expects
  • Tests split Korean / Japanese / Chinese / 4-byte emoji / mixed text at every byte offset and verify round-trip equality
  • Tests confirm ASCII JSON stream-json fixtures decode identically to before
  • Full bun test suite — same 5 pre-existing failures as main baseline (golden-file ship skills, VERSION mismatch, uninstall mock layout); no new failures introduced
  • Verified scope: grep "stdout.on('data'" browse/src/sidebar-agent.ts is the only production site using this pattern

Impact

  • English / ASCII output: byte-identical to before
  • Korean, Japanese, Chinese, emoji, and any other multi-byte UTF-8: no longer corrupts across chunk boundaries
  • Risk: near-zero — same input produces the same decoded string for all ASCII; previously-corrupted inputs now decode correctly. No changes to parsing, event handling, or process lifecycle.

Korean, Japanese, Chinese, and emoji characters streamed from claude via
sidebar-agent were mojibake'd when a multi-byte UTF-8 code point landed
on a Buffer chunk boundary — e.g. Korean "합니다" rendered as "핣니다"
in the sidebar. Per-chunk Buffer.toString('utf8') replaces partial
sequences with U+FFFD and the next chunk starts mid-sequence, corrupting
both chunks. ASCII-only streams are unaffected (1 byte = 1 code point).

Route proc.stdout and proc.stderr through a small StringDecoder wrapper
that buffers partial code units across chunks, and flush on close.

Adds a unit test that splits Korean, Japanese, Chinese, 4-byte emoji,
and mixed text at every possible byte offset and verifies the decoded
string round-trips the original.

Address review feedback:
- Extend utf8-stream-decoder.ts JSDoc to explain why the thin wrapper
  exists (unit-testable contract vs. inline StringDecoder usage).
- Annotate the `end()` flush on process close so the intent is obvious
  without grepping StringDecoder's docs.

No behavior change.
