
fix: preserve multi-byte UTF-8 across sidebar-agent stdout chunks#1007

Open
chappse6 wants to merge 2 commits into garrytan:main from chappse6:fix/korean-utf8-streaming

Conversation

@chappse6

Summary

The sidebar-agent streams claude -p stdout back to the Chrome extension. When claude emitted non-ASCII text — Korean, Japanese, Chinese, emoji, etc. — characters near chunk boundaries were mojibake'd in the sidebar. For example, Korean "합니다" would render as "핣니다".

Root cause

browse/src/sidebar-agent.ts decoded each stdout/stderr chunk independently:

proc.stdout.on('data', (data: Buffer) => {
  buffer += data.toString();   // default 'utf8'
  ...
});

A Hangul syllable is 3 bytes in UTF-8; emoji are 4. If an OS pipe flush happens to split a multi-byte sequence across two data events, Buffer.toString('utf8') replaces the trailing partial bytes of chunk N with U+FFFD and the next chunk begins mid-sequence — corrupting both. The symptom is probabilistic, tied to pipe/stream-json flush timing, which is why it showed up for longer responses with dense Hangul.

ASCII-only output is unaffected (1 byte = 1 code point, no boundary can fall inside a character).

Fix

Route stdout/stderr through Node's built-in StringDecoder, which buffers trailing partial code units across chunks and only emits fully decoded text. Flush any remainder on process close.

  • New tiny helper: browse/src/utf8-stream-decoder.ts (thin StringDecoder wrapper, isolated so it's unit-testable)
  • sidebar-agent.ts uses one decoder instance per stream per child process
  • No new dependencies (string_decoder is a Node built-in)

Test plan

  • bun test browse/test/utf8-stream-decoder.test.ts — 7 pass, 424 expects
  • Tests split Korean / Japanese / Chinese / 4-byte emoji / mixed text at every byte offset and verify round-trip equality
  • Tests confirm ASCII JSON stream-json fixtures decode identically to before
  • Full bun test suite — same 5 pre-existing failures as main baseline (golden-file ship skills, VERSION mismatch, uninstall mock layout); no new failures introduced
  • Verified scope: grep "stdout.on('data'" browse/src/sidebar-agent.ts is the only production site using this pattern

Impact

  • English / ASCII output: byte-identical to before
  • Korean, Japanese, Chinese, emoji, and any other multi-byte UTF-8: no longer corrupts across chunk boundaries
  • Risk: near-zero — same input produces the same decoded string for all ASCII; previously-corrupted inputs now decode correctly. No changes to parsing, event handling, or process lifecycle.

Korean, Japanese, Chinese, and emoji characters streamed from claude via
sidebar-agent were mojibake'd when a multi-byte UTF-8 code point landed
on a Buffer chunk boundary — e.g. Korean "합니다" rendered as "핣니다"
in the sidebar. Per-chunk Buffer.toString('utf8') replaces partial
sequences with U+FFFD and the next chunk starts mid-sequence, corrupting
both chunks. ASCII-only streams are unaffected (1 byte = 1 code point).

Route proc.stdout and proc.stderr through a small StringDecoder wrapper
that buffers partial code units across chunks, and flush on close.

Adds a unit test that splits Korean, Japanese, Chinese, 4-byte emoji,
and mixed text at every possible byte offset and verifies the decoded
string round-trips the original.

Address review feedback:
- Extend utf8-stream-decoder.ts JSDoc to explain why the thin wrapper
  exists (unit-testable contract vs. inline StringDecoder usage).
- Annotate the `end()` flush on process close so the intent is obvious
  without grepping StringDecoder's docs.

No behavior change.
