Strip invisible Unicode from content model at editor initialization by romanisa · Pull Request #3299 · microsoft/roosterjs

romanisa · 2026-03-05T07:19:25Z

Summary

Strips invisible Unicode characters (zero-width chars, bidirectional marks, Unicode Tags, etc.) from text and link hrefs as a defense-in-depth measure against hidden content injection via mailto: links.

Key design decision: Sanitization runs once at editor initialization rather than on every DOM-to-model conversion, avoiding performance overhead on the hot path.

Problem

Invisible Unicode characters embedded in mailto: links can bypass Human-in-the-Loop (HiTL) review: users see a benign-looking email draft, but hidden content may be present. This is a secondary safety net (the primary fix is upstream in Copilot/BizChat).

Changes

File	Purpose
`stripInvisibleUnicode.ts`	Core utility: a regex strips ~30 categories of invisible chars from raw strings
`sanitizeInvisibleUnicode.ts`	New: walks the entire content model at init and sanitizes text segments, link hrefs, and Text nodes inside General segment elements
`Editor.ts`	Calls `sanitizeInvisibleUnicode(initialModel)` in constructor before `setContentModel`
`checkXss.ts`	Strips invisible Unicode before `script:` XSS check (covers `insertLink` API at runtime)
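The one-time walk described in the table can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the `Sketch*` types are hypothetical, simplified stand-ins for the roosterjs content model (no tables or General segments), and `stripSketch` covers only a few of the ranges the real utility handles.

```typescript
// Hypothetical, simplified shapes; the real roosterjs content model is richer
// (tables, list items, General segments, entities, and so on).
interface SketchLink {
    format: { href?: string };
}

interface SketchText {
    segmentType: 'Text';
    text: string;
    link?: SketchLink;
}

interface SketchParagraph {
    blockType: 'Paragraph';
    segments: SketchText[];
}

interface SketchGroup {
    blockType: 'BlockGroup';
    blocks: SketchBlock[];
}

type SketchBlock = SketchParagraph | SketchGroup;

// Stand-in for stripInvisibleUnicode: only a few of the ranges from this PR.
const INVISIBLE_SUBSET = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069]/g;

function stripSketch(value: string): string {
    return value.replace(INVISIBLE_SUBSET, '');
}

// Walk the model once, in place, cleaning text and link hrefs.
function sanitizeModelSketch(group: { blocks: SketchBlock[] }): void {
    for (const block of group.blocks) {
        if (block.blockType === 'BlockGroup') {
            sanitizeModelSketch(block); // recurse into nested block groups
            continue;
        }
        for (const segment of block.segments) {
            segment.text = stripSketch(segment.text);
            const href = segment.link?.format.href;
            if (segment.link && href !== undefined) {
                segment.link.format.href = stripSketch(href);
            }
        }
    }
}
```

Because the walk mutates the model in place and runs once before the model is applied, later DOM-to-model conversions pay no extra cost, which is the trade-off the summary calls out.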

What was removed from earlier iterations

linkProcessor.ts: no longer strips on every DOM-to-model pass (hot path)
linkFormatHandler.ts: no longer strips on every format parse (hot path)
sanitizeElement.ts: no longer strips during HTML sanitization
decodeURIComponent pre-processing: removed per review feedback; percent-encoded content may be intentional

Characters stripped

Zero-width characters (U+200B-U+200F)
Bidirectional controls (U+202A-U+202E, U+2066-U+2069)
Unicode Tags (U+E0001-U+E00FF)
Soft hyphens, BOM, word joiners, Hangul fillers, Mongolian FVS, interlinear annotation anchors, and other invisible formatting chars
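A regex covering the ranges listed above might look like the sketch below. It is an approximation for illustration only: the PR's actual pattern is not shown here, the function name is hypothetical, and the single code points chosen for soft hyphen (U+00AD), word joiner (U+2060), Hangul filler (U+3164), and BOM (U+FEFF) are assumptions about which members of those categories are targeted.

```typescript
// Illustrative only: covers the ranges named in the list above, not the PR's full set.
// Unicode Tags (U+E0001-U+E00FF) are matched as UTF-16 surrogate pairs so the
// pattern works without the `u` flag: high surrogate U+DB40, low U+DC01-U+DCFF.
const LISTED_INVISIBLE_CHARS =
    /[\u00AD\u180B-\u180D\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\u3164\uFEFF\uFFF9-\uFFFB]|\uDB40[\uDC01-\uDCFF]/g;

// Hypothetical helper name; removes every match from a raw string.
function stripListedRangesSketch(value: string): string {
    return value.replace(LISTED_INVISIBLE_CHARS, '');
}
```

Visible non-ASCII text and ordinary emoji fall outside these ranges and pass through unchanged. Note that U+200D (zero-width joiner) lies inside the first range, so a pattern this broad would also split joined emoji sequences.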

Test plan

33 unit tests for stripInvisibleUnicode (individual chars, combinations, supplementary plane, emoji preservation, visible Unicode preservation)
9 unit tests for sanitizeInvisibleUnicode (text segments, link hrefs, tables, General segments, nested block groups, empty models)
12 unit tests for checkXss (XSS detection with invisible Unicode bypass attempts)
All 109 related tests pass on Chrome 145

🤖 Generated with Claude Code

Strip invisible Unicode characters (zero-width chars, bidirectional marks, Unicode Tags U+E0001-E007F, etc.) from link href attributes at multiple layers to prevent hidden content injection via mailto: links.

Changes:
- Add stripInvisibleUnicode utility in roosterjs-content-model-dom
- Apply stripping in sanitizeElement.ts (HTML paste/sanitization path)
- Apply stripping in checkXss.ts (programmatic link insertion path)
- Apply stripping in linkFormatHandler.ts (DOM-to-model conversion path)
- Apply stripping in linkProcessor.ts (DOM-to-model conditional check)
- Add comprehensive unit tests for all changes

Bug: https://outlookweb.visualstudio.com/Outlook%20Web/_workitems/edit/409639

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Strips invisible Unicode characters from link href values across sanitization, model conversion, and XSS checking to prevent hidden-content injection (notably via mailto:).

Changes:

Introduces stripInvisibleUnicode() utility and exports it from roosterjs-content-model-dom.
Applies stripping during DOM→model conversion (linkProcessor, linkFormatHandler) and HTML sanitization (sanitizeElement).
Updates checkXss() to strip invisible Unicode before evaluating script: patterns and adds unit tests across layers.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts	Adds the core stripping utility via a consolidated regex.
packages/roosterjs-content-model-dom/lib/index.ts	Exposes `stripInvisibleUnicode` from the package entrypoint.
packages/roosterjs-content-model-dom/lib/formatHandlers/segment/linkFormatHandler.ts	Strips invisibles when reading link formats from DOM attributes.
packages/roosterjs-content-model-dom/lib/domToModel/processors/linkProcessor.ts	Strips invisibles while processing `<a href>` into the content model.
packages/roosterjs-content-model-core/lib/command/createModelFromHtml/sanitizeElement.ts	Strips invisibles when sanitizing `href` attributes.
packages/roosterjs-content-model-dom/test/domUtils/stripInvisibleUnicodeTest.ts	Adds focused unit coverage for stripping behavior across many code points.
packages/roosterjs-content-model-dom/test/domToModel/processors/linkProcessorTest.ts	Ensures DOM→model link processing strips invisible Unicode in `href`.
packages/roosterjs-content-model-core/test/command/createModelFromHtml/sanitizeElementTest.ts	Ensures sanitizer strips invisibles from `href` but not unrelated attributes.
packages/roosterjs-content-model-api/lib/publicApi/utils/checkXss.ts	Strips invisibles before XSS detection and returns sanitized links.
packages/roosterjs-content-model-api/test/publicApi/utils/checkXssTest.ts	Adds tests for invisible Unicode stripping + `script:` obfuscation detection.

packages/roosterjs-content-model-core/lib/command/createModelFromHtml/sanitizeElement.ts

packages/roosterjs-content-model-dom/lib/formatHandlers/segment/linkFormatHandler.ts

packages/roosterjs-content-model-core/lib/command/createModelFromHtml/sanitizeElement.ts

…t/linkFormatHandler.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…, use strict equality
- Strip invisible Unicode from href BEFORE the script: regex check to prevent XSS bypass (e.g., s\u200Bcript: passing the check then being stripped to script:)
- Guard against empty href after stripping in linkFormatHandler (only set format.href when sanitizedHref is non-empty)
- Use strict equality (===) for attribute name checks
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
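The ordering issue described in this commit can be illustrated with a small sketch. Both function names, the simplified `script:` pattern, and the convention of returning an empty string for a rejected link are assumptions made for illustration; they are not copied from `checkXss.ts`.

```typescript
// Subset of invisible characters, enough to demonstrate the bypass.
const INVISIBLE_SUBSET = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069]/g;
// Simplified scheme check (assumed); the real pattern may be more tolerant.
const SCRIPT_SCHEME = /script:/i;

// Vulnerable order: the check sees "s\u200Bcript:" and lets it through,
// then stripping turns it back into "script:".
function checkThenStripSketch(link: string): string {
    if (SCRIPT_SCHEME.test(link)) {
        return '';
    }
    return link.replace(INVISIBLE_SUBSET, '');
}

// Fixed order: strip first so the check sees what will actually be used.
function stripThenCheckSketch(link: string): string {
    const sanitized = link.replace(INVISIBLE_SUBSET, '');
    return SCRIPT_SCHEME.test(sanitized) ? '' : sanitized;
}
```

With the input `'javas\u200Bcript:alert(1)'`, the first function returns a working `javascript:` URL while the second rejects it.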

… comments
Extend the stripped character set to include Mongolian free variation selectors (U+180B-180D), interlinear annotation anchors (U+FFF9-FFFB), and extended Unicode Tags (U+E0080-E00FF). Add defense-in-depth comments at each call site, document ZWJ/emoji and URL-encoding limitations, and add 4 new tests for the expanded character ranges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add decodeURIComponent before stripping so that URL-encoded invisible characters (e.g. %E2%80%8B for U+200B) are also caught. Falls back gracefully on malformed percent-encoding. Adds 5 new tests for URL-encoded scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts

…hot path
Remove stripInvisibleUnicode from linkProcessor, linkFormatHandler, and sanitizeElement (all hot paths that run on every DOM-to-model conversion). Instead, sanitize the entire initial model once in the Editor constructor via new sanitizeInvisibleUnicode() utility. This walks all text segments, link hrefs, and Text nodes inside General segment elements. The checkXss path still covers insertLink at runtime.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts

Per PR review feedback, do not decode percent-encoded sequences as the content may be intentional user input. Only strip raw invisible Unicode characters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
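The trade-off behind this revert can be sketched by comparing the two behaviours side by side. Both helpers are hypothetical reconstructions from the commit messages, not the PR's code, and they use only a subset of the stripped ranges.

```typescript
const INVISIBLE_SUBSET = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069]/g;

// Earlier iteration (since removed): decode percent-encoding first and fall
// back to the raw value when the encoding is malformed.
function decodeThenStripSketch(href: string): string {
    let decoded = href;
    try {
        decoded = decodeURIComponent(href);
    } catch {
        // Malformed percent-encoding throws URIError; keep the raw string.
    }
    return decoded.replace(INVISIBLE_SUBSET, '');
}

// Behaviour after the revert: only raw invisible characters are removed and
// percent-encoded sequences are left exactly as written.
function stripRawOnlySketch(href: string): string {
    return href.replace(INVISIBLE_SUBSET, '');
}
```

In this sketch the decoding variant catches `%E2%80%8B` but also rewrites intentional encodings such as `%25` or `%20`, which is the kind of side effect the review feedback points at.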

romanisa requested review from JiuqingSong and Copilot March 5, 2026 17:09

Copilot AI reviewed Mar 5, 2026

romanisa and others added 2 commits March 5, 2026 13:07

Update packages/roosterjs-content-model-dom/lib/formatHandlers/segmen…

f0da586

…t/linkFormatHandler.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

romanisa self-assigned this Mar 5, 2026

romanisa requested review from BryanValverdeU and juliaroldi March 5, 2026 21:39

romanisa and others added 2 commits March 5, 2026 14:36

JiuqingSong reviewed Mar 9, 2026

packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts (outdated, resolved)

JiuqingSong reviewed Mar 12, 2026

packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts (outdated, resolved)

fix: remove decodeURIComponent from stripInvisibleUnicode

26bbb44

romanisa changed the title ~~Strip invisible Unicode from link hrefs for defense-in-depth~~ Strip invisible Unicode from content model at editor initialization Mar 13, 2026

JiuqingSong approved these changes Mar 13, 2026

romanisa wants to merge 7 commits into microsoft:master from romanisa:romasha/strip-invisible-unicode
