Strip invisible Unicode from content model at editor initialization #3299

Open
romanisa wants to merge 7 commits into microsoft:master from romanisa:romasha/strip-invisible-unicode

romanisa (Contributor) commented Mar 5, 2026

Summary

Strips invisible Unicode characters (zero-width chars, bidirectional marks, Unicode Tags, etc.) from text and link hrefs as a defense-in-depth measure against hidden content injection via mailto: links.

Key design decision: Sanitization runs once at editor initialization rather than on every DOM-to-model conversion, avoiding performance overhead on the hot path.

Problem

Invisible Unicode characters embedded in mailto: links can bypass Human-in-the-Loop (HiTL) review: users see a benign-looking email draft, but hidden content may be present. This is a secondary safety net (the primary fix is upstream in Copilot/BizChat).

Changes

| File | Purpose |
| --- | --- |
| stripInvisibleUnicode.ts | Core utility: a regex strips ~30 categories of invisible chars from raw strings |
| sanitizeInvisibleUnicode.ts | New: walks the entire content model at init and sanitizes text segments, link hrefs, and Text nodes inside General segment elements |
| Editor.ts | Calls sanitizeInvisibleUnicode(initialModel) in the constructor before setContentModel |
| checkXss.ts | Strips invisible Unicode before the script: XSS check (covers the insertLink API at runtime) |
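
As a rough illustration of the init-time walk described for sanitizeInvisibleUnicode.ts, here is a minimal sketch. The model shapes (`TextSegment`, `Paragraph`, `Table`, `NestedGroup`, `DocumentModel`) are simplified, hypothetical stand-ins rather than the real roosterjs types, the regex is abbreviated, and the handling of Text nodes inside General segments is left out because it needs a DOM.

```typescript
// Simplified, hypothetical model shapes; the real content model has more fields.
interface TextSegment {
    segmentType: 'Text';
    text: string;
    link?: { format: { href?: string } };
}
interface Paragraph {
    blockType: 'Paragraph';
    segments: TextSegment[];
}
interface Table {
    blockType: 'Table';
    rows: { cells: { blocks: Block[] }[] }[];
}
interface NestedGroup {
    blockType: 'BlockGroup';
    blocks: Block[];
}
type Block = Paragraph | Table | NestedGroup;
interface DocumentModel {
    blocks: Block[];
}

// Abbreviated stand-in for the full character set used by stripInvisibleUnicode.
const INVISIBLE = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069\uFEFF]/g;

function strip(value: string): string {
    return value.replace(INVISIBLE, '');
}

// Recursively visit every block, mutating text and link hrefs in place.
function sanitizeBlocks(blocks: Block[]): void {
    for (const block of blocks) {
        switch (block.blockType) {
            case 'Paragraph':
                for (const segment of block.segments) {
                    segment.text = strip(segment.text);
                    const format = segment.link ? segment.link.format : undefined;
                    if (format && format.href) {
                        format.href = strip(format.href);
                    }
                }
                break;
            case 'Table':
                for (const row of block.rows) {
                    for (const cell of row.cells) {
                        sanitizeBlocks(cell.blocks);
                    }
                }
                break;
            case 'BlockGroup':
                sanitizeBlocks(block.blocks);
                break;
        }
    }
}

// Intended to run once, before the initial model is handed to the editor.
function sanitizeModel(model: DocumentModel): void {
    sanitizeBlocks(model.blocks);
}

// Usage: one linked text segment at the top level, one segment inside a table cell.
const greeting: TextSegment = {
    segmentType: 'Text',
    text: 'Hi\u200B there',
    link: { format: { href: 'mailto:a\u202E@example.com' } },
};
const cellText: TextSegment = { segmentType: 'Text', text: '\uFEFFcell' };
const model: DocumentModel = {
    blocks: [
        { blockType: 'Paragraph', segments: [greeting] },
        {
            blockType: 'Table',
            rows: [{ cells: [{ blocks: [{ blockType: 'Paragraph', segments: [cellText] }] }] }],
        },
    ],
};
sanitizeModel(model);
```

Mutating the model in place keeps the walk to a single pass with no extra allocation, which fits the stated goal of paying the cost once at initialization instead of on every DOM-to-model conversion.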

What was removed from earlier iterations

  • linkProcessor.ts: no longer strips on every DOM-to-model pass (hot path)
  • linkFormatHandler.ts: no longer strips on every format parse (hot path)
  • sanitizeElement.ts: no longer strips during HTML sanitization
  • decodeURIComponent pre-processing: removed per review feedback; percent-encoded content may be intentional

Characters stripped

  • Zero-width characters (U+200B-U+200F)
  • Bidirectional controls (U+202A-U+202E, U+2066-U+2069)
  • Unicode Tags (U+E0001-U+E00FF)
  • Soft hyphens, BOM, word joiners, Hangul fillers, Mongolian FVS, interlinear annotation anchors, and other invisible formatting chars
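
A minimal sketch of what such a stripping utility could look like, assuming a single consolidated character class. The exact code points below are an approximation built from the list above (the specific fillers, joiners, and other singletons are assumptions), not the regex in stripInvisibleUnicode.ts.

```typescript
// Hypothetical character set approximating the categories listed above.
// Built with the RegExp constructor so the `u` flag can address U+E0001-U+E00FF directly.
const INVISIBLE_UNICODE = new RegExp(
    '[' +
        '\\u00AD' + // soft hyphen
        '\\u180B-\\u180D' + // Mongolian free variation selectors
        '\\u200B-\\u200F' + // zero-width space/joiners, LRM, RLM
        '\\u202A-\\u202E' + // bidirectional embeddings and overrides
        '\\u2060' + // word joiner
        '\\u2066-\\u2069' + // bidirectional isolates
        '\\u3164\\uFFA0' + // Hangul fillers
        '\\uFEFF' + // byte order mark
        '\\uFFF9-\\uFFFB' + // interlinear annotation anchors
        '\\u{E0001}-\\u{E00FF}' + // Unicode Tags range as given in this PR
        ']',
    'gu'
);

// Removes raw invisible characters only; percent-encoded sequences are left untouched.
function stripInvisibleUnicode(value: string): string {
    return value.replace(INVISIBLE_UNICODE, '');
}

// Usage
const cleanedHref = stripInvisibleUnicode('mailto:a\u200Bb@example.com');
const cleanedTag = stripInvisibleUnicode('x\uDB40\uDC41y'); // U+E0041 written as a surrogate pair
const untouched = stripInvisibleUnicode('caf\u00E9 %E2%80%8B');
```

Note that this sketch removes U+200D (zero-width joiner) along with the rest of the U+200B-U+200F range, which would split ZWJ emoji sequences; how the PR balances that against emoji preservation is not shown here.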

Test plan

  • 33 unit tests for stripInvisibleUnicode (individual chars, combinations, supplementary plane, emoji preservation, visible Unicode preservation)
  • 9 unit tests for sanitizeInvisibleUnicode (text segments, link hrefs, tables, General segments, nested block groups, empty models)
  • 12 unit tests for checkXss (XSS detection with invisible Unicode bypass attempts)
  • All 109 related tests pass on Chrome 145

🤖 Generated with Claude Code

Strip invisible Unicode characters (zero-width chars, bidirectional marks,
Unicode Tags U+E0001-E007F, etc.) from link href attributes at multiple
layers to prevent hidden content injection via mailto: links.

Changes:
- Add stripInvisibleUnicode utility in roosterjs-content-model-dom
- Apply stripping in sanitizeElement.ts (HTML paste/sanitization path)
- Apply stripping in checkXss.ts (programmatic link insertion path)
- Apply stripping in linkFormatHandler.ts (DOM-to-model conversion path)
- Apply stripping in linkProcessor.ts (DOM-to-model conditional check)
- Add comprehensive unit tests for all changes

Bug: https://outlookweb.visualstudio.com/Outlook%20Web/_workitems/edit/409639

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanisa romanisa requested review from JiuqingSong and Copilot March 5, 2026 17:09
Copilot AI left a comment

Pull request overview

Strips invisible Unicode characters from link href values across sanitization, model conversion, and XSS checking to prevent hidden-content injection (notably via mailto:).

Changes:

  • Introduces stripInvisibleUnicode() utility and exports it from roosterjs-content-model-dom.
  • Applies stripping during DOM→model conversion (linkProcessor, linkFormatHandler) and HTML sanitization (sanitizeElement).
  • Updates checkXss() to strip invisible Unicode before evaluating script: patterns and adds unit tests across layers.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts | Adds the core stripping utility via a consolidated regex. |
| packages/roosterjs-content-model-dom/lib/index.ts | Exposes stripInvisibleUnicode from the package entrypoint. |
| packages/roosterjs-content-model-dom/lib/formatHandlers/segment/linkFormatHandler.ts | Strips invisibles when reading link formats from DOM attributes. |
| packages/roosterjs-content-model-dom/lib/domToModel/processors/linkProcessor.ts | Strips invisibles while processing <a href> into the content model. |
| packages/roosterjs-content-model-core/lib/command/createModelFromHtml/sanitizeElement.ts | Strips invisibles when sanitizing href attributes. |
| packages/roosterjs-content-model-dom/test/domUtils/stripInvisibleUnicodeTest.ts | Adds focused unit coverage for stripping behavior across many code points. |
| packages/roosterjs-content-model-dom/test/domToModel/processors/linkProcessorTest.ts | Ensures DOM→model link processing strips invisible Unicode in href. |
| packages/roosterjs-content-model-core/test/command/createModelFromHtml/sanitizeElementTest.ts | Ensures sanitizer strips invisibles from href but not unrelated attributes. |
| packages/roosterjs-content-model-api/lib/publicApi/utils/checkXss.ts | Strips invisibles before XSS detection and returns sanitized links. |
| packages/roosterjs-content-model-api/test/publicApi/utils/checkXssTest.ts | Adds tests for invisible Unicode stripping + script: obfuscation detection. |

romanisa and others added 2 commits March 5, 2026 13:07
…t/linkFormatHandler.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…, use strict equality

- Strip invisible Unicode from href BEFORE the script: regex check to
  prevent XSS bypass (e.g., s\u200Bcript: passing the check then being
  stripped to script:)
- Guard against empty href after stripping in linkFormatHandler (only
  set format.href when sanitizedHref is non-empty)
- Use strict equality (===) for attribute name checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
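
To make the ordering issue in this commit concrete, here is a small sketch contrasting strip-then-check with check-then-strip. The `script:` pattern and the function names are simplified assumptions for illustration, not the actual checkXss implementation.

```typescript
// Abbreviated stand-in for the invisible character set.
const INVISIBLE_CHARS = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069\uFEFF]/g;

// Simplified stand-in for the script: check; no `g` flag, so test() is stateless.
const SCRIPT_SCHEME = /script:/i;

// Order used after this commit: strip first, then run the check on the result.
function stripThenCheck(link: string): string {
    const sanitized = link.replace(INVISIBLE_CHARS, '');
    return SCRIPT_SCHEME.test(sanitized) ? '' : sanitized;
}

// Buggy order, for contrast: the check sees the obfuscated string and passes it,
// then stripping reassembles the dangerous scheme.
function checkThenStrip(link: string): string {
    return SCRIPT_SCHEME.test(link) ? '' : link.replace(INVISIBLE_CHARS, '');
}

const obfuscated = 'javas\u200Bcript:alert(1)';
const safeResult = stripThenCheck(obfuscated);
const bypassed = checkThenStrip(obfuscated);
```

With the obfuscated input, the strip-first version rejects the link, while the check-first version lets a reassembled `javascript:` URL through.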
@romanisa romanisa self-assigned this Mar 5, 2026
romanisa and others added 2 commits March 5, 2026 14:36
… comments

Extend the stripped character set to include Mongolian free variation
selectors (U+180B-180D), interlinear annotation anchors (U+FFF9-FFFB),
and extended Unicode Tags (U+E0080-E00FF). Add defense-in-depth comments
at each call site, document ZWJ/emoji and URL-encoding limitations, and
add 4 new tests for the expanded character ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add decodeURIComponent before stripping so that URL-encoded invisible
characters (e.g. %E2%80%8B for U+200B) are also caught. Falls back
gracefully on malformed percent-encoding. Adds 5 new tests for
URL-encoded scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
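
For context on this intermediate iteration, a sketch of decode-then-strip with a fallback on malformed input might look like the following. The function name is hypothetical, and returning the decoded string (rather than re-encoding it) is an assumption; the commit does not say which the PR did.

```typescript
// Abbreviated stand-in for the invisible character set.
const INVISIBLE_RAW = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069\uFEFF]/g;

// Hypothetical helper: decode percent-encoding first so that sequences such as
// %E2%80%8B (U+200B) become raw characters the regex can match.
function decodeThenStrip(href: string): string {
    let decoded: string;
    try {
        decoded = decodeURIComponent(href);
    } catch {
        // decodeURIComponent throws URIError on malformed sequences; keep the original.
        decoded = href;
    }
    return decoded.replace(INVISIBLE_RAW, '');
}

const encodedZwsp = decodeThenStrip('mailto:a%E2%80%8Bb@example.com');
const malformed = decodeThenStrip('100%');
const sideEffect = decodeThenStrip('a%20b'); // legitimate encoding is also decoded
```

The last call shows the cost of this approach: legitimate percent-encoded content is rewritten as well, which matches the reason a later commit gives for dropping the decoding step.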
…hot path

Remove stripInvisibleUnicode from linkProcessor, linkFormatHandler, and
sanitizeElement (all hot paths that run on every DOM-to-model conversion).
Instead, sanitize the entire initial model once in the Editor constructor
via new sanitizeInvisibleUnicode() utility. This walks all text segments,
link hrefs, and Text nodes inside General segment elements. The checkXss
path still covers insertLink at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per PR review feedback, do not decode percent-encoded sequences as the
content may be intentional user input. Only strip raw invisible Unicode
characters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@romanisa romanisa changed the title from "Strip invisible Unicode from link hrefs for defense-in-depth" to "Strip invisible Unicode from content model at editor initialization" on Mar 13, 2026