fix: normalize root-level docx parts on import#3254
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd66339c36
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (name.startsWith('images/')) { | ||
| return `word/${name}`; | ||
| } |
There was a problem hiding this comment.
Normalize root-level media assets during path rewrite
When document.xml is rewritten to word/document.xml, image relationship targets like media/image1.jpeg are later normalized to word/media/... by the importer (normalizeTargetPath), but this path rewriter only remaps images/ and leaves root-level media/ entries untouched. That causes DocxZipper to store bytes under media/... while image resolution looks for word/media/..., so embedded images disappear for root-level packages that use media/ targets.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
@financialvice same issue flagged in the inline review at line 77, extends to theme/, embeddings/, and fonts/ too.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
hey @financialvice! thanks for the PR - reading the package rels to find the real main-doc target is the right move, and the synthetic fixture is a clean way to cover this without checking in a real document.
Important
needs work, left two inline.
real-world root-level packages don't just put document.xml at root - they usually mirror Word's whole layout there. so packages with theme/ or media/ siblings (the spec-typical Target shapes per Part 1 §14.2.7 and §15.2.14) won't load themes or images after this fix because those subdirectories aren't normalized.
agent context
{
"schema": "review_handoff_v1",
"review_id": "pr-3254",
"findings": [
{
"id": "root-subdirs-not-normalized",
"severity": "important",
"anchors": [{"file": "packages/super-editor/src/editors/v1/core/DocxZipper.js", "line": 77, "quote": "if (name.startsWith('images/'))"}],
"claim": "normalizeRootLevelWordPartPath only rewrites images/, single-segment root XML, and _rels/; root-level media/, theme/, embeddings/, fonts/, customXml/ are left at root while the rels-file relocation shifts the base URI to /word/, causing downstream lookups to miss.",
"failure_mode": "after moving _rels/document.xml.rels to word/_rels/, relationship Targets like 'media/image1.png' or 'theme/theme1.xml' resolve against the new /word/ base URI per Part 2 §9.3.2.2 and Annex A.3, so downstream code looks under word/media/... and word/theme/..., but the physical files are still at root and lookups fail silently.",
"verification": [
{
"command": "vitest run on a synthetic fixture: root-level _rels/.rels -> document.xml, _rels/document.xml.rels with Target='media/image1.png' and Target='theme/theme1.xml', physical media/image1.png and theme/theme1.xml at root",
"result_summary": "mediaFiles keys: ['media/image1.png'] (expected 'word/media/image1.png'); files array does not contain 'word/theme/theme1.xml'",
"status": "executed"
}
],
"ruled_out": [
{
"hypothesis": "downstream code (normalizeTargetPath in encode-image-node-helpers.js) might already absorb non-word/-prefixed media paths",
"why_ruled_out": "normalizeTargetPath prepends word/ to non-word/-prefixed targets, producing a word/media/image1.png lookup - but the file is stored at media/image1.png because no normalization branch covers root-level media/"
},
{
"hypothesis": "this is a tested edge case the real fixture doesn't exercise",
"why_ruled_out": "Part 1 §14.2.7 (Theme) shows Target='theme/theme1.xml' and §15.2.14 (Image) shows Target='media/image1.png' - these are the spec-typical Target shapes producers emit; any spec-conforming root-level package will hit this"
}
],
"handoff": {
"recommended_start": "packages/super-editor/src/editors/v1/core/DocxZipper.js normalizeRootLevelWordPartPath",
"suggested_fix": "invert the rewrite logic: normalize any non-word/, non-[Content_Types].xml, non-_rels/.rels, non-docProps/ path by prepending word/. Alternative: rewrite relationship Target values inside _rels/document.xml.rels when relocating the rels file.",
"expected_files": [
"packages/super-editor/src/editors/v1/core/DocxZipper.js",
"packages/super-editor/src/editors/v1/core/DocxZipper.test.js"
],
"avoid_files": [
"packages/super-editor/src/editors/v1/core/super-converter/"
],
"do_not_repeat": []
},
"verification_after_fix": [
{
"command": "pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/DocxZipper.test.js",
"expected": "new test fixture with root-level media/ and theme/ confirms mediaFiles keyed by word/media/... and files contain word/theme/theme1.xml"
}
]
},
{
"id": "main-rels-hardcoded-document-xml",
"severity": "important",
"anchors": [{"file": "packages/super-editor/src/editors/v1/core/DocxZipper.js", "line": 67, "quote": "return 'word/_rels/document.xml.rels';"}],
"claim": "the main document's rels file is rewritten to a hardcoded path 'word/_rels/document.xml.rels' while the main part itself falls through to the generic root-XML branch and ends up at word/${name}; for non-'document.xml' main parts the two diverge and the importer expects the main doc at word/document.xml.",
"failure_mode": "ECMA-376 Part 1 §11.3.10 does not require the main document part to be named 'document.xml'; only that it be the target of the package-level officeDocument relationship. A producer using e.g. 'main.xml' yields word/main.xml + word/_rels/document.xml.rels after this PR, and SuperConverter never finds word/document.xml.",
"verification": [
{
"command": "code read: DocxZipper.js lines 66-74 - mainDocumentRelsPath is derived from mainDocumentPath, but the rewrite target is the literal 'word/_rels/document.xml.rels'; the root-XML branch returns word/${name}, so the two halves disagree when mainDocumentPath !== 'document.xml'",
"result_summary": "name='main.xml' -> word/main.xml; name='_rels/main.xml.rels' -> word/_rels/document.xml.rels; SuperConverter expects word/document.xml, finds only word/main.xml",
"status": "advisory"
}
],
"ruled_out": [],
"handoff": {
"recommended_start": "packages/super-editor/src/editors/v1/core/DocxZipper.js normalizeRootLevelWordPartPath - the main-document branch",
"suggested_fix": "when name === mainDocumentPath, return 'word/document.xml' (matching the hardcoded rels target). The part and its rels stay in sync regardless of the producer's chosen name.",
"expected_files": [
"packages/super-editor/src/editors/v1/core/DocxZipper.js",
"packages/super-editor/src/editors/v1/core/DocxZipper.test.js"
],
"avoid_files": [],
"do_not_repeat": []
},
"verification_after_fix": [
{
"command": "pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/DocxZipper.test.js",
"expected": "new test with main part named main.xml at root confirms files contains word/document.xml and word/_rels/document.xml.rels"
}
]
}
]
}| return `word/${name}`; | ||
| } | ||
|
|
||
| if (name.startsWith('images/')) { |
There was a problem hiding this comment.
important: only images/ is rewritten, but spec-typical root-level packages also use media/, theme/, embeddings/, fonts/. After moving _rels/document.xml.rels to word/_rels/..., the rels file's base URI shifts from / to /word/, so Target="theme/theme1.xml" and Target="media/image1.png" (the shapes shown in Part 1 §14.2.7 and §15.2.14) now resolve to word/theme/... and word/media/... - but those subdirectories aren't normalized here, so the files stay at root and downstream lookups miss them.
verified by adding a fixture with root-level theme/theme1.xml and media/image1.png and running getDocxData: mediaFiles came back as { 'media/image1.png': ... } (expected 'word/media/image1.png'), and the file list didn't contain 'word/theme/theme1.xml'.
likely fix: invert the rewrite logic - normalize any non-word/, non-[Content_Types].xml, non-_rels/.rels, non-docProps/ path by prepending word/. Or rewrite relationship Target values inside the rels XML when relocating the rels file.
| if (name.startsWith('word/')) return name; | ||
|
|
||
| const mainDocumentRelsPath = `_rels/${mainDocumentPath}.rels`; | ||
| if (name === mainDocumentRelsPath) return 'word/_rels/document.xml.rels'; |
There was a problem hiding this comment.
important: this hardcodes word/_rels/document.xml.rels as the rewrite target, but the main document part itself falls through to the generic root-XML branch and ends up at word/${name}. For a package whose main part is named anything other than document.xml (e.g. main.xml), the part lands at word/main.xml while its rels land at word/_rels/document.xml.rels, and the rest of the importer expects the main doc at word/document.xml.
ECMA-376 Part 1 §11.3.10 doesn't require the main document part to be named document.xml - only that it be the target of the package-level officeDocument relationship.
likely fix: when name === mainDocumentPath, return 'word/document.xml' so the part and its rels stay in sync regardless of the producer's chosen name.
Summary
officeDocumentrelationship during DOCX unzip.document.xmlinto SuperDoc's canonicalword/document.xmlshape.images/assets, so existing importer code can keep usingword/...paths.Why
Some valid/nonstandard DOCX producers store the main document at package root and point
_rels/.relsatdocument.xmlinstead ofword/document.xml. Before this change,SuperConverterfailed to findword/document.xml, fell back toparseXmlToJson(undefined), and threwCannot read properties of undefined (reading 'replace'), which surfaced through the CLI asDOCUMENT_OPEN_FAILED.Validation
pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/DocxZipper.test.jspnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/super-converter/SuperConverter.test.js src/editors/v1/core/DocxZipper.test.jspnpm exec prettier --write packages/super-editor/src/editors/v1/core/DocxZipper.js packages/super-editor/src/editors/v1/core/DocxZipper.test.jspnpm exec eslint packages/super-editor/src/editors/v1/core/DocxZipper.js packages/super-editor/src/editors/v1/core/DocxZipper.test.js(test file is ignored by repo config; source file passed)document.xmlfixture that previously returnedDOCUMENT_OPEN_FAILEDnow returns CLIok:truefrompnpm --prefix apps/cli run dev -- info <fixture> --json.Note: that real fixture still emits an unrelated ProseMirror
RangeError: Invalid content for node type runduring a plugin pass, but the document now opens and read/info succeeds. This PR is scoped to the DOCX package-layout import failure.