Skip to content

fix: normalize root-level docx parts on import#3254

Open
financialvice wants to merge 1 commit into
superdoc-dev:mainfrom
financialvice:cg/root-level-docx-parts
Open

fix: normalize root-level docx parts on import#3254
financialvice wants to merge 1 commit into
superdoc-dev:mainfrom
financialvice:cg/root-level-docx-parts

Conversation

@financialvice
Copy link
Copy Markdown
Contributor

Summary

  • Resolve the package-level officeDocument relationship during DOCX unzip.
  • Normalize DOCX packages whose main document is stored at root-level document.xml into SuperDoc's canonical word/document.xml shape.
  • Normalize sibling root-level WordprocessingML parts and relationship files, plus root-level images/ assets, so existing importer code can keep using word/... paths.
  • Add a synthetic regression fixture that mirrors this package layout without checking in a real legal document.

Why

Some valid/nonstandard DOCX producers store the main document at package root and point _rels/.rels at document.xml instead of word/document.xml. Before this change, SuperConverter failed to find word/document.xml, fell back to parseXmlToJson(undefined), and threw Cannot read properties of undefined (reading 'replace'), which surfaced through the CLI as DOCUMENT_OPEN_FAILED.

Validation

  • pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/DocxZipper.test.js
  • pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/super-converter/SuperConverter.test.js src/editors/v1/core/DocxZipper.test.js
  • pnpm exec prettier --write packages/super-editor/src/editors/v1/core/DocxZipper.js packages/super-editor/src/editors/v1/core/DocxZipper.test.js
  • pnpm exec eslint packages/super-editor/src/editors/v1/core/DocxZipper.js packages/super-editor/src/editors/v1/core/DocxZipper.test.js (test file is ignored by repo config; source file passed)
  • Verified a real root-level document.xml fixture that previously returned DOCUMENT_OPEN_FAILED now returns CLI ok:true from pnpm --prefix apps/cli run dev -- info <fixture> --json.

Note: that real fixture still emits an unrelated ProseMirror RangeError: Invalid content for node type run during a plugin pass, but the document now opens and read/info succeeds. This PR is scoped to the DOCX package-layout import failure.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd66339c36

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +77 to +79
if (name.startsWith('images/')) {
return `word/${name}`;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Normalize root-level media assets during path rewrite

When document.xml is rewritten to word/document.xml, image relationship targets like media/image1.jpeg are later normalized to word/media/... by the importer (normalizeTargetPath), but this path rewriter only remaps images/ and leaves root-level media/ entries untouched. That causes DocxZipper to store bytes under media/... while image resolution looks for word/media/..., so embedded images disappear for root-level packages that use media/ targets.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@financialvice same issue flagged in the inline review at line 77, extends to theme/, embeddings/, and fonts/ too.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @financialvice! thanks for the PR - reading the package rels to find the real main-doc target is the right move, and the synthetic fixture is a clean way to cover this without checking in a real document.

Important

needs work, left two inline.

real-world root-level packages don't just put document.xml at root - they usually mirror Word's whole layout there. so packages with theme/ or media/ siblings (the spec-typical Target shapes per Part 1 §14.2.7 and §15.2.14) won't load themes or images after this fix because those subdirectories aren't normalized.

agent context
{
  "schema": "review_handoff_v1",
  "review_id": "pr-3254",
  "findings": [
    {
      "id": "root-subdirs-not-normalized",
      "severity": "important",
      "anchors": [{"file": "packages/super-editor/src/editors/v1/core/DocxZipper.js", "line": 77, "quote": "if (name.startsWith('images/'))"}],
      "claim": "normalizeRootLevelWordPartPath only rewrites images/, single-segment root XML, and _rels/; root-level media/, theme/, embeddings/, fonts/, customXml/ are left at root while the rels-file relocation shifts the base URI to /word/, causing downstream lookups to miss.",
      "failure_mode": "after moving _rels/document.xml.rels to word/_rels/, relationship Targets like 'media/image1.png' or 'theme/theme1.xml' resolve against the new /word/ base URI per Part 2 §9.3.2.2 and Annex A.3, so downstream code looks under word/media/... and word/theme/..., but the physical files are still at root and lookups fail silently.",
      "verification": [
        {
          "command": "vitest run on a synthetic fixture: root-level _rels/.rels -> document.xml, _rels/document.xml.rels with Target='media/image1.png' and Target='theme/theme1.xml', physical media/image1.png and theme/theme1.xml at root",
          "result_summary": "mediaFiles keys: ['media/image1.png'] (expected 'word/media/image1.png'); files array does not contain 'word/theme/theme1.xml'",
          "status": "executed"
        }
      ],
      "ruled_out": [
        {
          "hypothesis": "downstream code (normalizeTargetPath in encode-image-node-helpers.js) might already absorb non-word/-prefixed media paths",
          "why_ruled_out": "normalizeTargetPath prepends word/ to non-word/-prefixed targets, producing a word/media/image1.png lookup - but the file is stored at media/image1.png because no normalization branch covers root-level media/"
        },
        {
          "hypothesis": "this is a tested edge case the real fixture doesn't exercise",
          "why_ruled_out": "Part 1 §14.2.7 (Theme) shows Target='theme/theme1.xml' and §15.2.14 (Image) shows Target='media/image1.png' - these are the spec-typical Target shapes producers emit; any spec-conforming root-level package will hit this"
        }
      ],
      "handoff": {
        "recommended_start": "packages/super-editor/src/editors/v1/core/DocxZipper.js normalizeRootLevelWordPartPath",
        "suggested_fix": "invert the rewrite logic: normalize any non-word/, non-[Content_Types].xml, non-_rels/.rels, non-docProps/ path by prepending word/. Alternative: rewrite relationship Target values inside _rels/document.xml.rels when relocating the rels file.",
        "expected_files": [
          "packages/super-editor/src/editors/v1/core/DocxZipper.js",
          "packages/super-editor/src/editors/v1/core/DocxZipper.test.js"
        ],
        "avoid_files": [
          "packages/super-editor/src/editors/v1/core/super-converter/"
        ],
        "do_not_repeat": []
      },
      "verification_after_fix": [
        {
          "command": "pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/DocxZipper.test.js",
          "expected": "new test fixture with root-level media/ and theme/ confirms mediaFiles keyed by word/media/... and files contain word/theme/theme1.xml"
        }
      ]
    },
    {
      "id": "main-rels-hardcoded-document-xml",
      "severity": "important",
      "anchors": [{"file": "packages/super-editor/src/editors/v1/core/DocxZipper.js", "line": 67, "quote": "return 'word/_rels/document.xml.rels';"}],
      "claim": "the main document's rels file is rewritten to a hardcoded path 'word/_rels/document.xml.rels' while the main part itself falls through to the generic root-XML branch and ends up at word/${name}; for non-'document.xml' main parts the two diverge and the importer expects the main doc at word/document.xml.",
      "failure_mode": "ECMA-376 Part 1 §11.3.10 does not require the main document part to be named 'document.xml'; only that it be the target of the package-level officeDocument relationship. A producer using e.g. 'main.xml' yields word/main.xml + word/_rels/document.xml.rels after this PR, and SuperConverter never finds word/document.xml.",
      "verification": [
        {
          "command": "code read: DocxZipper.js lines 66-74 - mainDocumentRelsPath is derived from mainDocumentPath, but the rewrite target is the literal 'word/_rels/document.xml.rels'; the root-XML branch returns word/${name}, so the two halves disagree when mainDocumentPath !== 'document.xml'",
          "result_summary": "name='main.xml' -> word/main.xml; name='_rels/main.xml.rels' -> word/_rels/document.xml.rels; SuperConverter expects word/document.xml, finds only word/main.xml",
          "status": "advisory"
        }
      ],
      "ruled_out": [],
      "handoff": {
        "recommended_start": "packages/super-editor/src/editors/v1/core/DocxZipper.js normalizeRootLevelWordPartPath - the main-document branch",
        "suggested_fix": "when name === mainDocumentPath, return 'word/document.xml' (matching the hardcoded rels target). The part and its rels stay in sync regardless of the producer's chosen name.",
        "expected_files": [
          "packages/super-editor/src/editors/v1/core/DocxZipper.js",
          "packages/super-editor/src/editors/v1/core/DocxZipper.test.js"
        ],
        "avoid_files": [],
        "do_not_repeat": []
      },
      "verification_after_fix": [
        {
          "command": "pnpm exec vitest run --root ./packages/super-editor src/editors/v1/core/DocxZipper.test.js",
          "expected": "new test with main part named main.xml at root confirms files contains word/document.xml and word/_rels/document.xml.rels"
        }
      ]
    }
  ]
}

return `word/${name}`;
}

if (name.startsWith('images/')) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: only images/ is rewritten, but spec-typical root-level packages also use media/, theme/, embeddings/, fonts/. After moving _rels/document.xml.rels to word/_rels/..., the rels file's base URI shifts from / to /word/, so Target="theme/theme1.xml" and Target="media/image1.png" (the shapes shown in Part 1 §14.2.7 and §15.2.14) now resolve to word/theme/... and word/media/... - but those subdirectories aren't normalized here, so the files stay at root and downstream lookups miss them.

verified by adding a fixture with root-level theme/theme1.xml and media/image1.png and running getDocxData: mediaFiles came back as { 'media/image1.png': ... } (expected 'word/media/image1.png'), and the file list didn't contain 'word/theme/theme1.xml'.

likely fix: invert the rewrite logic - normalize any non-word/, non-[Content_Types].xml, non-_rels/.rels, non-docProps/ path by prepending word/. Or rewrite relationship Target values inside the rels XML when relocating the rels file.

if (name.startsWith('word/')) return name;

const mainDocumentRelsPath = `_rels/${mainDocumentPath}.rels`;
if (name === mainDocumentRelsPath) return 'word/_rels/document.xml.rels';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: this hardcodes word/_rels/document.xml.rels as the rewrite target, but the main document part itself falls through to the generic root-XML branch and ends up at word/${name}. For a package whose main part is named anything other than document.xml (e.g. main.xml), the part lands at word/main.xml while its rels land at word/_rels/document.xml.rels, and the rest of the importer expects the main doc at word/document.xml.

ECMA-376 Part 1 §11.3.10 doesn't require the main document part to be named document.xml - only that it be the target of the package-level officeDocument relationship.

likely fix: when name === mainDocumentPath, return 'word/document.xml' so the part and its rels stay in sync regardless of the producer's chosen name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants