fix: Replace with fast-xml-parser (pure ESM) which works in all JS runtimes including workerd.#596
fix: Replace with fast-xml-parser (pure ESM) which works in all JS runtimes including workerd.#596maikunari wants to merge 5 commits intoemdash-cms:mainfrom
Conversation
|
Scope checkThis PR changes 1,567 lines across 4 files. Large PRs are harder to review and more likely to be closed without review. If this scope is intentional, no action needed. A maintainer will review it. If not, please consider splitting this into smaller PRs. See CONTRIBUTING.md for contribution guidelines. |
@emdash-cms/admin
@emdash-cms/auth
@emdash-cms/blocks
@emdash-cms/cloudflare
emdash
create-emdash
@emdash-cms/gutenberg-to-portable-text
@emdash-cms/x402
@emdash-cms/plugin-ai-moderation
@emdash-cms/plugin-atproto
@emdash-cms/plugin-audit-log
@emdash-cms/plugin-color
@emdash-cms/plugin-embeds
@emdash-cms/plugin-forms
@emdash-cms/plugin-webhook-notifier
commit: |
|
How does memory use compare? |
Fast-xml-parser builds a full object tree, so it does use more memory than SAX's streaming approach. However, sax is CJS-only and crashes on Cloudflare Workers with module is not defined, so it's not viable for the workerd runtime. For typical WXR files (most are under 50MB), the memory footprint is reasonable. If very large imports become an issue down the line, we could explore chunked processing — but I figured getting imports working on Cloudflare first was the priority. Happy to optimize later if needed! |
There was a problem hiding this comment.
Pull request overview
Replaces the WordPress WXR parser dependency from CommonJS-only sax to fast-xml-parser to avoid module is not defined failures in Cloudflare Workers (workerd), while keeping the existing WxrData output shape.
Changes:
- Swap SAX event-driven parsing for
fast-xml-parsertree-based parsing in the WXR parser. - Remove
sax, addfast-xml-parser@^5.6.0(and update lockfile). - Add a new unit test suite covering
parseWxrString()scenarios.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
packages/core/src/cli/wxr/parser.ts |
Reimplements WXR parsing using fast-xml-parser, including item/meta/taxonomy extraction and stream handling. |
packages/core/tests/unit/wxr/parser.test.ts |
Adds unit tests for parseWxrString() to validate behavior in non-Node runtimes. |
packages/core/package.json |
Replaces sax dependency with fast-xml-parser. |
pnpm-lock.yaml |
Lockfile updates reflecting dependency swap and transitive deps. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const SINGLE_POST_WXR = `<?xml version="1.0" encoding="UTF-8"?> | ||
| <rss version="2.0" | ||
| xmlns:content="http://purl.org/rss/1.0/modules/content/" | ||
| xmlns:dc="http://purl.org/dc/elements/1.1/" |
| it("parses a single published post with categories, tags, and meta", async () => { | ||
| const data = await parseWxrString(SINGLE_POST_WXR); | ||
|
|
||
| expect(data.posts.length).toBe(1); | ||
| const post = data.posts[0]; |
|
|
||
| post.title = getText(item["title"]) || undefined; | ||
| post.link = getText(item["link"]) || undefined; | ||
| post.pubDate = getText(item["pubdate"]) || undefined; |
| stream.on("data", (chunk: Buffer) => { | ||
| chunks.push(chunk); |
| /** Parse a numeric string, returning undefined for NaN/0/missing */ | ||
| function parseIntSafe(val: string | undefined): number | undefined { | ||
| if (!val) return undefined; | ||
| const n = parseInt(val, 10); | ||
| return isNaN(n) ? undefined : n; |
| * This is compatible with all environments; for very large files (>100MB), | ||
| * consider using parseWxrString() with chunked reading instead. |
|
|
||
| import { describe, it, expect } from "vitest"; | ||
|
|
||
| import { parseWxrString, type WxrData } from "../../../src/cli/wxr/parser.js"; |
| "citty": "^0.1.6", | ||
| "consola": "^3.4.2", | ||
| "croner": "^10.0.1", | ||
| "fast-xml-parser": "^5.6.0", |
| const parser = createWxrParser(); | ||
| const parsed = parser.parse(xml) as Record<string, unknown>; | ||
| return Promise.resolve(extractWxrData(parsed)); |
The WordPress WXR parser used the 'sax' package (CommonJS-only) which caused 'module is not defined' errors in Cloudflare Workers (workerd). Replace with fast-xml-parser (pure ESM, zero deps) which works in all JS runtimes including workerd. Key changes: - Replace SAX event-driven parsing with tree-based XML parsing - parseWxrString() and parseWxr() now use fast-xml-parser's XMLParser - Remove sax dependency, add fast-xml-parser@^5.6.0 - Add 15 unit tests for the new parser implementation - Maintain identical output types and behavior (WxrData interface unchanged) Fixes emdash-cms#573
|
For handling CJS, our approach is usually to add the dependency to |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8f2688f to
ec075f7
Compare
|
Addressed Copilot's review feedback: Fixed pubDate casing bug (was reading lowercase, WXR uses capital D) |
The original sax → fast-xml-parser commit introduced 52 TS2345 errors that weren't caught locally because typecheck requires workspace builds. The helpers declared `XmlNode = string | number | ... | undefined` but every call site passes values from `Record<string, unknown>` indexing. Widen the helper signatures to accept `unknown`. The helpers already defensively runtime-check every branch, so permissive input types are safe. Removed the now-unused `XmlNode` type alias. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Follow-up: the typecheck failure in CI turned out to be pre-existing on the branch — the original sax → fast-xml-parser commit introduced 52 TS2345 errors that weren't caught locally because typecheck requires workspace builds. Widened getText/getAttr to accept unknown (they already runtime-check every branch, so permissive input is safe). All checks should pass now. |
After widening node to unknown, two paths could call String() on arbitrary values, producing "[object Object]" or similar. Replace the fallback with "" and narrow obj["#text"] to string|number before stringifying. Matches existing defensive pattern for missing #text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The WordPress WXR parser used the 'sax' package (CommonJS-only) which caused 'module is not defined' errors in Cloudflare Workers (workerd).
Replace with fast-xml-parser (pure ESM) which works in all JS runtimes including workerd.
Key changes:
Fixes #573
What does this PR do?
Replaces the sax XML parser (CommonJS-only) with fast-xml-parser (pure ESM) so WordPress WXR imports work in Cloudflare Workers (workerd). The original sax dependency crashes with module is not defined in workerd.
Fixes #573
Closes #
Type of change
Checklist
pnpm typecheckpassespnpm lintpassespnpm testpasses (or targeted tests for my change)pnpm formathas been runpnpm locale:extracthas been run (if applicable)AI-generated code disclosure
Screenshots / test output