
feat(xsd-ingest): broaden default ingest to full Transitional bundle #6

Merged
caio-pizzol merged 2 commits into main from caio/ooxml-mcp-ingest-broaden
May 12, 2026
Conversation

@caio-pizzol
Contributor

Walks all 26 Transitional XSDs instead of just wml.xsd's import closure (12 of 26). SML, PML, VML, and several standalone shared schemas were never reaching the schema graph, so structural lookups failed on everything outside WordprocessingML. The motivating case was ds:datastoreItem, which lives in shared-customXmlDataProperties.xsd and was unreachable.

  • Default entrypoints become an explicit 9-root list whose union closure covers all 26 files. Explicit over glob so stray files in the cache dir can't sneak in.
  • No vocabulary.ts change: every targetNamespace the broader set declares is already registered. No alias between spec-prose URIs and XSD URIs is added; that is a separate concern.
  • New smoke test ingests the full closure and pins floors so a regression that drops a vocabulary fails.
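
As a minimal sketch of the "explicit roots, union closure" idea in the bullets above: the snippet below walks a toy import graph from a list of roots and reports cache files that no root reaches. The `ImportGraph` type, the function names, and the file names `shared-a.xsd` and `stray.xsd` are assumptions made for illustration; the real ingest reads imports from the XSD files themselves and may be structured differently.

```typescript
// Hypothetical in-memory import graph: file name -> files it imports or includes.
type ImportGraph = ReadonlyMap<string, readonly string[]>;

/** Union of the import closures of all roots, computed with an iterative DFS. */
function unionClosure(roots: readonly string[], graph: ImportGraph): Set<string> {
  const seen = new Set<string>();
  const stack = [...roots];
  while (stack.length > 0) {
    const file = stack.pop()!;
    if (seen.has(file)) continue; // already visited via another root or import
    seen.add(file);
    for (const dep of graph.get(file) ?? []) {
      if (!seen.has(dep)) stack.push(dep);
    }
  }
  return seen;
}

/** Cache files that no root reaches; with explicit roots these are never ingested. */
function unreachable(cacheFiles: readonly string[], closure: ReadonlySet<string>): string[] {
  return cacheFiles.filter((f) => !closure.has(f));
}

// Toy graph (illustrative only, not the real ECMA-376 import structure).
const toyGraph = new Map<string, string[]>([
  ["wml.xsd", ["shared-a.xsd"]],
  ["shared-a.xsd", []],
  ["shared-customXmlDataProperties.xsd", []],
  ["stray.xsd", []],
]);
const toyRoots = ["wml.xsd", "shared-customXmlDataProperties.xsd"];
const toyClosure = unionClosure(toyRoots, toyGraph);
const toyStray = unreachable([...toyGraph.keys()], toyClosure);
```

With a glob, `stray.xsd` would be picked up as a root of its own; with an explicit list it stays outside the closure, which is the property the bullet argues for.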

This ships code only. The production DB is not touched by merging. Run the ingest post-merge.

Runbook (post-merge, against production DB)

bun run xsd:fetch    # only if data/xsd-cache isn't current
bun run xsd:ingest   # uses the new default entrypoints

Expected stats from the new default closure (captured locally against a clean DB):

| metric | before (wml.xsd only) | after (9 entrypoints) |
| --- | --- | --- |
| documents | 12 | 26 |
| symbols inserted | ~1,300 | 5,619 |
| namespaces ensured | ~7 | 27 |
| inheritance edges | ~300 | 593 |
| compositors | ~500 | 1,023 |
| child edges | ~1,000 | 3,414 |
| local elements | ~1,500 | 3,231 |
| attribute edges | ~500 | 3,307 |
| enums | ~200 | 3,262 |
| unresolved (any kind) | 0 | 0 |
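
One way to pin such stats as floors, sketched under assumptions: the `IngestStats` shape, the `belowFloor` helper, and the metric keys are hypothetical, and the `symbols` floor of 2,000 is an illustrative value chosen between the WML-only baseline and the new figure, not the value used in the actual test. Only the 26-document figure is asserted by the PR itself.

```typescript
// Hypothetical stats shape: metric name -> count reported by an ingest run.
type IngestStats = Record<string, number>;

/** Returns one message per metric that falls below its floor (empty array = pass). */
function belowFloor(stats: IngestStats, floors: IngestStats): string[] {
  return Object.entries(floors)
    .filter(([metric, floor]) => (stats[metric] ?? 0) < floor)
    .map(([metric, floor]) => `${metric}: ${stats[metric] ?? 0} < ${floor}`);
}

// Floors sit above the WML-only baseline, so losing a vocabulary trips them.
// The symbols floor is illustrative only.
const floors: IngestStats = { documents: 26, symbols: 2000 };

// Figures taken from the table: full closure vs. the approximate WML-only baseline.
const fullClosure: IngestStats = { documents: 26, symbols: 5619 };
const wmlOnly: IngestStats = { documents: 12, symbols: 1300 };

const fullFailures = belowFloor(fullClosure, floors);
const wmlFailures = belowFloor(wmlOnly, floors);
```

Floors rather than exact counts keep the check tolerant of schema updates that add symbols, while still failing if a whole vocabulary drops out.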

Smoke checks after ingest:

ooxml_element ds:datastoreItem        # was missing; should resolve under .../customXml
ooxml_element x:workbook              # SML root, should resolve under sml-main
ooxml_element p:presentation          # PML root, should resolve under pml-main
ooxml_namespace                       # 27 ingested namespaces (was ~7)

Verified locally: full test suite 59 pass / 0 fail / 0 skip (smoke tests previously skipped via skipIf(!realCacheReady) now run with the cache populated).

Review: confirm the explicit entrypoint list captures the bundle you'd expect; flag if any namespace should be excluded from the default profile. Ignore: vocabulary.ts (unchanged) and ooxml-tools.ts (unchanged from PR 1).

…l bundle

The default ingest only walked wml.xsd's import closure (12 of 26 XSDs).
SML, PML, VML, and several standalone shared schemas - including
shared-customXmlDataProperties.xsd (the home of ds:datastoreItem) -
never reached the schema graph, so structural tools failed on anything
outside WordprocessingML.

Default entrypoints become an explicit list of 9 roots whose union
closure covers all 26 files in data/xsd-cache/ecma-376-transitional/.
Explicit over glob so a stray file in the cache directory can't quietly
land in production ingest.

No code changes to vocabulary.ts: every targetNamespace the broader set
declares is already registered. No spec-prose vs XSD URI alias is added -
that's a separate concern.

Adds a smoke test that ingests the full closure and asserts (a) 26
documents parsed, (b) ds:datastoreItem resolves under the customXml
namespace, (c) SML / PML top-level elements land in their vocabularies,
and (d) no unresolved child / group / attrGroup edges. Floors are set
above today's WML-only baseline so a regression that drops a vocabulary
fails the test.

This PR ships code only; the production DB is not mutated as part of
merging. See the PR body for the post-merge runbook (run xsd:ingest,
expected deltas, smoke checks).

Two P3 findings from PR review:

- scripts/ingest-xsd/README.md still claimed the default ingest walked
  wml.xsd's 12-document closure. Updated to describe the 9-root / 26-XSD
  default and how to narrow it back when needed.
- tests/ingest-xsd/ingest.test.ts gated the new full-bundle smoke test
  on the presence of wml.xsd alone. With a partial cache (e.g. a dev who
  fetched WML only for hand-testing), the test would try to readFile a
  missing root and fail. It now gates on all 9 default entrypoints; the
  existing WML-only smoke test keeps its narrow wml.xsd-only gate.
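
A minimal sketch of a gate like the one described in the second finding, assuming a helper that takes the cache directory and the entrypoint file names. The name `allEntrypointsCached`, its parameters, and the file name `other-root.xsd` are hypothetical; the existence check is injectable so the logic can be exercised without touching the filesystem.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

/**
 * True only if every entrypoint file is present in the cache directory.
 * `exists` defaults to fs.existsSync and can be swapped out in tests.
 */
function allEntrypointsCached(
  cacheDir: string,
  entrypoints: readonly string[],
  exists: (path: string) => boolean = existsSync,
): boolean {
  // An empty list is treated as "not ready" so a misconfigured gate skips instead of passing.
  return entrypoints.length > 0 && entrypoints.every((name) => exists(join(cacheDir, name)));
}

// Simulated partial cache: only wml.xsd was fetched (illustrative paths).
const partialCache = new Set([join("cache", "wml.xsd")]);
const fakeExists = (p: string): boolean => partialCache.has(p);

const wmlGate = allEntrypointsCached("cache", ["wml.xsd"], fakeExists);
const fullGate = allEntrypointsCached("cache", ["wml.xsd", "other-root.xsd"], fakeExists);
```

The boolean can then feed the `skipIf(!realCacheReady)` style of gate mentioned in the PR body, so the full-bundle test is skipped on a partial cache while the WML-only test still runs.
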
@caio-pizzol caio-pizzol merged commit e819a33 into main May 12, 2026
1 check passed