6 changes: 5 additions & 1 deletion CLAUDE.md
@@ -104,7 +104,7 @@ The XML you provide is wrapped in a minimal `w:document > w:body` structure auto

## MCP Server

-Cloudflare Worker exposing two tool families over MCP, backed by the same database.
+Cloudflare Worker exposing three tool families over MCP. The prose and schema-lookup tools are backed by the database; the package-metadata tool reads a curated static dataset bundled with the worker.

Prose search over the spec PDFs (powered by `spec_content`):

@@ -120,6 +120,10 @@ Structural queries over the XSD schema graph (powered by `xsd_*` tables):
- `ooxml_enum` - simpleType enumeration values
- `ooxml_namespace` - vocabularies and per-profile symbol counts for a namespace URI

OPC package metadata (powered by the curated `opc-parts.ts` dataset):

- `ooxml_package_part` - part-type info by content type, source relationship type, or query substring
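
The `opc-parts.ts` dataset itself is not part of this excerpt. As a rough sketch of what a curated static dataset with these three lookup modes could look like, the block below uses a hypothetical `SketchPart` shape, hypothetical keys, and hypothetical function names (`findByContentType`, `findByRelationshipType`, `search`). None of these are taken from the PR, and the real module may be organised differently.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface SketchPart {
  key: string;
  name: string;
  contentType: string;
  relationshipType: string;
  rootNamespace: string;
  rootElement: string;
  typicalPaths: string[];
}

const REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

// Two well-known part types; a curated dataset would list many more.
const PARTS: readonly SketchPart[] = [
  {
    key: "wml-styles", // hypothetical key
    name: "Style Definitions",
    contentType: "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml",
    relationshipType: `${REL}/styles`,
    rootNamespace: "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    rootElement: "styles",
    typicalPaths: ["/word/styles.xml"],
  },
  {
    key: "theme", // hypothetical key
    name: "Theme",
    contentType: "application/vnd.openxmlformats-officedocument.theme+xml",
    relationshipType: `${REL}/theme`,
    rootNamespace: "http://schemas.openxmlformats.org/drawingml/2006/main",
    rootElement: "theme",
    typicalPaths: ["/word/theme/theme1.xml"],
  },
];

// Exact match on content type: at most one part.
function findByContentType(ct: string): SketchPart | undefined {
  return PARTS.find((p) => p.contentType === ct);
}

// Exact match on relationship type: may return several parts when a type is shared.
function findByRelationshipType(rel: string): SketchPart[] {
  return PARTS.filter((p) => p.relationshipType === rel);
}

// Case-insensitive substring search; an empty query lists every part.
function search(query: string): SketchPart[] {
  const q = query.trim().toLowerCase();
  if (!q) return [...PARTS];
  return PARTS.filter((p) =>
    [p.name, p.contentType, p.relationshipType, p.rootNamespace, p.rootElement].some((field) =>
      field.toLowerCase().includes(q),
    ),
  );
}
```

Because the data is a plain in-memory array, lookups of this kind need no database round trip, which fits the description of a dataset bundled with the worker.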

Uses PostgreSQL with pgvector (Neon serverless in production, Docker locally).

## Data Pipelines
3 changes: 2 additions & 1 deletion README.md
@@ -54,10 +54,11 @@ url = "https://api.ooxml.dev/mcp"
}
```

-Two tool families share one server:
+Three tool families share one server:

- **Prose search** (over the spec PDFs): `ooxml_search`, `ooxml_section`, `ooxml_parts`
- **Schema lookup** (over the parsed XSDs): `ooxml_element`, `ooxml_type`, `ooxml_children`, `ooxml_attributes`, `ooxml_enum`, `ooxml_namespace`
- **Package metadata** (curated from Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x): `ooxml_package_part`

## Development

11 changes: 10 additions & 1 deletion apps/mcp-server/README.md
@@ -1,9 +1,10 @@
# OOXML Reference MCP Server

-Cloudflare Worker that exposes ECMA-376 (Office Open XML) over the Model Context Protocol. Two tool families share one server:
+Cloudflare Worker that exposes ECMA-376 (Office Open XML) over the Model Context Protocol. Three tool families share one server:

- **Prose search** — semantic search across the four ECMA-376 part PDFs (~18,000 chunks, embedded with Voyage, queried with pgvector).
- **Schema lookup** — deterministic queries over the parsed XSD graph (profiles, namespaces, symbols, content models, attributes, enums).
- **Package metadata** — curated OPC part-type reference (content types, source relationship types, root namespaces, typical paths in the package).

Hosted at `https://api.ooxml.dev/mcp`.

@@ -69,6 +70,14 @@ Any MCP-compatible client that speaks Streamable HTTP can connect to the endpoin

Default profile is `transitional`. Future profiles will compose Transitional with Office extension schemas.

### Package metadata

| Tool | Returns |
| --- | --- |
| `ooxml_package_part` | OPC part type by content type, source relationship type, or query substring (Word / Excel / PowerPoint + cross-cutting parts) |

Curated from ECMA-376 Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x. Answers package-level questions the schema graph and prose corpus don't cover (e.g. "what kind of part is `/customXml/item1.xml`?").
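
As a minimal sketch of how a client might address this tool, the block below builds the JSON-RPC body for an MCP `tools/call` request. The helper name `buildPackagePartCall` and its types are hypothetical. The argument names come from the tool's input schema, and the envelope follows the MCP `tools/call` method. Sending the body to the endpoint (for example with `fetch`) is left out.

```typescript
// Arguments accepted by ooxml_package_part; all are optional.
type PackagePartArgs = {
  content_type?: string;
  relationship_type?: string;
  query?: string;
};

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: "ooxml_package_part"; arguments: PackagePartArgs };
}

// Hypothetical helper: wraps the tool arguments in a JSON-RPC 2.0 envelope.
function buildPackagePartCall(args: PackagePartArgs, id = 1): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "ooxml_package_part", arguments: args },
  };
}

// Substring search, e.g. to find out what kind of part lives under /customXml/.
const byQuery = buildPackagePartCall({ query: "customXml" });

// No arguments: asks the server to list every curated part.
const listAll = buildPackagePartCall({}, 2);

console.log(JSON.stringify(byQuery, null, 2));
```

An exact `content_type` or `relationship_type` argument would be passed the same way in place of `query`.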

## Development

```bash
10 changes: 6 additions & 4 deletions apps/mcp-server/src/index.ts
@@ -1,10 +1,12 @@
/**
* OOXML Reference MCP Server
*
- * Cloudflare Worker exposing two tool families over MCP:
- *   - prose search over ECMA-376 PDFs (ooxml_search, ooxml_section, ooxml_parts)
- *   - schema lookup over the parsed XSD graph (ooxml_element, ooxml_type,
- *     ooxml_children, ooxml_attributes, ooxml_enum, ooxml_namespace)
+ * Cloudflare Worker exposing three tool families over MCP:
+ *   - prose search over ECMA-376 PDFs (ooxml_search, ooxml_section, ooxml_parts)
+ *   - schema lookup over the parsed XSD graph (ooxml_element, ooxml_type,
+ *     ooxml_children, ooxml_attributes, ooxml_enum, ooxml_namespace)
+ *   - package metadata curated from Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x
+ *     (ooxml_package_part)
*/

import { createDb } from "./db";
2 changes: 1 addition & 1 deletion apps/mcp-server/src/mcp.ts
@@ -140,7 +140,7 @@ function handleInitialize(id: number | string | null): JsonRpcResponse {
version: "0.1.0",
},
instructions:
-"OOXML (ECMA-376 / Office Open XML) reference server. Two tool families: prose search over the spec PDFs (ooxml_search, ooxml_section, ooxml_parts) and deterministic schema lookup over the parsed XSDs (ooxml_element, ooxml_type, ooxml_children, ooxml_attributes, ooxml_enum, ooxml_namespace).",
+"OOXML (ECMA-376 / Office Open XML) reference server. Three tool families: (1) prose search over the spec PDFs (ooxml_search, ooxml_section, ooxml_parts); (2) deterministic schema lookup over the parsed XSDs (ooxml_element, ooxml_type, ooxml_children, ooxml_attributes, ooxml_enum, ooxml_namespace); (3) OPC package metadata curated from Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x (ooxml_package_part). The three corpora can disagree about URIs for the same concept (custom XML data storage is the canonical example); each tool surface notes when it keys on the XSD URI vs the spec-prose URI.",
},
};
}
148 changes: 147 additions & 1 deletion apps/mcp-server/src/ooxml-tools.ts
@@ -31,6 +31,13 @@ import {
parseQName,
type SymbolHit,
} from "./ooxml-queries";
import {
contentTypesOf,
findPartByContentType,
findPartsByRelationshipType,
type OpcPart,
searchParts,
} from "./opc-parts";

export const DEFAULT_PROFILE = "transitional";

@@ -130,6 +137,33 @@ export const OOXML_TOOL_DEFS: ToolDef[] = [
},
},
},
{
  name: "ooxml_package_part",
  description:
    "Look up OPC (Open Packaging Conventions) part types: content type, source relationship type, root namespace and element, typical paths in the package. Answers 'what kind of part is /customXml/item1.xml?' — package metadata that the schema graph doesn't capture. Four modes: " +
    "(1) `content_type` exact match (e.g. 'application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml'); " +
    "(2) `relationship_type` exact match (e.g. '.../officeDocument/2006/relationships/customXmlProps'); " +
    "(3) `query` case-insensitive substring across name, content type, relationship type, root namespace and element; " +
    "(4) no args → list every curated part. Curated from ECMA-376 Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x; covers Word / Excel / PowerPoint plus cross-cutting (properties, theme, image, custom XML).",
  inputSchema: {
    type: "object" as const,
    properties: {
      content_type: {
        type: "string",
        description: "Exact OPC content type (Content_Types.xml value).",
      },
      relationship_type: {
        type: "string",
        description: "Exact source relationship type URI.",
      },
      query: {
        type: "string",
        description:
          "Case-insensitive substring across name, content type, relationship type, namespace, element, notes.",
      },
    },
  },
},
];

export type OoxmlToolName =
@@ -138,7 +172,8 @@ export type OoxmlToolName =
| "ooxml_children"
| "ooxml_attributes"
| "ooxml_enum"
-| "ooxml_namespace";
+| "ooxml_namespace"
+| "ooxml_package_part";

const OOXML_TOOL_NAMES: ReadonlySet<string> = new Set(OOXML_TOOL_DEFS.map((t) => t.name));

@@ -318,6 +353,39 @@ export async function runOoxmlTool(
});
}

case "ooxml_package_part": {
  const contentType = typeof args.content_type === "string" ? args.content_type.trim() : "";
  const relationshipType =
    typeof args.relationship_type === "string" ? args.relationship_type.trim() : "";
  const query = typeof args.query === "string" ? args.query.trim() : "";

  if (contentType) {
    const hit = findPartByContentType(contentType);
    if (hit) return formatPackagePartReport(hit);
    return formatPackagePartNotFound("content type", contentType);
  }
  if (relationshipType) {
    const hits = findPartsByRelationshipType(relationshipType);
    if (hits.length === 1) return formatPackagePartReport(hits[0]);
    if (hits.length > 1) {
      // Shared rels (officeDocument across WML/SML/PML, customXml
      // across families) intentionally hit multiple parts.
      return formatPackagePartList(hits, {
        title: `Package parts using relationship '${relationshipType}'`,
        query: "",
        footer:
          "This relationship type is shared across package families. Disambiguate by the source part (the package's main part determines whether `.../relationships/officeDocument` points at a Word, Excel, or PowerPoint main part).",
      });
    }
    return formatPackagePartNotFound("relationship type", relationshipType);
  }
  const matches = searchParts(query);
  return formatPackagePartList(matches, {
    title: query ? `Package parts matching '${query}'` : "Curated OPC package parts",
    query,
  });
}

default: {
const _exhaustive: never = name;
throw new Error(`Unhandled OOXML tool: ${_exhaustive}`);
@@ -518,3 +586,81 @@ function formatNotFound(what: string, profile?: string, extras?: NotFoundExtras)
lines.push("- a different profile (currently only `transitional` is populated)");
return lines.join("\n");
}

function formatPackagePartReport(p: OpcPart): string {
  const lines: string[] = [];
  lines.push(`## OPC Part: ${p.name}`);
  lines.push("");
  lines.push(`- key: \`${p.key}\``);
  const cts = contentTypesOf(p);
  if (cts.length === 1) {
    lines.push(`- content type: \`${cts[0]}\``);
  } else {
    lines.push(`- content types: ${cts.map((c) => `\`${c}\``).join(", ")}`);
  }
  lines.push(
    `- source relationship: ${p.relationshipType ? `\`${p.relationshipType}\`` : "_(implicit, none)_"}`,
  );
  lines.push(
    `- root namespace: ${p.rootNamespace ? `\`${p.rootNamespace}\`` : "_(none; binary or arbitrary-XML payload)_"}`,
  );
  lines.push(`- root element: ${p.rootElement ? `\`${p.rootElement}\`` : "_(none)_"}`);
  lines.push(`- typical paths: ${p.typicalPaths.map((t) => `\`${t}\``).join(", ")}`);
  lines.push(`- package families: ${p.packageFamilies.join(", ")}`);
  lines.push(`- spec: ${p.sourceSections.join("; ")}`);
  if (p.notes) {
    lines.push("");
    lines.push(`**Notes**: ${p.notes}`);
  }
  return lines.join("\n");
}

function formatPackagePartList(
  matches: readonly OpcPart[],
  opts: { title: string; query: string; footer?: string },
): string {
  const lines: string[] = [];
  lines.push(`## ${opts.title}`);
  lines.push("");
  if (matches.length === 0) {
    lines.push("_(no matches)_");
    lines.push("");
    lines.push(
      "Try `ooxml_package_part` with no args to see the full list, or `ooxml_search` for prose references.",
    );
    return lines.join("\n");
  }
  lines.push("| key | name | content type | families |");
  lines.push("| --- | --- | --- | --- |");
  for (const p of matches) {
    const cts = contentTypesOf(p);
    // Show first canonical type plus a "+N" indicator if there are more,
    // so the table stays compact for image/* and similar enumerated sets.
    const ctCell = cts.length === 1 ? `\`${cts[0]}\`` : `\`${cts[0]}\` _(+${cts.length - 1} more)_`;
    lines.push(`| \`${p.key}\` | ${p.name} | ${ctCell} | ${p.packageFamilies.join(", ")} |`);
  }
  lines.push("");
  lines.push(
    opts.footer ??
      "Pass an exact `content_type` or `relationship_type` for the full report on a single part.",
  );
  return lines.join("\n");
}

function formatPackagePartNotFound(
  kind: "content type" | "relationship type",
  value: string,
): string {
  const lines: string[] = [];
  lines.push(`## Not found: OPC part with ${kind} '${value}'`);
  lines.push("");
  lines.push("Try one of:");
  lines.push(
    "- `ooxml_package_part` with a `query` substring (e.g. 'styles', 'customXml', 'theme')",
  );
  lines.push("- `ooxml_package_part` with no args to list every curated part");
  lines.push(
    "- `ooxml_search` if the part type is documented in spec prose but not yet curated here",
  );
  return lines.join("\n");
}
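
The `ooxml_package_part` handler in this file picks exactly one lookup mode per call. `content_type` wins over `relationship_type`, which wins over `query`. Non-string or whitespace-only values count as absent. A minimal sketch of that precedence follows, using a hypothetical `selectMode` helper that is not part of the PR and only mirrors the order of the `if` checks in the handler:

```typescript
type Mode = "content_type" | "relationship_type" | "query" | "list_all";

// Hypothetical helper mirroring the handler's argument precedence.
function selectMode(args: Record<string, unknown>): Mode {
  // Same normalisation as the handler: non-strings become "", strings are trimmed.
  const str = (v: unknown): string => (typeof v === "string" ? v.trim() : "");

  if (str(args.content_type)) return "content_type";
  if (str(args.relationship_type)) return "relationship_type";
  if (str(args.query)) return "query";
  return "list_all"; // no usable argument: list every curated part
}

console.log(selectMode({ content_type: "a/b", query: "styles" }));
console.log(selectMode({ content_type: "   ", relationship_type: "r" }));
console.log(selectMode({}));
```

One consequence of this ordering is that arguments are never combined: a call that passes both `content_type` and `query` is answered from `content_type` alone.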