Recognize whole-line bold as level-1 heading in markdown parser by BukeLy · Pull Request #250 · VectifyAI/PageIndex

BukeLy · 2026-04-28T07:33:07Z

Summary

extract_nodes_from_markdown now matches whole-line bold (**Title**) as a level-1 heading, in addition to the existing ATX (#) headings. Markdown produced by Mistral OCR and similar PDF-to-markdown pipelines often emits visual headings as bold lines, which previously yielded zero nodes and an empty tree.
The producer attaches the heading level onto each node_list entry on the way out. extract_node_text_content reads node['level'] instead of re-running the ^(#{1,6}) regex on the source line. The old re-derivation silently dropped any heading whose syntax it did not recognize.
The bold pattern is anchored to the whole stripped line (^\*\*(.+?)\*\*\s*$), so inline emphasis like Some text with **bold** in the middle is intentionally not recognized as a heading.
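As a rough illustration of the producer-side change described above, the following sketch attaches `level` to each entry as it is found. The bold pattern is the one quoted in the summary; the ATX pattern, the function name `extract_heading_nodes`, and the lack of fenced-code-block handling are assumptions made for the example, not the actual contents of pageindex/page_index_md.py.

```python
import re

# ATX pattern is an assumption; the bold pattern is quoted from the PR summary.
ATX_HEADING = re.compile(r'^(#{1,6})\s+(.+)$')
BOLD_HEADING = re.compile(r'^\*\*(.+?)\*\*\s*$')


def extract_heading_nodes(markdown_text):
    """Return one dict per heading line, with the level attached up front.

    Hypothetical stand-in for extract_nodes_from_markdown; it does not
    skip fenced code blocks or handle any other special cases.
    """
    node_list = []
    for line_num, line in enumerate(markdown_text.split('\n'), start=1):
        stripped_line = line.strip()

        atx_match = ATX_HEADING.match(stripped_line)
        if atx_match:
            node_list.append({
                'node_title': atx_match.group(2).strip(),
                'line_num': line_num,
                'level': len(atx_match.group(1)),  # number of '#' characters
            })
            continue

        bold_match = BOLD_HEADING.match(stripped_line)
        if bold_match:
            node_list.append({
                'node_title': bold_match.group(1).strip(),
                'line_num': line_num,
                'level': 1,  # whole-line bold is always level 1
            })
    return node_list


sample = "**Intro**\nSome text with **bold** in the middle\n## Sub\n# Top"
nodes = extract_heading_nodes(sample)
```

Because every entry already carries `level`, a downstream consumer can read `node['level']` directly and never needs to re-parse the source line, which is what lets bold headings survive.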

Mixed-document semantics

If a document mixes # Heading and **Bold**, bold maps to level 1, sibling of # Heading. CommonMark has no concept of bold-heading depth, so a flat level-1 mapping is the only non-arbitrary rule.
A subsequent ## Sub attaches to whichever level-1 ancestor is most recent in source order, which may be the **Bold** line rather than the earlier # Heading. This is documented heuristic behavior of the bold-as-heading extension and is not a bug.
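The attachment rule can be sketched with a simple stack of open ancestors. This is a minimal illustration under the assumption that the tree builder nests purely by `level` in source order; `build_tree` and the output keys `title` and `nodes` are hypothetical names, not the project's actual ones.

```python
def build_tree(node_list):
    """Nest flat heading entries by level, in source order."""
    root = []
    stack = []  # currently open ancestor nodes, shallowest first
    for entry in node_list:
        node = {'title': entry['node_title'], 'level': entry['level'], 'nodes': []}
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1]['level'] >= node['level']:
            stack.pop()
        if stack:
            stack[-1]['nodes'].append(node)  # child of the nearest shallower heading
        else:
            root.append(node)
        stack.append(node)
    return root


flat = [
    {'node_title': 'Heading', 'level': 1},  # from "# Heading"
    {'node_title': 'Bold', 'level': 1},     # from "**Bold**"
    {'node_title': 'Sub', 'level': 2},      # from "## Sub"
]
tree = build_tree(flat)
```

In this example `Sub` ends up under `Bold`, not under `Heading`, because `Bold` is the most recent level-1 entry when `## Sub` is reached.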

Scope

pageindex/page_index_md.py only. The PDF path (page_index.py) is fully LLM-driven and does not touch this regex; it is unchanged.
No new third-party dependencies. No changes to requirements.txt.

Validation

Verified locally: an all-**Bold** markdown that previously returned an empty tree now produces a non-empty tree with one level-1 node per bold line.
Verified the inline-bold negative case: lines like Some text with **bold** inside are not picked up as headings.
Verified mixed # + ** input attaches the right child nodes under the most recent level-1 ancestor.
Verified all-# regression baseline preserved (level extraction unchanged for ATX headings).

extract_nodes_from_markdown now matches `**Title**` lines as level-1 headings (alongside ATX `#` headings) and attaches the heading level on the producer side. extract_node_text_content reads the level from the node instead of re-running a `^#{1,6}` regex on the source line, which was silently dropping bold-heading nodes from OCR / MinerU output. Bold maps to level 1 even when mixed with `#` / `##` / `###`: bold-as-heading is a courtesy heuristic for non-ATX markdown sources, and CommonMark has no concept of bold heading depth.

Copilot

Pull request overview

This PR extends the Markdown-to-tree parser to recognize whole-line bold (**Title**) as a level-1 heading, improving node extraction for OCR/PDF-to-Markdown outputs that emit headings as bold lines.

Changes:

Added detection of whole-line bold as a level-1 heading during node extraction.
Persisted computed heading level in the extracted node metadata rather than re-deriving it later.
Simplified extract_node_text_content to trust the precomputed level field.

Copilot · 2026-04-28T07:38:14Z

+            bold_match = re.match(bold_heading_pattern, stripped_line)
+            if bold_match:
+                title = bold_match.group(1).strip()
+                node_list.append({'node_title': title, 'line_num': line_num, 'level': 1})


bold_heading_pattern will match lines like ** ** (only whitespace inside the bold markers). Because the capture group is stripped before storing, this can create nodes with an empty node_title, which later yields empty titles in the tree. Consider tightening the regex to require non-whitespace content inside **...**, or add a guard to skip appending when the stripped title is empty.
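Both options from the comment above can be sketched as follows. The first pattern is the one quoted in the PR summary; `BOLD_HEADING_STRICT` and the helper `bold_title` are hypothetical names for illustration, and neither is claimed to be what the PR ultimately adopted.

```python
import re

# Pattern quoted in the PR summary.
BOLD_HEADING = re.compile(r'^\*\*(.+?)\*\*\s*$')
# Hypothetical tightened variant: content must start with a non-whitespace character.
BOLD_HEADING_STRICT = re.compile(r'^\*\*(\S.*?)\*\*\s*$')


def bold_title(stripped_line):
    """Return the bold heading title, or None if the line is not a usable heading.

    Guard-based option: keep the original pattern, but reject titles
    that are empty after stripping.
    """
    match = BOLD_HEADING.match(stripped_line)
    if not match:
        return None
    title = match.group(1).strip()
    return title or None
```

The two options differ slightly: the strict pattern also rejects a line such as `** Title**` because of the leading space inside the markers, while the guard accepts it and stores `Title`. The guard is therefore the more permissive choice.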

BukeLy requested a review from Copilot April 28, 2026 07:34

Copilot started reviewing on behalf of BukeLy April 28, 2026 07:36

BukeLy wants to merge 1 commit into main from feat/md-bold-heading-recognition