Recognize whole-line bold as level-1 heading in markdown parser #250

Open
BukeLy wants to merge 1 commit into main from feat/md-bold-heading-recognition

BukeLy (Collaborator) commented Apr 28, 2026

Summary

  • extract_nodes_from_markdown now matches whole-line bold (**Title**) as a level-1 heading, in addition to the existing ATX (#) headings. Mistral OCR and similar PDF-to-markdown pipelines often emit visual headings as bold lines; such input previously yielded zero nodes and an empty tree.
  • The producer attaches the heading level onto each node_list entry on the way out. extract_node_text_content reads node['level'] instead of re-running the ^(#{1,6}) regex on the source line. The old re-derivation silently dropped any heading whose syntax it did not recognize.
  • The bold pattern is anchored to the whole stripped line (^\*\*(.+?)\*\*\s*$), so inline emphasis like Some text with **bold** in the middle is intentionally not recognized as a heading.
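
The producer-side change described above can be sketched as follows. This is a minimal illustration, not the code in pageindex/page_index_md.py: the function name extract_heading_nodes and the exact ATX pattern are assumptions, and the sketch ignores fenced code blocks. Only the bold pattern and the node keys (node_title, line_num, level) come from this PR.

```python
import re

# Assumed ATX pattern, modeled on the ^(#{1,6}) regex mentioned in the summary.
ATX_HEADING = re.compile(r'^(#{1,6})\s+(.+)$')
# Bold pattern as quoted in the summary: anchored to the whole stripped line.
BOLD_HEADING = re.compile(r'^\*\*(.+?)\*\*\s*$')


def extract_heading_nodes(markdown_text):
    """Hypothetical helper: return one dict per heading line, level attached."""
    node_list = []
    for line_num, line in enumerate(markdown_text.split('\n'), start=1):
        stripped_line = line.strip()

        atx_match = ATX_HEADING.match(stripped_line)
        if atx_match:
            node_list.append({
                'node_title': atx_match.group(2).strip(),
                'line_num': line_num,
                'level': len(atx_match.group(1)),  # number of '#' characters
            })
            continue

        bold_match = BOLD_HEADING.match(stripped_line)
        if bold_match:
            node_list.append({
                'node_title': bold_match.group(1).strip(),
                'line_num': line_num,
                'level': 1,  # whole-line bold always maps to level 1
            })
    return node_list


sample = "**Introduction**\nSome text with **bold** in the middle\n## Background"
nodes = extract_heading_nodes(sample)
```

Because each entry already carries its level, a consumer can read node['level'] directly and never needs to re-parse the source line, so a heading syntax the consumer does not know about cannot be dropped there.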

Mixed-document semantics

  • If a document mixes # Heading and **Bold**, bold maps to level 1, sibling of # Heading. CommonMark has no concept of bold-heading depth, so a flat level-1 mapping is the only non-arbitrary rule.
  • A subsequent ## Sub attaches to whichever level-1 ancestor is most recent in source order, which may be the **Bold** line rather than the earlier # Heading. This is documented heuristic behavior of the bold-as-heading extension and is not a bug.
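
To make the attachment rule concrete, here is a small sketch of a stack-based tree builder. The function build_tree and its output shape are hypothetical and are not taken from the repository; the sketch only assumes that each node carries the node_title and level fields described in the summary.

```python
def build_tree(node_list):
    """Hypothetical builder: nest each node under the nearest shallower heading."""
    root = []
    stack = []  # currently open ancestors, deepest last
    for node in node_list:
        entry = {
            'title': node['node_title'],
            'level': node['level'],
            'children': [],
        }
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1]['level'] >= entry['level']:
            stack.pop()
        if stack:
            stack[-1]['children'].append(entry)
        else:
            root.append(entry)
        stack.append(entry)
    return root


nodes = [
    {'node_title': 'Heading', 'level': 1},  # from "# Heading"
    {'node_title': 'Bold', 'level': 1},     # from "**Bold**"
    {'node_title': 'Sub', 'level': 2},      # from "## Sub"
]
tree = build_tree(nodes)
```

In this sketch the **Bold** line closes # Heading because both are level 1, so ## Sub ends up as a child of Bold rather than of Heading.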

Scope

  • pageindex/page_index_md.py only. The PDF path (page_index.py) is fully LLM-driven and does not touch this regex; it is unchanged.
  • No new third-party dependencies. No changes to requirements.txt.

Validation

  • Verified locally: an all-**Bold** markdown that previously returned an empty tree now produces a non-empty tree with one level-1 node per bold line.
  • Verified the inline-bold negative case: lines like Some text with **bold** inside are not picked up as headings.
  • Verified mixed # + ** input attaches the right child nodes under the most recent level-1 ancestor.
  • Verified all-# regression baseline preserved (level extraction unchanged for ATX headings).

extract_nodes_from_markdown now matches `**Title**` lines as level-1
headings (alongside ATX `#` headings) and attaches the heading level
on the producer side. extract_node_text_content reads the level from
the node instead of re-running a `^#{1,6}` regex on the source line,
which was silently dropping bold-heading nodes from OCR / MinerU output.

Bold maps to level 1 even when mixed with `#` / `##` / `###` — bold-as-
heading is a courtesy heuristic for non-ATX markdown sources, and
CommonMark has no concept of bold heading depth.

Copilot AI left a comment

Pull request overview

This PR extends the Markdown-to-tree parser to recognize whole-line bold (**Title**) as a level-1 heading, improving node extraction for OCR/PDF-to-Markdown outputs that emit headings as bold lines.

Changes:

  • Added detection of whole-line bold as a level-1 heading during node extraction.
  • Persisted computed heading level in the extracted node metadata rather than re-deriving it later.
  • Simplified extract_node_text_content to trust the precomputed level field.

Comment on lines +62 to +65
    bold_match = re.match(bold_heading_pattern, stripped_line)
    if bold_match:
        title = bold_match.group(1).strip()
        node_list.append({'node_title': title, 'line_num': line_num, 'level': 1})

Copilot AI Apr 28, 2026

bold_heading_pattern will match lines like ** ** (only whitespace inside the bold markers). Because the capture group is stripped before storing, this can create nodes with an empty node_title, which later yields empty titles in the tree. Consider tightening the regex to require non-whitespace content inside **...**, or add a guard to skip appending when the stripped title is empty.
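
One possible shape for the guard suggested here, as a sketch only: the helper name bold_heading_title is hypothetical, and the pattern is the one quoted in the PR description rather than a copy of the repository code.

```python
import re

# Pattern as quoted in the PR description.
BOLD_HEADING = re.compile(r'^\*\*(.+?)\*\*\s*$')


def bold_heading_title(stripped_line):
    """Hypothetical helper: return the bold title, or None if unusable."""
    bold_match = BOLD_HEADING.match(stripped_line)
    if not bold_match:
        return None
    title = bold_match.group(1).strip()
    # Guard: "** **" matches the regex but strips to an empty title.
    return title or None
```

The caller would then append a node only when the returned title is not None, which keeps empty-title nodes out of the tree without changing the regex.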
