Recognize whole-line bold as level-1 heading in markdown parser #250
extract_nodes_from_markdown now matches `**Title**` lines as level-1
headings (alongside ATX `#` headings) and attaches the heading level
on the producer side. extract_node_text_content reads the level from
the node instead of re-running a `^#{1,6}` regex on the source line,
which was silently dropping bold-heading nodes from OCR / MinerU output.
Bold maps to level 1 even when mixed with `#` / `##` / `###` — bold-as-
heading is a courtesy heuristic for non-ATX markdown sources, and
CommonMark has no concept of bold heading depth.
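As a rough illustration of the producer-side behavior described above, the following sketch scans lines for both heading styles and stores the level on each node. The bold pattern is the one quoted in the PR description; the ATX pattern, the function name `extract_heading_nodes`, and the omission of code-fence handling are assumptions made for this example, not the actual implementation.

```python
import re

# Bold pattern as quoted in the PR description.
BOLD_HEADING = re.compile(r'^\*\*(.+?)\*\*\s*$')
# Assumed ATX pattern; the real parser's pattern is not shown in this PR text.
ATX_HEADING = re.compile(r'^(#{1,6})\s+(.+)$')


def extract_heading_nodes(markdown_text):
    """Hypothetical helper: return heading nodes with a precomputed level."""
    nodes = []
    for line_num, line in enumerate(markdown_text.split('\n'), start=1):
        stripped = line.strip()
        atx = ATX_HEADING.match(stripped)
        if atx:
            # ATX depth is the number of leading '#' characters.
            nodes.append({'node_title': atx.group(2).strip(),
                          'line_num': line_num,
                          'level': len(atx.group(1))})
            continue
        bold = BOLD_HEADING.match(stripped)
        if bold:
            # Whole-line bold always maps to level 1.
            nodes.append({'node_title': bold.group(1).strip(),
                          'line_num': line_num,
                          'level': 1})
    return nodes


sample = "# Intro\n**Overview**\nSome text with **bold** in the middle\n## Details"
nodes = extract_heading_nodes(sample)
```

Because the bold pattern is anchored at both ends, the third line of `sample` produces no node.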
Pull request overview
This PR extends the Markdown-to-tree parser to recognize whole-line bold (**Title**) as a level-1 heading, improving node extraction for OCR/PDF-to-Markdown outputs that emit headings as bold lines.
Changes:
- Added detection of whole-line bold as a level-1 heading during node extraction.
- Persisted the computed heading `level` in the extracted node metadata rather than re-deriving it later.
- Simplified `extract_node_text_content` to trust the precomputed `level` field.
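To see why re-deriving the level on the consumer side loses bold headings, here is a small sketch that contrasts the two approaches. Both function names are hypothetical and the body of `extract_node_text_content` is not reproduced here; only the `^(#{1,6})` regex and the `level` key come from the PR.

```python
import re


def level_from_source_line(line):
    """Old approach (illustrative): re-derive the level from the raw line."""
    match = re.match(r'^(#{1,6})', line.strip())
    return len(match.group(1)) if match else None


def level_from_node(node):
    """New approach (illustrative): trust the level stored by the producer."""
    return node['level']


bold_node = {'node_title': 'Title', 'line_num': 1, 'level': 1}
old_result = level_from_source_line('**Title**')  # no '#' prefix to match
new_result = level_from_node(bold_node)
```

A consumer that skips nodes when the regex fails would drop the bold heading, whereas reading the stored field keeps it.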
    bold_match = re.match(bold_heading_pattern, stripped_line)
    if bold_match:
        title = bold_match.group(1).strip()
        node_list.append({'node_title': title, 'line_num': line_num, 'level': 1})
`bold_heading_pattern` will match lines like `** **` (only whitespace inside the bold markers). Because the capture group is stripped before storing, this can create nodes with an empty `node_title`, which later yields empty titles in the tree. Consider tightening the regex to require non-whitespace content inside `**...**`, or add a guard to skip appending when the stripped title is empty.
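One way the suggested guard could look is sketched below. It keeps the pattern quoted in the PR description and rejects titles that are empty after stripping; the helper name `bold_heading_title` is hypothetical and this is not necessarily the fix the author adopted.

```python
import re

# Pattern as quoted in the PR description.
BOLD_HEADING = re.compile(r'^\*\*(.+?)\*\*\s*$')


def bold_heading_title(stripped_line):
    """Hypothetical helper: return the title, or None if the line is not a usable bold heading."""
    match = BOLD_HEADING.match(stripped_line)
    if not match:
        return None
    title = match.group(1).strip()
    # Guard: whitespace-only content between the markers is not a heading.
    return title or None
```

The caller would append a node only when the helper returns a non-`None` title.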
Summary
- `extract_nodes_from_markdown` now matches whole-line bold (`**Title**`) as a level-1 heading, in addition to the existing ATX (`#`) headings. Markdown produced by Mistral OCR and similar PDF-to-markdown pipelines often emits visual headings as bold lines, which previously yielded zero nodes and an empty tree.
- The producer attaches `level` onto each `node_list` entry on the way out.
- `extract_node_text_content` reads `node['level']` instead of re-running the `^(#{1,6})` regex on the source line. The old re-derivation silently dropped any heading whose syntax it did not recognize.
- The bold pattern is anchored to the whole line (`^\*\*(.+?)\*\*\s*$`), so inline emphasis like `Some text with **bold** in the middle` is intentionally not recognized as a heading.

Mixed-document semantics
- In a document that mixes `# Heading` and `**Bold**`, bold maps to level 1, sibling of `# Heading`. CommonMark has no concept of bold-heading depth, so a flat level-1 mapping is the only non-arbitrary rule.
- `## Sub` attaches to whichever level-1 ancestor is most recent in source order, which may be the `**Bold**` line rather than the earlier `# Heading`. This is documented heuristic behavior of the bold-as-heading extension and is not a bug.

Scope
- `pageindex/page_index_md.py` only. The PDF path (`page_index.py`) is fully LLM-driven and does not touch this regex; it is unchanged.
- No changes to `requirements.txt`.

Validation
- `**Bold**`-only markdown that previously returned an empty tree now produces a non-empty tree with one level-1 node per bold line.
- Lines such as `Some text with **bold** inside` are not picked up as headings.
- Mixed `#` + `**` input attaches the right child nodes under the most recent level-1 ancestor.
- `#`-only regression baseline preserved (level extraction unchanged for ATX headings).
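The mixed-document rule can be illustrated with a small stack-based nesting sketch. The repository's real tree builder is not shown in this PR text, so `build_tree` and its output shape are assumptions made purely to demonstrate the "most recent level-1 ancestor" behavior.

```python
def build_tree(nodes):
    """Hypothetical tree builder: nest each node under the nearest shallower node."""
    roots = []
    stack = []  # open ancestors, shallowest first
    for node in nodes:
        item = {'title': node['node_title'], 'level': node['level'], 'children': []}
        # Close every ancestor that is at the same depth or deeper.
        while stack and stack[-1]['level'] >= item['level']:
            stack.pop()
        if stack:
            stack[-1]['children'].append(item)
        else:
            roots.append(item)
        stack.append(item)
    return roots


# Nodes as the producer would emit them for:
#   # Heading
#   **Bold**
#   ## Sub
nodes = [
    {'node_title': 'Heading', 'line_num': 1, 'level': 1},
    {'node_title': 'Bold', 'line_num': 2, 'level': 1},
    {'node_title': 'Sub', 'line_num': 3, 'level': 2},
]
tree = build_tree(nodes)
```

Under this sketch, `Sub` ends up as a child of `Bold` rather than `Heading`, matching the heuristic described under mixed-document semantics.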