fix: comprehensive crash guards for malformed LLM output #218
```diff
@@ -119,7 +119,7 @@ def toc_detector_single_page(content, model=None):
     response = llm_completion(model=model, prompt=prompt)
     # print('response', response)
     json_content = extract_json(response)
-    return json_content['toc_detected']
+    return json_content.get('toc_detected', 'no')
```
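The pattern this hunk introduces can be illustrated in isolation (the dict contents below are hypothetical, not taken from the PR): indexing a missing key raises `KeyError`, while `.get()` degrades to a safe default.

```python
# Malformed LLM output: the model returned prose and the expected key is absent.
json_content = {"summary": "model returned prose, no flag"}

# json_content['toc_detected']  # would raise KeyError and crash the pipeline
toc_detected = json_content.get('toc_detected', 'no')  # falls back to 'no'
print(toc_detected)  # → no
```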
```diff
@@ -137,7 +137,7 @@ def check_if_toc_extraction_is_complete(content, toc, model=None):
     prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
     response = llm_completion(model=model, prompt=prompt)
     json_content = extract_json(response)
-    return json_content['completed']
+    return json_content.get('completed', 'no')
```
|
Comment on lines
137
to
+140
```diff
@@ -155,7 +155,7 @@ def check_if_toc_transformation_is_complete(content, toc, model=None):
     prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
     response = llm_completion(model=model, prompt=prompt)
     json_content = extract_json(response)
-    return json_content['completed']
+    return json_content.get('completed', 'no')
```
|
Comment on lines
155
to
+158
```diff
@@ -217,7 +217,7 @@ def detect_page_index(toc_content, model=None):
     response = llm_completion(model=model, prompt=prompt)
     json_content = extract_json(response)
-    return json_content['page_index_given_in_toc']
+    return json_content.get('page_index_given_in_toc', 'no')
```
Suggested change:

```diff
-    return json_content.get('page_index_given_in_toc', 'no')
+    if isinstance(json_content, dict):
+        return json_content.get('page_index_given_in_toc', 'no')
+    return 'no'
```
Copilot AI · Apr 6, 2026
`convert_page_to_int()` assumes each element is a dict and will misbehave or crash if `table_of_contents` contains non-dicts (e.g., a list of strings from malformed LLM output). Since `toc_transformer()` returns this value and downstream code assumes dict items, consider validating that `table_of_contents` is a list of dicts (filtering or coercing) before calling `convert_page_to_int()` or returning.
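A minimal sketch of the validation this comment suggests (the helper name `sanitize_toc` is hypothetical, not code from the PR):

```python
def sanitize_toc(table_of_contents):
    """Keep only dict entries so downstream code that assumes dict items
    (e.g. a page-number converter) cannot crash on malformed LLM output
    such as bare strings or numbers mixed into the list."""
    if not isinstance(table_of_contents, list):
        return []
    return [item for item in table_of_contents if isinstance(item, dict)]

print(sanitize_toc([{"title": "Intro", "page": 1}, "garbled line", 42]))
# → [{'title': 'Intro', 'page': 1}]
```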
Copilot AI · Apr 6, 2026
`process_none_page_numbers()` still assumes each TOC item is a dict and that a `'page'` key exists (later `del item_copy['page']` / `del item['page']`). With malformed LLM output (missing keys or non-dict items), this can raise a `TypeError` or `KeyError` before the later `meta_processor` dict filter runs. Consider adding `isinstance(item, dict)` checks and using `pop('page', None)` instead of an unconditional `del`.
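The `del`-versus-`pop` point can be sketched as follows (`strip_page_key` is a hypothetical helper, not code from the PR):

```python
def strip_page_key(items):
    """Defensively drop the 'page' key from TOC items: the isinstance
    check skips non-dict items, and pop('page', None) never raises,
    unlike an unconditional `del item['page']`."""
    cleaned = []
    for item in items:
        if isinstance(item, dict):
            item = dict(item)       # shallow copy, leave caller's data intact
            item.pop('page', None)  # no KeyError when 'page' is absent
        cleaned.append(item)
    return cleaned

print(strip_page_key([{"title": "A", "page": 3}, {"title": "B"}, "noise"]))
# → [{'title': 'A'}, {'title': 'B'}, 'noise']
```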
Copilot AI · Apr 6, 2026
`extract_json()` can return a non-dict (e.g., a JSON list), in which case `json_content.get('physical_index')` will raise. Add an `isinstance(json_content, dict)` guard (with a default of `None`) before accessing `physical_index` to keep this fixer resilient to malformed LLM output.
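The guard this comment asks for could look like the following (`safe_get_field` is a hypothetical name; the real `extract_json()` is not reproduced here):

```python
def safe_get_field(json_content, key, default=None):
    """Guard against a JSON extractor yielding a list or string instead
    of a dict: only dicts support .get(), so anything else returns the
    default instead of raising AttributeError."""
    if isinstance(json_content, dict):
        return json_content.get(key, default)
    return default

print(safe_get_field({"physical_index": 12}, "physical_index"))  # → 12
print(safe_get_field(["not", "a", "dict"], "physical_index"))    # → None
```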
`extract_json()` can return a non-dict (e.g., a JSON list when the LLM outputs `[...]`). In that case, calling `.get(...)` will raise `AttributeError` and reintroduce a crash path. Consider guarding with `isinstance(json_content, dict)` (and defaulting to `'no'`) before accessing `toc_detected`.