fix: handle ChatGPT bulk export list format and raise file size limit#940
Deen-Wong wants to merge 2 commits into MemPalace:develop
Conversation
- normalize.py: _try_chatgpt_json now handles a list of conversations (ChatGPT bulk export format) in addition to a single conversation dict. Adds _try_chatgpt_single() as the per-conversation parser.
- convo_miner.py: raise MAX_FILE_SIZE from 10MB to 200MB. ChatGPT exports often exceed 10MB per file, causing silent skips.

Tested against 208 conversations producing 6281 drawers.
The ChatGPT list-of-conversations handling looks like a valid addition. A couple of concerns with the rest of the diff, though:

- strip_noise() removal: this was deliberately added across three commits (9b99c13, ca2598a, 7e5eeda), and the diff also removes the Slack sanitization.
- MAX_FILE_SIZE at 200 MB: #396 was specifically about OOM on large transcript files, and 20x the current limit risks reintroducing that. The comment on line 58 still says "10 MB" too. #924 already adds SKIP logging so users know when files are skipped.

Would it make sense to split the ChatGPT list handling into its own focused PR? The normalize.py changes unrelated to that feature seem risky to land together.
Thanks for the detailed review — these are valid concerns. You're right: copying the full normalize.py from my local venv pulled in the strip_noise() and Slack sanitization removals, which were not intentional.

On MAX_FILE_SIZE: fair point on the OOM risk. Would 50MB be a reasonable middle ground? I'll also fix the stale "10 MB" comment.

I'll revert to a targeted change — only the ChatGPT list handling in normalize.py, plus the size limit bump.
- normalize.py: _try_chatgpt_json now handles a list of conversations (ChatGPT bulk export format) in addition to a single conversation dict. Adds _try_chatgpt_single() as the per-conversation parser. strip_noise() and Slack sanitization left intact.
- convo_miner.py: raise MAX_FILE_SIZE from 10MB to 50MB. ChatGPT exports often exceed 10MB per file, causing silent skips. Updated comment to match new value.

Tested against 208 conversations producing 6281 drawers.
Force-pushed the branch from b45c5db to da90736
What this fixes
Two bugs that prevent ChatGPT bulk exports from being mined correctly.
Bug 1: File size limit too low (convo_miner.py)
MAX_FILE_SIZE was set to 10MB. ChatGPT exports split conversation history across multiple files, many of which exceed 10MB. Files over the limit are silently skipped with no warning to the user.
Fix: raised to 200MB.
Bug 2: ChatGPT list format not handled (normalize.py)
_try_chatgpt_json() only handled a single conversation object (dict with mapping key). ChatGPT bulk exports are lists of conversation objects. When a list was passed, the function returned None and fell through to _try_slack_json(), producing 0 or 1 drawers instead of hundreds.
Fix: added list handling to _try_chatgpt_json() and extracted _try_chatgpt_single() as the per-conversation parser.
How to test
Export your ChatGPT history and run:
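As a quick sanity check that the bulk list format is being picked up, you can count the conversation objects in the export independently of the miner. This helper is hypothetical (the filename and the `mapping`-key heuristic are assumptions about the export format):

```python
import json

def count_conversations(path):
    """Count conversation objects in a ChatGPT export, whether the file
    holds a single conversation dict or a bulk-export list of them."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, list):
        return sum(1 for item in data if isinstance(item, dict) and "mapping" in item)
    return 1 if isinstance(data, dict) and "mapping" in data else 0

# Example: count_conversations("conversations.json") should match the number
# of conversations the miner reports (e.g. 208 in the test run above).
```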