
fix: handle ChatGPT bulk export list format and raise file size limit#940

Open
Deen-Wong wants to merge 2 commits into MemPalace:develop from Deen-Wong:fix/chatgpt-list-format-and-file-size-limit

Conversation


@Deen-Wong commented Apr 16, 2026

What this fixes

Two bugs that prevent ChatGPT bulk exports from being mined correctly.

Bug 1: File size limit too low (convo_miner.py)

MAX_FILE_SIZE was set to 10MB. ChatGPT exports split conversation history across multiple files, many of which exceed 10MB. Files over the limit are silently skipped with no warning to the user.

Fix: raised to 200MB.

Bug 2: ChatGPT list format not handled (normalize.py)

_try_chatgpt_json() only handled a single conversation object (dict with mapping key). ChatGPT bulk exports are lists of conversation objects. When a list was passed, the function returned None and fell through to _try_slack_json(), producing 0 or 1 drawers instead of hundreds.

Fix: added list handling to _try_chatgpt_json() and extracted _try_chatgpt_single() as the per-conversation parser.

How to test

Export your ChatGPT history and run:

- normalize.py: _try_chatgpt_json now handles list of conversations
  (ChatGPT bulk export format) in addition to single conversation dict.
  Adds _try_chatgpt_single() as the per-conversation parser.

- convo_miner.py: raise MAX_FILE_SIZE from 10MB to 200MB.
  ChatGPT exports often exceed 10MB per file, causing silent skips.

Tested against 208 conversations producing 6281 drawers.
@mvalentsev
Contributor

The ChatGPT list-of-conversations handling looks like a valid addition. A couple of concerns with the rest of the diff though:

strip_noise() removal -- this was deliberately added across three commits (9b99c13, ca2598a, 7e5eeda) and has a NORMALIZE_VERSION schema gate in palace.py so existing drawers get silently rebuilt. Removing it means Claude Code system tags, hook output, and UI chrome all end up in drawers again and pollute search results.

Slack sanitization removal -- the re.sub on user IDs guards against chunk-boundary injection via crafted exports, and the [{user_id}] prefix preserves who said what in multi-party chats. Dropping both is a security/data regression.

MAX_FILE_SIZE 200 MB -- #396 was specifically about OOM on large transcript files. 20x the current limit risks reintroducing that. The comment on line 58 still says "10 MB" too. #924 already adds SKIP logging so users know when files are skipped.
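For context, the skip path under discussion is roughly of this shape. This is a hypothetical sketch, not the actual convo_miner.py code: the `MAX_FILE_SIZE` name comes from the PR, while `should_mine` and the logging call are assumptions illustrating the SKIP logging that #924 is said to add.

```python
import logging
import os

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB; the PR proposes raising this

log = logging.getLogger(__name__)


def should_mine(path: str) -> bool:
    """Return False (and log a SKIP) for files over the size limit."""
    size = os.path.getsize(path)
    if size > MAX_FILE_SIZE:
        # The silent-skip bug: without a log line here, oversized export
        # files were dropped with no warning to the user (cf. #924).
        log.warning("SKIP %s: %d bytes exceeds limit of %d",
                    path, size, MAX_FILE_SIZE)
        return False
    return True
```

With the SKIP log in place, raising the constant becomes a tuning question (OOM risk vs. coverage) rather than a correctness fix, which is the trade-off weighed above.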

Would it make sense to split the ChatGPT list handling into its own focused PR? The normalize.py changes unrelated to that feature seem risky to land together.

@Deen-Wong
Author

Thanks for the detailed review — these are valid concerns.

You're right that copying the full normalize.py from my local venv
inadvertently included changes beyond the ChatGPT list fix. The
strip_noise() removal and Slack sanitization changes were not
intentional — I should have done a surgical diff instead.

On MAX_FILE_SIZE: fair point on the OOM risk. Would 50MB be a
reasonable middle ground, or is there a better approach given #924
adds SKIP logging?

I'll revert to a targeted change — only the ChatGPT list handling
in normalize.py and the file size adjustment in convo_miner.py,
leaving strip_noise() and Slack sanitization intact.

@Deen-Wong
Author

Thanks for the detailed review — these are valid concerns.
The strip_noise() removal and Slack sanitization changes were unintentional — I copied the full normalize.py from my local patched venv instead of doing a surgical diff. I've reverted those and the new commit contains only the ChatGPT list handling.
On MAX_FILE_SIZE: raised to 50MB as a middle ground rather than 200MB. Happy to defer to whatever the maintainers prefer, given the OOM history in #396 — the comment on line 58 is also updated to match.
The diff is now focused: only _try_chatgpt_json and _try_chatgpt_single in normalize.py, and the file size constant in convo_miner.py.

- normalize.py: _try_chatgpt_json now handles list of conversations
  (ChatGPT bulk export format) in addition to single conversation dict.
  Adds _try_chatgpt_single() as the per-conversation parser.
  strip_noise() and Slack sanitization left intact.

- convo_miner.py: raise MAX_FILE_SIZE from 10MB to 50MB.
  ChatGPT exports often exceed 10MB per file, causing silent skips.
  Updated comment to match new value.

Tested against 208 conversations producing 6281 drawers.
@Deen-Wong force-pushed the fix/chatgpt-list-format-and-file-size-limit branch from b45c5db to da90736 on April 16, 2026 at 07:29
