Add TTL/LRU/cap-managed cache behavior for multi CSV profiling#53
Add TTL/LRU/cap-managed cache behavior for multi CSV profiling#53
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dcc3f908d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| def _cleanup_rebuildable_intermediates(index: dict[str, Any]) -> None: | ||
| candidates = [ | ||
| *CACHE_DIR.glob('*.tmp'), |
There was a problem hiding this comment.
Avoid deleting active cache temp files during cleanup
_cleanup_rebuildable_intermediates removes every *.tmp file in CACHE_DIR, but _write_json_maybe_stream uses *.tmp as the live write target before replace(). When analyze_multiple_csv(..., max_workers>1) runs, one worker can enter _enforce_cache_limits() and delete another worker’s in-flight temp file, causing tmp_path.replace(path) to raise FileNotFoundError and fail the analysis run.
Useful? React with 👍 / 👎.
| index = _load_cache_index() | ||
| entries = index.setdefault('entries', {}) |
There was a problem hiding this comment.
Serialize cache-index updates to prevent lost entries
_record_cache_access does a read-modify-write of the whole index without synchronization, so concurrent workers can each load an older snapshot and then overwrite each other’s updates in _save_cache_index. This drops valid cache entries from the index, which breaks LRU/TTL accounting and allows on-disk cache files to evade capacity eviction.
Useful? React with 👍 / 👎.
Motivation
Description
bitnet_tools/multi_csv.py:CACHE_ENTRY_TTL_SECONDS,CACHE_MAX_TOTAL_BYTES, cache index file (multi_csv_cache_index.json), and_cache_path/_index utilities to tracklast_accessandsize_bytesfor entries.*.tmp,*.partial,*.bak).JSONEncoder.iterencode) with atomic temp-file replace to avoid large in-memory allocations, and added an env overrideBITNET_CACHE_STREAM_WRITE.tests/test_analysis.pyverifying TTL expiry and capacity/LRU eviction and adjusted expectations accordingly.Testing
pytest tests/test_analysis.py -qandpytest tests/test_cli.py::test_cli_multi_analyze_no_cache_flag -q, and all tests passed.test_multi_csv_cache_ttl_expires_entryandtest_multi_csv_cache_capacity_uses_lru_eviction, both included in the test run and passing.21 passed(test suite run locally against modified code).Codex Task