
Add TTL/LRU/cap-managed cache behavior for multi CSV profiling#53

Merged
rad1092 merged 1 commit into main from codex/apply-ttl-and-lru-to-cache
Feb 15, 2026

Conversation


@rad1092 rad1092 commented Feb 15, 2026

Motivation

  • Provide bounded, self-governing caching for CSV profiling to limit disk growth and stale entries by introducing TTL, total-cap, and LRU eviction.
  • Prefer removing easy-to-regenerate intermediate files first to free space without losing important artifacts.
  • Avoid memory spikes when serializing large cached payloads by using a streaming/chunked write path.
  • Keep defaults conservative and tunable so performance is not degraded by the new behavior.

Description

  • Added cache governance constants and helpers in bitnet_tools/multi_csv.py: CACHE_ENTRY_TTL_SECONDS, CACHE_MAX_TOTAL_BYTES, cache index file (multi_csv_cache_index.json), and _cache_path/_index utilities to track last_access and size_bytes for entries.
  • Implemented TTL expiration, LRU eviction when total cache size exceeds the cap, and prioritized cleanup of rebuildable intermediates (*.tmp, *.partial, *.bak).
  • Switched cache serialization to a streaming JSON write (JSONEncoder.iterencode) with atomic temp-file replace to avoid large in-memory allocations, and added an env override BITNET_CACHE_STREAM_WRITE.
  • Hardened concurrent index writes by using per-process/per-thread temp filenames, ensured the index is updated when entries are removed, and added small helpers for recording access and removing entries.
  • Added tests in tests/test_analysis.py verifying TTL expiry and capacity/LRU eviction and adjusted expectations accordingly.

Testing

  • Ran pytest tests/test_analysis.py -q and pytest tests/test_cli.py::test_cli_multi_analyze_no_cache_flag -q, and all tests passed.
  • New unit tests added: test_multi_csv_cache_ttl_expires_entry and test_multi_csv_cache_capacity_uses_lru_eviction, both included in the test run and passing.
  • Final test outcome: 21 passed (test suite run locally against modified code).

Codex Task

@rad1092 rad1092 merged commit b8c194d into main Feb 15, 2026
4 checks passed
@rad1092 rad1092 deleted the codex/apply-ttl-and-lru-to-cache branch February 15, 2026 07:39

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dcc3f908d9


Comment thread bitnet_tools/multi_csv.py

def _cleanup_rebuildable_intermediates(index: dict[str, Any]) -> None:
    candidates = [
        *CACHE_DIR.glob('*.tmp'),

P1: Avoid deleting active cache temp files during cleanup

_cleanup_rebuildable_intermediates removes every *.tmp file in CACHE_DIR, but _write_json_maybe_stream uses *.tmp as the live write target before replace(). When analyze_multiple_csv(..., max_workers>1) runs, one worker can enter _enforce_cache_limits() and delete another worker’s in-flight temp file, causing tmp_path.replace(path) to raise FileNotFoundError and fail the analysis run.


Comment thread bitnet_tools/multi_csv.py
Comment on lines +349 to +350
index = _load_cache_index()
entries = index.setdefault('entries', {})

P2: Serialize cache-index updates to prevent lost entries

_record_cache_access does a read-modify-write of the whole index without synchronization, so concurrent workers can each load an older snapshot and then overwrite each other’s updates in _save_cache_index. This drops valid cache entries from the index, which breaks LRU/TTL accounting and allows on-disk cache files to evade capacity eviction.

