Tiered content loading (L0/L1/L2) for search results #61

@lesaai

Summary

Store three resolution levels for each chunk. Search returns the smallest useful level first. Load deeper levels on demand. Reduces token consumption by up to 90% on search results.

Proposed layers

Layer          Size          Purpose
L0 (Abstract)  ~100 tokens   One-sentence summary. Vector search and quick relevance check.
L1 (Overview)  ~1-2K tokens  Key points, enough for planning. "Use this first."
L2 (Detail)    Full chunk    Original text. Loaded only when confirmed needed.

How it would work

Ingestion: When crystal.ingest() stores a chunk (L2), also generate and store L0 and L1 via LLM summarization.
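
A minimal sketch of what tier generation could look like at ingest time; buildTiers and the summarize callback are hypothetical names, not the existing crystal.ingest() internals. Deriving L0 from L1 (bottom-up, as in the OpenViking approach below) keeps the second call cheap:

```ts
interface ChunkTiers {
  l0Text: string; // ~100-token abstract
  l1Text: string; // ~1-2K-token overview
  l2Text: string; // original chunk, stored exactly as today
}

// Two extra LLM calls per chunk: the ingestion cost flagged in the caveats.
async function buildTiers(
  chunk: string,
  summarize: (text: string, maxTokens: number) => Promise<string>,
): Promise<ChunkTiers> {
  const l1Text = await summarize(chunk, 1500); // overview from the full chunk
  const l0Text = await summarize(l1Text, 100); // abstract from the overview, not from L2
  return { l0Text, l1Text, l2Text: chunk };
}
```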

Search: crystal_search returns L0 by default. New flag: --depth L1 or --depth L2 for deeper results.
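
One possible shape for the new parameter; the actual crystal_search tool definition in mcp-server.ts may differ, and the schema and textForDepth helper here are assumptions:

```ts
type Depth = "L0" | "L1" | "L2";

// Hypothetical input schema for the crystal_search MCP tool.
const crystalSearchInput = {
  type: "object",
  properties: {
    query: { type: "string" },
    depth: {
      type: "string",
      enum: ["L0", "L1", "L2"],
      default: "L0", // smallest useful level first
      description: "Resolution level to return for each result",
    },
  },
  required: ["query"],
} as const;

// In the search pipeline: pick the stored text matching the requested depth.
function textForDepth(
  row: { l0_text: string; l1_text: string; text: string },
  depth: Depth,
): string {
  return depth === "L0" ? row.l0_text : depth === "L1" ? row.l1_text : row.text;
}
```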

Agent workflow: Agent sees 5 L0 abstracts (~500 tokens total). Picks the 2 relevant ones. Loads L1 for those (~4K tokens). Digs into L2 for the one that matters (full chunk). Total: ~5K tokens instead of the ~25K for five full chunks.
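
The same flow as a sketch, with search/load/relevance passed in as stand-ins for whatever client the agent actually has (all signatures here are assumptions):

```ts
interface Hit {
  id: string;
  l0: string; // the ~100-token abstract
}

async function drillDown(
  query: string,
  search: (q: string, depth: "L0", limit: number) => Promise<Hit[]>,
  load: (id: string, depth: "L1" | "L2") => Promise<string>,
  isRelevant: (abstract: string) => boolean,
  pickBest: (overviews: string[]) => number,
): Promise<string> {
  const hits = await search(query, "L0", 5);                       // 5 x ~100 tokens
  const picked = hits.filter((h) => isRelevant(h.l0)).slice(0, 2); // keep the 2 relevant ones
  const overviews = await Promise.all(
    picked.map((h) => load(h.id, "L1")),                           // 2 x ~2K tokens
  );
  return load(picked[pickBest(overviews)].id, "L2");               // one full chunk
}
```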

What changes

  • memory-crystal-private/src/core.ts ... add l0_text and l1_text columns; generate them on ingest (see the migration sketch after this list)
  • memory-crystal-private/src/mcp-server.ts ... add a depth parameter to the crystal_search tool
  • memory-crystal-private/src/search-pipeline.ts ... return the requested depth level
  • memory-crystal-private/src/llm.ts ... add summarization prompts for L0/L1 generation
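
For the core.ts change, a sketch assuming a SQLite-backed chunks table with a text column (the actual schema may differ). Nullable columns leave existing rows valid, and a COALESCE fallback gives the "old chunks return L2" behavior from the caveats:

```ts
// Hypothetical migration; table and column names are guesses.
const MIGRATION = `
  ALTER TABLE chunks ADD COLUMN l0_text TEXT;
  ALTER TABLE chunks ADD COLUMN l1_text TEXT;
`;

// Depth selection with fallback: rows that were never backfilled
// (NULL l0_text/l1_text) fall through to the original text (L2).
const SELECT_AT_DEPTH = `
  SELECT COALESCE(
    CASE :depth WHEN 'L0' THEN l0_text WHEN 'L1' THEN l1_text END,
    text
  ) AS result
  FROM chunks
  WHERE id = :id
`;
```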

Caveats (read before implementing)

  • Massive LLM cost on ingestion. Every chunk now needs 2 LLM calls (generate L0 + L1) at ingest time. With 208K existing chunks, backfilling is expensive. For new chunks, it adds ~2 seconds per chunk. Are the token savings on search worth the ingestion cost?
  • Defer until the relay is real. The token savings matter most on iOS/cloud where context is tight. On desktop CLI, returning full chunks is fine. This is a relay optimization, not a local one. Don't build it until #163 (relay) is in progress.
  • L0/L1 quality depends on LLM quality. A bad summary is worse than no summary. If L0 says "discussion about architecture" and the agent skips it, we lost recall. Test L0 quality extensively before relying on it for filtering.
  • Simpler alternative: just truncate. Instead of LLM-generated summaries, return the first N tokens of each chunk as the "preview." No LLM cost. Worse quality, but maybe good enough. Test this first (see the sketch after this list).
  • Schema migration on 208K rows is non-trivial. Adding columns is fine. Backfilling L0/L1 for existing chunks is a multi-hour LLM job. Consider: only generate L0/L1 for new chunks; old chunks always return L2.
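
For the truncation alternative, the preview could be as small as this (the 4-characters-per-token ratio is a crude heuristic, not a real tokenizer):

```ts
// Return roughly the first maxTokens tokens of a chunk as its preview.
function truncatedPreview(chunk: string, maxTokens = 100): string {
  const approxChars = maxTokens * 4; // rough chars-per-token estimate
  if (chunk.length <= approxChars) return chunk;
  const slice = chunk.slice(0, approxChars);
  const lastSpace = slice.lastIndexOf(" ");
  // Cut at a word boundary so the preview doesn't end mid-word.
  return (lastSpace > 0 ? slice.slice(0, lastSpace) : slice) + " …";
}
```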

Inspiration

OpenViking's tiered context loading: L0 abstract (~100 tokens), L1 overview (~2K tokens), L2 full content. Bottom-up semantic generation. (Apache 2.0, Volcengine).

Related

  • #163 (Crystal relay ... this optimization matters most for relay/iOS)
  • Search pipeline: memory-crystal-private/src/search-pipeline.ts
  • Core: memory-crystal-private/src/core.ts
