
Add support for end to end in-memory QA evaluation#1865

Open
KyleZheng1284 wants to merge 3 commits into NVIDIA:main from KyleZheng1284:full-e2e-eval

Conversation

@KyleZheng1284
Member

Description

Added an in-memory path so you can run the full QA eval pipeline (ingest through scoring) in one command, without intermediate files, alongside the existing separable export+eval workflow.

Built on top of existing PR #1754.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If you adjusted docker-compose.yaml environment variables, ensure they are mirrored in the Helm values.yaml file.

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners April 15, 2026 23:48
@copy-pr-bot

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Apr 15, 2026

Greptile Summary

Adds an end-to-end in-memory QA evaluation path: FileRetriever._from_dict() enables in-process construction from LanceDB query results, graph_pipeline.py gains an --evaluation-mode qa branch with solid pre-flight validation, and cli.py exposes retriever eval run/export/build-page-index subcommands. The new path works correctly and the README documentation is thorough.

  • The new _from_dict factory and the from_lancedb in-memory path have no corresponding unit tests, despite being the core of the feature. Per the test-coverage-new-code rule these should be covered before merging.
  • _build_env_config (~97 lines) and run_cmd (~118 lines) both exceed the 50-line SRP limit; _build_env_config in particular handles four distinct concerns that would benefit from decomposition.

Confidence Score: 4/5

Safe to merge once the missing unit tests for the in-memory retriever path are added; remaining findings are style/documentation P2s.

One P1 finding (missing tests for new FileRetriever._from_dict / from_lancedb in-memory path per test-coverage-new-code rule) prevents a 5. Logic is correct, pre-flight validation is solid, and the previously raised concerns are addressed.

nemo_retriever/src/nemo_retriever/evaluation/retrievers.py — new _from_dict factory has no tests; nemo_retriever/src/nemo_retriever/evaluation/cli.py — _build_env_config and run_cmd are oversized.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nemo_retriever/src/nemo_retriever/evaluation/retrievers.py` | Adds `_from_dict()` private factory and updates `from_lancedb()` to use in-memory construction; logic is correct but `from_lancedb`'s docstring lacks a Returns section and the new private factory has no tests. |
| `nemo_retriever/src/nemo_retriever/evaluation/cli.py` | Adds `run`, `export`, and `build-page-index` Typer subcommands; `run_cmd` (~118 lines) and `_build_env_config` (~97 lines) both exceed the 50-line SRP limit, and there are no corresponding tests. |
| `nemo_retriever/src/nemo_retriever/evaluation/config.py` | Adds heterogeneous-judge warning in `_normalize_config`; logic is correct, and the previously flagged concern about silently applying only the first judge is an existing limitation documented here. |
| `nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py` | Adds pre-flight validation and QA eval path (`evaluation_mode == "qa"`); pre-flight guards for eval_config and file existence are solid, but the QA sweep result loop at lines 762-771 does not propagate failures via exit code. |
| `nemo_retriever/src/nemo_retriever/evaluation/README.md` | Comprehensive documentation for the new end-to-end in-memory QA path, including quick-reference commands and a comparison table; no code changes. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User CLI
    participant GP as graph_pipeline.py
    participant FR as FileRetriever
    participant LDB as LanceDB
    participant ES as run_eval_sweep

    U->>GP: --evaluation-mode qa --eval-config ... --query-csv ...
    GP->>GP: Pre-flight: validate eval_config & query_csv exist
    GP->>GP: Ingest PDFs → result_df
    GP->>GP: Build page index (in-memory from result_df OR from --page-index file)
    GP->>FR: from_lancedb(qa_pairs, lancedb_uri, page_index, save_path)
    FR->>LDB: query_lancedb(queries, embedder, top_k)
    LDB-->>FR: all_results dict
    FR->>FR: _from_dict(all_results) → in-memory FileRetriever
    opt save_path provided
        FR->>FR: write_retrieval_json(save_path)
    end
    FR-->>GP: FileRetriever instance
    GP->>FR: check_coverage(qa_pairs)
    FR-->>GP: coverage float
    GP->>ES: run_eval_sweep(qa_cfg, qa_pairs, retriever)
    ES-->>GP: sweep_results list
    GP->>GP: Log PASS/FAIL counts + metrics
```
Review comment (P1) on `nemo_retriever/src/nemo_retriever/evaluation/retrievers.py`, lines 92-125:

**No unit tests for the new in-memory construction path**

`_from_dict()` is the core factory for the in-memory QA path and is also called by the public `from_lancedb()`. Neither path has a corresponding test. If the normalisation logic in `_from_dict` diverges from `__init__` (e.g. a future attribute is added to `__init__` but forgotten in `_from_dict`), the divergence will only surface at runtime. The `test-coverage-new-code` rule requires tests for new business logic.

Suggested additions to `nemo_retriever/tests/test_retriever_queries.py` (or a new `test_qa_eval.py`):

```python
def test_from_dict_roundtrip():
    queries = {"What is X?": {"chunks": ["chunk1", "chunk2"], "metadata": []}}
    retriever = FileRetriever._from_dict(queries)
    assert retriever.file_path == "<in-memory>"
    result = retriever.retrieve("What is X?", top_k=5)
    assert result.chunks == ["chunk1", "chunk2"]

def test_from_dict_empty_raises():
    with pytest.raises(ValueError, match="empty"):
        FileRetriever._from_dict({})

def test_from_dict_missing_chunks_raises():
    with pytest.raises(ValueError):
        FileRetriever._from_dict({"Q": {"metadata": []}})
```

**Rule Used:** New functionality must include corresponding unit ... ([source](https://app.greptile.com/review/custom-context?memory=test-coverage-new-code))


---

Review comment on `nemo_retriever/src/nemo_retriever/evaluation/retrievers.py`, lines 127-181:

**`from_lancedb` docstring missing `Returns` section**

The public classmethod `from_lancedb` has Parameter docs but no `Returns` section, which is required by the `docstrings-public-interface` rule for public interfaces.

```suggestion
    ) -> "FileRetriever":
        """Query LanceDB in-memory, optionally save, return a FileRetriever.

        Reuses :func:`~nemo_retriever.export.query_lancedb` for batched
        vector search and :func:`~nemo_retriever.export.write_retrieval_json`
        for optional disk persistence.

        Parameters
        ----------
        qa_pairs : list[dict]
            Ground-truth pairs; each must have a ``"query"`` key.
        lancedb_uri : str
            Path to the LanceDB directory.
        lancedb_table : str
            LanceDB table name.
        embedder : str
            Embedding model name for query encoding.
        top_k : int
            Number of chunks to retrieve per query.
        page_index : dict, optional
            ``{source_id: {page_str: markdown}}``.  Enables full-page
            markdown expansion when provided.
        save_path : str, optional
            If set, also writes the retrieval JSON to this path so it
            can be reloaded later via ``FileRetriever(file_path=...)``.

        Returns
        -------
        FileRetriever
            In-memory retriever populated from LanceDB results.
            ``file_path`` is set to ``save_path`` when provided,
            otherwise ``"<in-memory>"``.
        """
```

**Rule Used:** Public modules, classes, and functions must have d... ([source](https://app.greptile.com/review/custom-context?memory=docstrings-public-interface))


---

Review comment on `nemo_retriever/src/nemo_retriever/evaluation/cli.py`, lines 123-219:

**Functions exceed 50-line SRP limit**

`_build_env_config` is ~97 lines and `run_cmd` below is ~118 lines — both well over the 50-line guideline. `_build_env_config` handles env parsing, model list parsing, key resolution, and config assembly: four distinct responsibilities. Consider splitting it into helpers like `_parse_gen_model_pairs(gen_models_str, ...)`, `_build_retrieval_block(...)`, and `_build_models_and_evals(...)` so each piece can be tested and read independently.

**Rule Used:** Functions and classes should have a single, well-d... ([source](https://app.greptile.com/review/custom-context?memory=single-responsibility))


Reviews (4). Last reviewed commit: "Add pre-flight validation and lancedb gu..."

Comment on lines +698 to +707 of `nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py`:

```python
passed = sum(1 for r in sweep_results if r["status"] == "PASS")
logger.info("QA sweep complete: %d/%d passed", passed, len(sweep_results))
for r in sweep_results:
    if r["status"] == "PASS":
        out = Path(r["output_path"]).resolve()
        logger.info("Results: %s", out)
    er = r.get("eval_results", {})
    judge_scores = er.get("tier3_llm_judge", {})
    for gen_name, stats in judge_scores.items():
        evaluation_metrics[f"{gen_name} ({r['label']})"] = float(stats.get("mean_score", 0.0))
```

**P1: QA sweep failures do not affect process exit code**

When `evaluation_mode == "qa"`, all sweep `FAIL` results are logged but the script exits with code 0. The `cli.py` counterpart raises `Exit(code=1)` when any run fails (`if passed < len(results): raise typer.Exit(code=1)`). Any CI job using `--evaluation-mode qa` for end-to-end validation will report success even if every evaluation run failed (e.g., missing API key, unreachable model).

Suggested change:

```python
passed = sum(1 for r in sweep_results if r["status"] == "PASS")
logger.info("QA sweep complete: %d/%d passed", passed, len(sweep_results))
for r in sweep_results:
    if r["status"] == "PASS":
        out = Path(r["output_path"]).resolve()
        logger.info("Results: %s", out)
    else:
        logger.error("QA run FAILED: %s: %s", r.get("label", ""), r.get("error", ""))
    er = r.get("eval_results", {})
    judge_scores = er.get("tier3_llm_judge", {})
    for gen_name, stats in judge_scores.items():
        evaluation_metrics[f"{gen_name} ({r['label']})"] = float(stats.get("mean_score", 0.0))
if passed < len(sweep_results):
    raise ValueError(
        f"QA evaluation failed: {passed}/{len(sweep_results)} runs passed. "
        "Check the logs above for details."
    )
```

Comment on line 154 of `nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py`:

```python
# _client and _model_name are captured via default args to avoid
# the late-binding closure bug across loop iterations. self.judge
# is safe to access directly since it does not change per-iteration.
def _process_row(row_tuple, row_aic, _client=client, _model_name=model_name):
```

**P2: Captured default arg `_model_name` is never used inside `_process_row`**

`_model_name` is bound via a default argument to avoid the late-binding closure bug, but the function body never references it. Only `_client` is actually needed. This is dead code that adds noise without guarding anything.

Suggested change:

```python
def _process_row(row_tuple, row_aic, _client=client):
```
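The late-binding behavior this comment refers to is general Python semantics, not specific to this codebase; a self-contained illustration of the bug and the default-argument fix:

```python
# Closures capture variables, not values: every closure created in a loop
# sees the loop variable's *final* value unless it is pinned at definition
# time, e.g. via a default argument.

def make_late_bound():
    funcs = []
    for i in range(3):
        funcs.append(lambda: i)  # all three lambdas share the same `i`
    return [f() for f in funcs]

def make_early_bound():
    funcs = []
    for i in range(3):
        funcs.append(lambda _i=i: _i)  # default arg evaluates `i` right now
    return [f() for f in funcs]

print(make_late_bound())   # [2, 2, 2]
print(make_early_bound())  # [0, 1, 2]
```

This is why `_client=client` is kept in the suggestion above: it pins the value at definition time. `_model_name` gets the same pinning but is never read, so it can be dropped.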

Comment on lines +363 to +396 of `nemo_retriever/src/nemo_retriever/io/markdown.py`:

```python
if dataframe is not None:
    df = dataframe
else:
    parquet_path = Path(parquet_dir)
    if not parquet_path.is_dir():
        raise FileNotFoundError(f"Parquet directory not found: {parquet_path}")
    single = parquet_path / "extraction.parquet"
    if single.is_file():
        parquet_files = [single]
    else:
        parquet_files = sorted(f for f in parquet_path.rglob("*.parquet") if f.name != "extraction.parquet")
    if not parquet_files:
        raise FileNotFoundError(f"No .parquet files in {parquet_path}")

    dfs = [_read_parquet_for_markdown(f) for f in parquet_files]
    df = pd.concat(dfs, ignore_index=True)

path_col = "path" if "path" in df.columns else "source_id"
if path_col not in df.columns:
    raise KeyError(f"Neither 'path' nor 'source_id' found in columns: {list(df.columns)}")

list_keys = ("tables", "table", "charts", "chart", "infographics", "infographic")

docs_grouped: dict[str, list[dict]] = defaultdict(list)
for _, row in df.iterrows():
    source = str(row.get(path_col, ""))
    if not source:
        continue
    record = row.to_dict()
    for key in list_keys:
        val = record.get(key)
        if isinstance(val, np.ndarray):
            record[key] = val.tolist()
    docs_grouped[source].append(record)
```

**P2: Large-column filtering not applied on the `dataframe=` path**

`_read_parquet_for_markdown` explicitly restricts columns to `_MARKDOWN_PARQUET_COLUMNS` and its docstring says this "avoids multi-GB memory spikes" from embedding vectors and page images. When `build_page_index` is called with `dataframe=result_df` from `graph_pipeline.py`, the full ingestion DataFrame (with embeddings, base64 page images, etc.) flows through unchanged — every `row.to_dict()` creates a record dict carrying all columns.

For the 767-PDF use case (~84k rows), applying the same column selection before iterating would keep memory consistent with the Parquet path:

```python
if dataframe is not None:
    available = set(dataframe.columns)
    cols = sorted(_MARKDOWN_PARQUET_COLUMNS & available)
    df = dataframe[cols] if cols else dataframe
else:
    ...
```
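The memory point generalizes to any pandas pipeline: selecting columns before row iteration means wide payload columns never enter the per-row dicts. A standalone sketch with a made-up column set (not the real `_MARKDOWN_PARQUET_COLUMNS`):

```python
import pandas as pd

# Illustrative keep-set; the real module defines _MARKDOWN_PARQUET_COLUMNS.
KEEP_COLUMNS = {"path", "page_number", "text"}

def select_index_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Intersect with what is actually present so missing columns
    # do not raise; fall back to the full frame if nothing matches.
    available = set(df.columns)
    cols = sorted(KEEP_COLUMNS & available)
    return df[cols] if cols else df

df = pd.DataFrame({
    "path": ["a.pdf"],
    "text": ["hello"],
    "embedding": [[0.1] * 1024],  # wide payload column we want to drop
})
slim = select_index_columns(df)
# slim has only "path" and "text"; the embedding column (and any base64
# page images in the real pipeline) never reach row.to_dict().
```

Because `df[cols]` returns a view-like selection, the per-row dicts built afterwards carry only the narrow columns, which is what keeps the `dataframe=` path consistent with the Parquet path.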

Comment on lines +146 to +157 of `nemo_retriever/src/nemo_retriever/evaluation/config.py`:

```python
        "clients for heterogeneous judges.",
        len(distinct_judges),
        sorted(distinct_judges),
        first_judge_key,
    )
    config.setdefault("judge", models[first_judge_key])

elif "generators" in config and "judge" in config:
    generators = config["generators"]
    judge_cfg = config["judge"]
    judge_name = judge_cfg.get("name", "judge")
    config.setdefault("models", {})
```

**P2: Heterogeneous judges silently collapsed to first judge for `build_eval_chain`/`build_eval_pipeline`**

When the new `models`/`evaluations` format specifies different judges for different evaluations, `config["judge"]` is set to only the first evaluation's judge. `run_eval_sweep` is unaffected (it reads `models[judge_name]` per evaluation), but `build_eval_chain` and `build_eval_pipeline` — which both read `config["judge"]` — will silently apply only the first judge across all evaluations. The `logger.warning` fires, but the result is wrong rather than an error.

Consider raising a `ValueError` instead of a warning when `build_eval_chain`/`build_eval_pipeline` are called with a config that has multiple distinct judges, so callers know to switch to `run_eval_sweep`.
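A minimal sketch of the suggested guard, assuming the `models`/`evaluations` config shape described above; the function name and dict keys are illustrative, not the actual `config.py` implementation:

```python
def require_single_judge(config: dict) -> dict:
    """Fail loudly when a single-judge entry point gets a multi-judge config.

    build_eval_chain/build_eval_pipeline can only honor one judge, so a
    config declaring several distinct judges should be rejected rather
    than silently collapsed to the first one.
    """
    evaluations = config.get("evaluations", [])
    distinct_judges = {e["judge"] for e in evaluations if "judge" in e}
    if len(distinct_judges) > 1:
        raise ValueError(
            f"Config declares {len(distinct_judges)} distinct judges "
            f"({sorted(distinct_judges)}), but this entry point supports "
            "exactly one. Use run_eval_sweep for per-evaluation judges."
        )
    return config
```

Calling this at the top of both single-judge entry points would turn the current silent collapse into an actionable error.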


Comment on line 729 of `nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py`:

```python
else:
    logger.info("Built page index: %d documents", len(page_idx))

qa_cfg = load_eval_config(str(eval_config))
```

**P1: `eval_config=None` produces confusing error**

When `--evaluation-mode qa` is used without `--eval-config`, `eval_config` is `None` (its default). `str(None)` is `"None"`, so `load_eval_config("None")` runs and raises `FileNotFoundError: Config file not found: None`. The user has no idea they need to supply `--eval-config`.

Add a guard before line 729:

```python
if eval_config is None:
    logger.error("--eval-config is required when --evaluation-mode=qa")
    raise typer.Exit(code=1)
qa_cfg = load_eval_config(str(eval_config))
```

**Rule Used:** Error messages must tell the user what happened AN... ([source](https://app.greptile.com/review/custom-context?memory=actionable-errors))


Kyle Zheng added 2 commits April 16, 2026 20:56
The eval CLI (cli.py) already aborts when retrieval coverage falls
below the configured min_coverage.  The graph_pipeline QA path
logged coverage but never enforced the threshold, creating
inconsistent behavior between the two entry points.

Made-with: Cursor
- build_eval_chain() now raises a clear ValueError when
  retrieval.type is not 'file', matching build_eval_pipeline().
- graph_pipeline validates eval_config and query_csv file existence
  before starting ingestion so a typo does not waste a full run.

Made-with: Cursor
Comment on lines +92 to +125 of `nemo_retriever/src/nemo_retriever/evaluation/retrievers.py`:

```python
@classmethod
def _from_dict(cls, queries: dict[str, dict]) -> "FileRetriever":
    """Build a FileRetriever from an in-memory queries dict.

    Bypasses file I/O while reusing the same normalized index that
    ``__init__`` builds from JSON. All instance methods (``retrieve``,
    ``check_coverage``) work identically afterwards.

    Parameters
    ----------
    queries : dict
        ``{query_text: {"chunks": [...], "metadata": [...]}}`` --
        the same shape as the ``"queries"`` value in a retrieval JSON.
    """
    if not queries:
        raise ValueError("FileRetriever._from_dict: queries dict is empty")
    sample = next(iter(queries.values()), {})
    if not isinstance(sample.get("chunks"), list):
        raise ValueError(
            "FileRetriever._from_dict: first entry is missing a 'chunks' list. "
            'Expected: {"query": {"chunks": ["..."]}}'
        )

    instance = object.__new__(cls)
    instance.file_path = "<in-memory>"
    instance._norm_index = {}
    instance._raw_keys = {}
    instance._miss_count = 0
    instance._miss_lock = threading.Lock()
    for raw_key, value in queries.items():
        norm = _normalize_query(raw_key)
        instance._norm_index[norm] = value
        instance._raw_keys[norm] = raw_key
    return instance
```
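The `object.__new__` trick in `_from_dict` is a general alternate-constructor pattern: allocate the instance without running `__init__`, then set every attribute `__init__` would have set. A generic, self-contained illustration (not code from this PR):

```python
import threading

class Record:
    """Normally built by parsing a file; _from_parts skips the parsing."""

    def __init__(self, path: str):
        # Imagine expensive file I/O and parsing happening here.
        self.path = path
        self.items = {}
        self._lock = threading.Lock()

    @classmethod
    def _from_parts(cls, items: dict) -> "Record":
        # object.__new__ allocates the instance without calling __init__,
        # so no file path is required. Every attribute __init__ assigns
        # must be assigned here too, or later code will hit AttributeError
        # -- which is exactly the divergence risk the test comment flags.
        instance = object.__new__(cls)
        instance.path = "<in-memory>"
        instance.items = dict(items)
        instance._lock = threading.Lock()
        return instance

r = Record._from_parts({"q": ["chunk1"]})
```

Because the attribute lists in `__init__` and the factory must stay in lockstep by hand, a round-trip unit test (as suggested in the review comment above) is the cheapest way to catch drift.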
