Add support for end to end in-memory QA evaluation #1865
KyleZheng1284 wants to merge 3 commits into NVIDIA:main from
Conversation
Greptile Summary
Adds an end-to-end in-memory QA evaluation path:
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/evaluation/retrievers.py | Adds _from_dict() private factory and updates from_lancedb() to use in-memory construction; logic is correct but from_lancedb's docstring lacks a Returns section and the new private factory has no tests. |
| nemo_retriever/src/nemo_retriever/evaluation/cli.py | Adds run, export, and build-page-index Typer subcommands; run_cmd (~118 lines) and _build_env_config (~97 lines) both exceed the 50-line SRP limit, and there are no corresponding tests. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Adds heterogeneous-judge warning in _normalize_config; logic is correct and the previously-flagged concern about silently applying only the first judge is an existing limitation documented here. |
| nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py | Adds pre-flight validation and QA eval path (evaluation_mode == "qa"); pre-flight guards for eval_config and file existence are solid, but the QA sweep result loop at lines 762-771 does not propagate failures via exit code. |
| nemo_retriever/src/nemo_retriever/evaluation/README.md | Comprehensive documentation for the new end-to-end in-memory QA path, including quick-reference commands and comparison table; no code changes. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant U as User CLI
    participant GP as graph_pipeline.py
    participant FR as FileRetriever
    participant LDB as LanceDB
    participant ES as run_eval_sweep
    U->>GP: --evaluation-mode qa --eval-config ... --query-csv ...
    GP->>GP: Pre-flight: validate eval_config & query_csv exist
    GP->>GP: Ingest PDFs → result_df
    GP->>GP: Build page index (in-memory from result_df OR from --page-index file)
    GP->>FR: from_lancedb(qa_pairs, lancedb_uri, page_index, save_path)
    FR->>LDB: query_lancedb(queries, embedder, top_k)
    LDB-->>FR: all_results dict
    FR->>FR: _from_dict(all_results) → in-memory FileRetriever
    opt save_path provided
        FR->>FR: write_retrieval_json(save_path)
    end
    FR-->>GP: FileRetriever instance
    GP->>FR: check_coverage(qa_pairs)
    FR-->>GP: coverage float
    GP->>ES: run_eval_sweep(qa_cfg, qa_pairs, retriever)
    ES-->>GP: sweep_results list
    GP->>GP: Log PASS/FAIL counts + metrics
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/retrievers.py
Line: 92-125
Comment:
**No unit tests for the new in-memory construction path**
`_from_dict()` is the core factory for the in-memory QA path and is also called by the public `from_lancedb()`. Neither path has a corresponding test. If the normalisation logic in `_from_dict` diverges from `__init__` (e.g. a future attribute is added to `__init__` but forgotten in `_from_dict`), the divergence will only surface at runtime. The `test-coverage-new-code` rule requires tests for new business logic.
Suggested additions to `nemo_retriever/tests/test_retriever_queries.py` (or a new `test_qa_eval.py`):
```python
def test_from_dict_roundtrip():
    queries = {"What is X?": {"chunks": ["chunk1", "chunk2"], "metadata": []}}
    retriever = FileRetriever._from_dict(queries)
    assert retriever.file_path == "<in-memory>"
    result = retriever.retrieve("What is X?", top_k=5)
    assert result.chunks == ["chunk1", "chunk2"]


def test_from_dict_empty_raises():
    with pytest.raises(ValueError, match="empty"):
        FileRetriever._from_dict({})


def test_from_dict_missing_chunks_raises():
    with pytest.raises(ValueError):
        FileRetriever._from_dict({"Q": {"metadata": []}})
```
**Rule Used:** New functionality must include corresponding unit ... ([source](https://app.greptile.com/review/custom-context?memory=test-coverage-new-code))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/retrievers.py
Line: 127-181
Comment:
**`from_lancedb` docstring missing `Returns` section**
The public classmethod `from_lancedb` has Parameter docs but no `Returns` section, which is required by the `docstrings-public-interface` rule for public interfaces.
```suggestion
) -> "FileRetriever":
    """Query LanceDB in-memory, optionally save, return a FileRetriever.

    Reuses :func:`~nemo_retriever.export.query_lancedb` for batched
    vector search and :func:`~nemo_retriever.export.write_retrieval_json`
    for optional disk persistence.

    Parameters
    ----------
    qa_pairs : list[dict]
        Ground-truth pairs; each must have a ``"query"`` key.
    lancedb_uri : str
        Path to the LanceDB directory.
    lancedb_table : str
        LanceDB table name.
    embedder : str
        Embedding model name for query encoding.
    top_k : int
        Number of chunks to retrieve per query.
    page_index : dict, optional
        ``{source_id: {page_str: markdown}}``. Enables full-page
        markdown expansion when provided.
    save_path : str, optional
        If set, also writes the retrieval JSON to this path so it
        can be reloaded later via ``FileRetriever(file_path=...)``.

    Returns
    -------
    FileRetriever
        In-memory retriever populated from LanceDB results.
        ``file_path`` is set to ``save_path`` when provided,
        otherwise ``"<in-memory>"``.
    """
```
**Rule Used:** Public modules, classes, and functions must have d... ([source](https://app.greptile.com/review/custom-context?memory=docstrings-public-interface))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/cli.py
Line: 123-219
Comment:
**Functions exceed 50-line SRP limit**
`_build_env_config` is ~97 lines and `run_cmd` below is ~118 lines — both well over the 50-line guideline. `_build_env_config` handles env parsing, model list parsing, key resolution, and config assembly: four distinct responsibilities. Consider splitting it into helpers like `_parse_gen_model_pairs(gen_models_str, ...)`, `_build_retrieval_block(...)`, and `_build_models_and_evals(...)` so each piece can be tested and read independently.
**Rule Used:** Functions and classes should have a single, well-d... ([source](https://app.greptile.com/review/custom-context?memory=single-responsibility))
How can I resolve this? If you propose a fix, please make it concise.

Reviews (4): Last reviewed commit: "Add pre-flight validation and lancedb gu..."
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py
Line: 698-707
Comment:
**QA sweep failures do not affect process exit code**
When `evaluation_mode == "qa"`, all sweep `FAIL` results are logged but the script exits with code 0. The `cli.py` counterpart raises `Exit(code=1)` when any run fails (`if passed < len(results): raise typer.Exit(code=1)`). Any CI job using `--evaluation-mode qa` for end-to-end validation will report success even if every evaluation run failed (e.g., missing API key, unreachable model).
```suggestion
passed = sum(1 for r in sweep_results if r["status"] == "PASS")
logger.info("QA sweep complete: %d/%d passed", passed, len(sweep_results))
for r in sweep_results:
    if r["status"] == "PASS":
        out = Path(r["output_path"]).resolve()
        logger.info("Results: %s", out)
    else:
        logger.error("QA run FAILED: %s: %s", r.get("label", ""), r.get("error", ""))
    er = r.get("eval_results", {})
    judge_scores = er.get("tier3_llm_judge", {})
    for gen_name, stats in judge_scores.items():
        evaluation_metrics[f"{gen_name} ({r['label']})"] = float(stats.get("mean_score", 0.0))
if passed < len(sweep_results):
    raise ValueError(
        f"QA evaluation failed: {passed}/{len(sweep_results)} runs passed. "
        "Check the logs above for details."
    )
```
How can I resolve this? If you propose a fix, please make it concise.

---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py
Line: 154
Comment:
**Captured default arg `_model_name` is never used inside `_process_row`**
`_model_name` is bound via a default argument to avoid the late-binding closure bug, but the function body never references it. Only `_client` is actually needed. This is dead code that adds noise without guarding anything.
```suggestion
def _process_row(row_tuple, row_aic, _client=client):
```
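The late-binding behavior this idiom guards against is easy to reproduce in isolation; the snippet below is illustrative and uses no names from the codebase.

```python
# Buggy: each lambda closes over the loop variable i, which all three
# closures share; by call time the loop has finished and i == 2.
buggy = [lambda: i for i in range(3)]

# Fixed: a default argument is evaluated at definition time, so each
# lambda captures the value i held on its own iteration -- the same
# idiom as `_client=client` in `_process_row`.
fixed = [lambda _i=i: _i for i in range(3)]

print([f() for f in buggy])  # [2, 2, 2]
print([f() for f in fixed])  # [0, 1, 2]
```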
How can I resolve this? If you propose a fix, please make it concise.

---

Quoted code under review, `nemo_retriever/src/nemo_retriever/io/markdown.py` (`build_page_index`, lines 363-396), cleaned from the diff view:

```python
if dataframe is not None:
    df = dataframe
else:
    parquet_path = Path(parquet_dir)
    if not parquet_path.is_dir():
        raise FileNotFoundError(f"Parquet directory not found: {parquet_path}")
    single = parquet_path / "extraction.parquet"
    if single.is_file():
        parquet_files = [single]
    else:
        parquet_files = sorted(f for f in parquet_path.rglob("*.parquet") if f.name != "extraction.parquet")
    if not parquet_files:
        raise FileNotFoundError(f"No .parquet files in {parquet_path}")
    dfs = [_read_parquet_for_markdown(f) for f in parquet_files]
    df = pd.concat(dfs, ignore_index=True)

path_col = "path" if "path" in df.columns else "source_id"
if path_col not in df.columns:
    raise KeyError(f"Neither 'path' nor 'source_id' found in columns: {list(df.columns)}")

list_keys = ("tables", "table", "charts", "chart", "infographics", "infographic")

docs_grouped: dict[str, list[dict]] = defaultdict(list)
for _, row in df.iterrows():
    source = str(row.get(path_col, ""))
    if not source:
        continue
    record = row.to_dict()
    for key in list_keys:
        val = record.get(key)
        if isinstance(val, np.ndarray):
            record[key] = val.tolist()
    docs_grouped[source].append(record)
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/io/markdown.py
Line: 363-396
Comment:
**Large-column filtering not applied on the `dataframe=` path**
`_read_parquet_for_markdown` explicitly restricts columns to `_MARKDOWN_PARQUET_COLUMNS` and its docstring says this "avoids multi-GB memory spikes" from embedding vectors and page images. When `build_page_index` is called with `dataframe=result_df` from `graph_pipeline.py`, the full ingestion DataFrame (with embeddings, base64 page images, etc.) flows through unchanged — every `row.to_dict()` creates a record dict carrying all columns.
For the 767-PDF use case (~84k rows), applying the same column selection before iterating would keep memory consistent with the Parquet path:
```python
if dataframe is not None:
available = set(dataframe.columns)
cols = sorted(_MARKDOWN_PARQUET_COLUMNS & available)
df = dataframe[cols] if cols else dataframe
else:
...
```
How can I resolve this? If you propose a fix, please make it concise.| "clients for heterogeneous judges.", | ||
| len(distinct_judges), | ||
| sorted(distinct_judges), | ||
| first_judge_key, | ||
| ) | ||
| config.setdefault("judge", models[first_judge_key]) | ||
|
|
||
| elif "generators" in config and "judge" in config: | ||
| generators = config["generators"] | ||
| judge_cfg = config["judge"] | ||
| judge_name = judge_cfg.get("name", "judge") | ||
| config.setdefault("models", {}) |
There was a problem hiding this comment.
Heterogeneous judges silently collapsed to first judge for
build_eval_chain/build_eval_pipeline
When the new models/evaluations format specifies different judges for different evaluations, config["judge"] is set to only the first evaluation's judge. run_eval_sweep is unaffected (it reads models[judge_name] per evaluation), but build_eval_chain and build_eval_pipeline — which both read config["judge"] — will silently apply only the first judge across all evaluations. The logger.warning fires, but the result is wrong rather than an error.
Consider raising a ValueError instead of a warning when build_eval_chain/build_eval_pipeline are called with a config that has multiple distinct judges, so callers know to switch to run_eval_sweep.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/config.py
Line: 146-157
Comment:
**Heterogeneous judges silently collapsed to first judge for `build_eval_chain`/`build_eval_pipeline`**
When the new `models`/`evaluations` format specifies different judges for different evaluations, `config["judge"]` is set to only the first evaluation's judge. `run_eval_sweep` is unaffected (it reads `models[judge_name]` per evaluation), but `build_eval_chain` and `build_eval_pipeline` — which both read `config["judge"]` — will silently apply only the first judge across all evaluations. The `logger.warning` fires, but the result is wrong rather than an error.
Consider raising a `ValueError` instead of a warning when `build_eval_chain`/`build_eval_pipeline` are called with a config that has multiple distinct judges, so callers know to switch to `run_eval_sweep`.
How can I resolve this? If you propose a fix, please make it concise.

7ce7eba to 37a4d53
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py
Line: 729
Comment:
**`eval_config=None` produces confusing error**
When `--evaluation-mode qa` is used without `--eval-config`, `eval_config` is `None` (its default). `str(None)` is `"None"`, so `load_eval_config("None")` runs and raises `FileNotFoundError: Config file not found: None`. The user has no idea they need to supply `--eval-config`.
Add a guard before line 729:
```suggestion
if eval_config is None:
logger.error("--eval-config is required when --evaluation-mode=qa")
raise typer.Exit(code=1)
qa_cfg = load_eval_config(str(eval_config))
```
**Rule Used:** Error messages must tell the user what happened AN... ([source](https://app.greptile.com/review/custom-context?memory=actionable-errors))
How can I resolve this? If you propose a fix, please make it concise.

---

The eval CLI (cli.py) already aborts when retrieval coverage falls below the configured min_coverage. The graph_pipeline QA path logged coverage but never enforced the threshold, creating inconsistent behavior between the two entry points.

Made-with: Cursor
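The threshold enforcement that commit describes could reduce to a guard like the one below; the function name and call site are hypothetical, and only the min_coverage semantics come from the commit message.

```python
def enforce_min_coverage(coverage: float, min_coverage: float) -> None:
    """Abort before evaluation when retrieval coverage is too low.

    Hypothetical helper illustrating the check; graph_pipeline.py may
    wire the threshold differently.
    """
    if coverage < min_coverage:
        raise ValueError(
            f"Retrieval coverage {coverage:.1%} is below the configured "
            f"minimum {min_coverage:.1%}; aborting before evaluation."
        )
```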
- build_eval_chain() now raises a clear ValueError when retrieval.type is not 'file', matching build_eval_pipeline().
- graph_pipeline validates eval_config and query_csv file existence before starting ingestion so a typo does not waste a full run.

Made-with: Cursor
Quoted code under review, `nemo_retriever/src/nemo_retriever/evaluation/retrievers.py` (`FileRetriever._from_dict`, lines 92-125), cleaned from the diff view:

```python
@classmethod
def _from_dict(cls, queries: dict[str, dict]) -> "FileRetriever":
    """Build a FileRetriever from an in-memory queries dict.

    Bypasses file I/O while reusing the same normalized index that
    ``__init__`` builds from JSON. All instance methods (``retrieve``,
    ``check_coverage``) work identically afterwards.

    Parameters
    ----------
    queries : dict
        ``{query_text: {"chunks": [...], "metadata": [...]}}`` --
        the same shape as the ``"queries"`` value in a retrieval JSON.
    """
    if not queries:
        raise ValueError("FileRetriever._from_dict: queries dict is empty")
    sample = next(iter(queries.values()), {})
    if not isinstance(sample.get("chunks"), list):
        raise ValueError(
            "FileRetriever._from_dict: first entry is missing a 'chunks' list. "
            'Expected: {"query": {"chunks": ["..."]}}'
        )

    instance = object.__new__(cls)
    instance.file_path = "<in-memory>"
    instance._norm_index = {}
    instance._raw_keys = {}
    instance._miss_count = 0
    instance._miss_lock = threading.Lock()
    for raw_key, value in queries.items():
        norm = _normalize_query(raw_key)
        instance._norm_index[norm] = value
        instance._raw_keys[norm] = raw_key
    return instance
```
Description
Added an in-memory path so you can run the full QA eval pipeline (ingest through scoring) in one command without intermediate files, alongside the existing separable export + eval workflow.
Built off of this existing PR #1754.
Checklist