Add support for end to end in-memory QA evaluation #1865
KyleZheng1284 wants to merge 3 commits into NVIDIA:main from
Conversation
Greptile Summary
Adds an end-to-end in-memory QA evaluation path:
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/evaluation/retrievers.py | Adds _from_dict() private factory and updates from_lancedb() to use in-memory construction; logic is correct but from_lancedb's docstring lacks a Returns section and the new private factory has no tests. |
| nemo_retriever/src/nemo_retriever/evaluation/cli.py | Adds run, export, and build-page-index Typer subcommands; run_cmd (~118 lines) and _build_env_config (~97 lines) both exceed the 50-line SRP limit, and there are no corresponding tests. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Adds heterogeneous-judge warning in _normalize_config; logic is correct and the previously-flagged concern about silently applying only the first judge is an existing limitation documented here. |
| nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py | Adds pre-flight validation and QA eval path (evaluation_mode == "qa"); pre-flight guards for eval_config and file existence are solid, but the QA sweep result loop at lines 762-771 does not propagate failures via exit code. |
| nemo_retriever/src/nemo_retriever/evaluation/README.md | Comprehensive documentation for the new end-to-end in-memory QA path, including quick-reference commands and comparison table; no code changes. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant U as User CLI
    participant GP as graph_pipeline.py
    participant FR as FileRetriever
    participant LDB as LanceDB
    participant ES as run_eval_sweep
    U->>GP: --evaluation-mode qa --eval-config ... --query-csv ...
    GP->>GP: Pre-flight: validate eval_config & query_csv exist
    GP->>GP: Ingest PDFs → result_df
    GP->>GP: Build page index (in-memory from result_df OR from --page-index file)
    GP->>FR: from_lancedb(qa_pairs, lancedb_uri, page_index, save_path)
    FR->>LDB: query_lancedb(queries, embedder, top_k)
    LDB-->>FR: all_results dict
    FR->>FR: _from_dict(all_results) → in-memory FileRetriever
    opt save_path provided
        FR->>FR: write_retrieval_json(save_path)
    end
    FR-->>GP: FileRetriever instance
    GP->>FR: check_coverage(qa_pairs)
    FR-->>GP: coverage float
    GP->>ES: run_eval_sweep(qa_cfg, qa_pairs, retriever)
    ES-->>GP: sweep_results list
    GP->>GP: Log PASS/FAIL counts + metrics
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/retrievers.py
Line: 92-125
Comment:
**No unit tests for the new in-memory construction path**
`_from_dict()` is the core factory for the in-memory QA path and is also called by the public `from_lancedb()`. Neither path has a corresponding test. If the normalisation logic in `_from_dict` diverges from `__init__` (e.g. a future attribute is added to `__init__` but forgotten in `_from_dict`), the divergence will only surface at runtime. The `test-coverage-new-code` rule requires tests for new business logic.
Suggested additions to `nemo_retriever/tests/test_retriever_queries.py` (or a new `test_qa_eval.py`):
```python
def test_from_dict_roundtrip():
    queries = {"What is X?": {"chunks": ["chunk1", "chunk2"], "metadata": []}}
    retriever = FileRetriever._from_dict(queries)
    assert retriever.file_path == "<in-memory>"
    result = retriever.retrieve("What is X?", top_k=5)
    assert result.chunks == ["chunk1", "chunk2"]


def test_from_dict_empty_raises():
    with pytest.raises(ValueError, match="empty"):
        FileRetriever._from_dict({})


def test_from_dict_missing_chunks_raises():
    with pytest.raises(ValueError):
        FileRetriever._from_dict({"Q": {"metadata": []}})
```
**Rule Used:** New functionality must include corresponding unit ... ([source](https://app.greptile.com/review/custom-context?memory=test-coverage-new-code))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/retrievers.py
Line: 127-181
Comment:
**`from_lancedb` docstring missing `Returns` section**
The public classmethod `from_lancedb` has Parameter docs but no `Returns` section, which is required by the `docstrings-public-interface` rule for public interfaces.
```suggestion
) -> "FileRetriever":
    """Query LanceDB in-memory, optionally save, return a FileRetriever.

    Reuses :func:`~nemo_retriever.export.query_lancedb` for batched
    vector search and :func:`~nemo_retriever.export.write_retrieval_json`
    for optional disk persistence.

    Parameters
    ----------
    qa_pairs : list[dict]
        Ground-truth pairs; each must have a ``"query"`` key.
    lancedb_uri : str
        Path to the LanceDB directory.
    lancedb_table : str
        LanceDB table name.
    embedder : str
        Embedding model name for query encoding.
    top_k : int
        Number of chunks to retrieve per query.
    page_index : dict, optional
        ``{source_id: {page_str: markdown}}``. Enables full-page
        markdown expansion when provided.
    save_path : str, optional
        If set, also writes the retrieval JSON to this path so it
        can be reloaded later via ``FileRetriever(file_path=...)``.

    Returns
    -------
    FileRetriever
        In-memory retriever populated from LanceDB results.
        ``file_path`` is set to ``save_path`` when provided,
        otherwise ``"<in-memory>"``.
    """
```
**Rule Used:** Public modules, classes, and functions must have d... ([source](https://app.greptile.com/review/custom-context?memory=docstrings-public-interface))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/cli.py
Line: 123-219
Comment:
**Functions exceed 50-line SRP limit**
`_build_env_config` is ~97 lines and `run_cmd` below is ~118 lines — both well over the 50-line guideline. `_build_env_config` handles env parsing, model list parsing, key resolution, and config assembly: four distinct responsibilities. Consider splitting it into helpers like `_parse_gen_model_pairs(gen_models_str, ...)`, `_build_retrieval_block(...)`, and `_build_models_and_evals(...)` so each piece can be tested and read independently.
**Rule Used:** Functions and classes should have a single, well-d... ([source](https://app.greptile.com/review/custom-context?memory=single-responsibility))
How can I resolve this? If you propose a fix, please make it concise.

Reviews (4): Last reviewed commit: "Add pre-flight validation and lancedb gu..."
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py
Line: 698-707
Comment:
**QA sweep failures do not affect process exit code**
When `evaluation_mode == "qa"`, all sweep `FAIL` results are logged but the script exits with code 0. The `cli.py` counterpart raises `Exit(code=1)` when any run fails (`if passed < len(results): raise typer.Exit(code=1)`). Any CI job using `--evaluation-mode qa` for end-to-end validation will report success even if every evaluation run failed (e.g., missing API key, unreachable model).
```suggestion
passed = sum(1 for r in sweep_results if r["status"] == "PASS")
logger.info("QA sweep complete: %d/%d passed", passed, len(sweep_results))
for r in sweep_results:
    if r["status"] == "PASS":
        out = Path(r["output_path"]).resolve()
        logger.info("Results: %s", out)
    else:
        logger.error("QA run FAILED: %s: %s", r.get("label", ""), r.get("error", ""))
    er = r.get("eval_results", {})
    judge_scores = er.get("tier3_llm_judge", {})
    for gen_name, stats in judge_scores.items():
        evaluation_metrics[f"{gen_name} ({r['label']})"] = float(stats.get("mean_score", 0.0))
if passed < len(sweep_results):
    raise ValueError(
        f"QA evaluation failed: {passed}/{len(sweep_results)} runs passed. "
        "Check the logs above for details."
    )
```
How can I resolve this? If you propose a fix, please make it concise.

---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py
Line: 154
Comment:
**Captured default arg `_model_name` is never used inside `_process_row`**
`_model_name` is bound via a default argument to avoid the late-binding closure bug, but the function body never references it. Only `_client` is actually needed. This is dead code that adds noise without guarding anything.
```suggestion
def _process_row(row_tuple, row_aic, _client=client):
```
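The late-binding behavior this idiom guards against is easy to reproduce in isolation; the snippet below is illustrative and uses no names from the codebase.

```python
# Buggy: each lambda closes over the loop variable i, which all three
# closures share; by call time the loop has finished and i == 2.
buggy = [lambda: i for i in range(3)]

# Fixed: a default argument is evaluated at definition time, so each
# lambda captures the value i held on its own iteration -- the same
# idiom as `_client=client` in `_process_row`.
fixed = [lambda _i=i: _i for i in range(3)]

print([f() for f in buggy])  # [2, 2, 2]
print([f() for f in fixed])  # [0, 1, 2]
```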
How can I resolve this? If you propose a fix, please make it concise.

---

Quoted code under review, `nemo_retriever/src/nemo_retriever/io/markdown.py` (`build_page_index`, lines 363-396), cleaned from the diff view:

```python
if dataframe is not None:
    df = dataframe
else:
    parquet_path = Path(parquet_dir)
    if not parquet_path.is_dir():
        raise FileNotFoundError(f"Parquet directory not found: {parquet_path}")
    single = parquet_path / "extraction.parquet"
    if single.is_file():
        parquet_files = [single]
    else:
        parquet_files = sorted(f for f in parquet_path.rglob("*.parquet") if f.name != "extraction.parquet")
    if not parquet_files:
        raise FileNotFoundError(f"No .parquet files in {parquet_path}")
    dfs = [_read_parquet_for_markdown(f) for f in parquet_files]
    df = pd.concat(dfs, ignore_index=True)

path_col = "path" if "path" in df.columns else "source_id"
if path_col not in df.columns:
    raise KeyError(f"Neither 'path' nor 'source_id' found in columns: {list(df.columns)}")

list_keys = ("tables", "table", "charts", "chart", "infographics", "infographic")

docs_grouped: dict[str, list[dict]] = defaultdict(list)
for _, row in df.iterrows():
    source = str(row.get(path_col, ""))
    if not source:
        continue
    record = row.to_dict()
    for key in list_keys:
        val = record.get(key)
        if isinstance(val, np.ndarray):
            record[key] = val.tolist()
    docs_grouped[source].append(record)
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/io/markdown.py
Line: 363-396
Comment:
**Large-column filtering not applied on the `dataframe=` path**
`_read_parquet_for_markdown` explicitly restricts columns to `_MARKDOWN_PARQUET_COLUMNS` and its docstring says this "avoids multi-GB memory spikes" from embedding vectors and page images. When `build_page_index` is called with `dataframe=result_df` from `graph_pipeline.py`, the full ingestion DataFrame (with embeddings, base64 page images, etc.) flows through unchanged — every `row.to_dict()` creates a record dict carrying all columns.
For the 767-PDF use case (~84k rows), applying the same column selection before iterating would keep memory consistent with the Parquet path:
```python
if dataframe is not None:
available = set(dataframe.columns)
cols = sorted(_MARKDOWN_PARQUET_COLUMNS & available)
df = dataframe[cols] if cols else dataframe
else:
...
```
How can I resolve this? If you propose a fix, please make it concise.| "clients for heterogeneous judges.", | ||
| len(distinct_judges), | ||
| sorted(distinct_judges), | ||
| first_judge_key, | ||
| ) | ||
| config.setdefault("judge", models[first_judge_key]) | ||
|
|
||
| elif "generators" in config and "judge" in config: | ||
| generators = config["generators"] | ||
| judge_cfg = config["judge"] | ||
| judge_name = judge_cfg.get("name", "judge") | ||
| config.setdefault("models", {}) |
There was a problem hiding this comment.
Heterogeneous judges silently collapsed to first judge for
build_eval_chain/build_eval_pipeline
When the new models/evaluations format specifies different judges for different evaluations, config["judge"] is set to only the first evaluation's judge. run_eval_sweep is unaffected (it reads models[judge_name] per evaluation), but build_eval_chain and build_eval_pipeline — which both read config["judge"] — will silently apply only the first judge across all evaluations. The logger.warning fires, but the result is wrong rather than an error.
Consider raising a ValueError instead of a warning when build_eval_chain/build_eval_pipeline are called with a config that has multiple distinct judges, so callers know to switch to run_eval_sweep.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/config.py
Line: 146-157
Comment:
**Heterogeneous judges silently collapsed to first judge for `build_eval_chain`/`build_eval_pipeline`**
When the new `models`/`evaluations` format specifies different judges for different evaluations, `config["judge"]` is set to only the first evaluation's judge. `run_eval_sweep` is unaffected (it reads `models[judge_name]` per evaluation), but `build_eval_chain` and `build_eval_pipeline` — which both read `config["judge"]` — will silently apply only the first judge across all evaluations. The `logger.warning` fires, but the result is wrong rather than an error.
Consider raising a `ValueError` instead of a warning when `build_eval_chain`/`build_eval_pipeline` are called with a config that has multiple distinct judges, so callers know to switch to `run_eval_sweep`.
How can I resolve this? If you propose a fix, please make it concise.

7ce7eba to 37a4d53
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py
Line: 729
Comment:
**`eval_config=None` produces confusing error**
When `--evaluation-mode qa` is used without `--eval-config`, `eval_config` is `None` (its default). `str(None)` is `"None"`, so `load_eval_config("None")` runs and raises `FileNotFoundError: Config file not found: None`. The user has no idea they need to supply `--eval-config`.
Add a guard before line 729:
```suggestion
if eval_config is None:
logger.error("--eval-config is required when --evaluation-mode=qa")
raise typer.Exit(code=1)
qa_cfg = load_eval_config(str(eval_config))
```
**Rule Used:** Error messages must tell the user what happened AN... ([source](https://app.greptile.com/review/custom-context?memory=actionable-errors))
How can I resolve this? If you propose a fix, please make it concise.

---

The eval CLI (cli.py) already aborts when retrieval coverage falls below the configured min_coverage. The graph_pipeline QA path logged coverage but never enforced the threshold, creating inconsistent behavior between the two entry points.

Made-with: Cursor
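The threshold enforcement that commit describes could reduce to a guard like the one below; the function name and call site are hypothetical, and only the min_coverage semantics come from the commit message.

```python
def enforce_min_coverage(coverage: float, min_coverage: float) -> None:
    """Abort before evaluation when retrieval coverage is too low.

    Hypothetical helper illustrating the check; graph_pipeline.py may
    wire the threshold differently.
    """
    if coverage < min_coverage:
        raise ValueError(
            f"Retrieval coverage {coverage:.1%} is below the configured "
            f"minimum {min_coverage:.1%}; aborting before evaluation."
        )
```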
- build_eval_chain() now raises a clear ValueError when retrieval.type is not 'file', matching build_eval_pipeline().
- graph_pipeline validates eval_config and query_csv file existence before starting ingestion so a typo does not waste a full run.

Made-with: Cursor
Quoted code under review, `nemo_retriever/src/nemo_retriever/evaluation/retrievers.py` (`FileRetriever._from_dict`, lines 92-125), cleaned from the diff view:

```python
@classmethod
def _from_dict(cls, queries: dict[str, dict]) -> "FileRetriever":
    """Build a FileRetriever from an in-memory queries dict.

    Bypasses file I/O while reusing the same normalized index that
    ``__init__`` builds from JSON. All instance methods (``retrieve``,
    ``check_coverage``) work identically afterwards.

    Parameters
    ----------
    queries : dict
        ``{query_text: {"chunks": [...], "metadata": [...]}}`` --
        the same shape as the ``"queries"`` value in a retrieval JSON.
    """
    if not queries:
        raise ValueError("FileRetriever._from_dict: queries dict is empty")
    sample = next(iter(queries.values()), {})
    if not isinstance(sample.get("chunks"), list):
        raise ValueError(
            "FileRetriever._from_dict: first entry is missing a 'chunks' list. "
            'Expected: {"query": {"chunks": ["..."]}}'
        )

    instance = object.__new__(cls)
    instance.file_path = "<in-memory>"
    instance._norm_index = {}
    instance._raw_keys = {}
    instance._miss_count = 0
    instance._miss_lock = threading.Lock()
    for raw_key, value in queries.items():
        norm = _normalize_query(raw_key)
        instance._norm_index[norm] = value
        instance._raw_keys[norm] = raw_key
    return instance
```
Description
Added an in-memory path so you can run the full QA eval pipeline (ingest through scoring) in one command without intermediate files, alongside the existing separable export + eval workflow.
Built off of this existing PR #1754.
Checklist