Fix OKP Solr RAG URL enrichment by mwcz · Pull Request #1561 · lightspeed-core/lightspeed-stack

mwcz · 2026-04-21T14:41:21Z

This is an alternative to #1431

Summary

Fixes some areas where the code was confused about the distinction between RH_SERVER_OKP URL and the Solr RAG URL. RH_SERVER_OKP is meant to be the base URL of the OKP instance, and reference links can be joined onto it (ex: $(RH_SERVER_OKP)/articles/1234), and the Solr URL can also be joined onto it (ex: $(RH_SERVER_OKP)/solr).
Replace manual string concatenation for URL construction with urllib.parse.urljoin in both llama_stack_configuration.py (solr URL from rhokp_url) and utils/vector_search.py (OKP document URLs).
Removes custom logic for stripping/appending slashes and handling absolute URLs, since urljoin handles all of that correctly.

Test plan

Verify solr URL is correctly constructed from rhokp_url config
Verify OKP document URLs are correctly built from base URL + reference path
Run uv run make test-unit to confirm no regressions

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Improved URL construction logic in internal components for better standardization and maintainability.

coderabbitai · 2026-04-21T14:41:37Z

Walkthrough

The changes refactor URL construction logic in two configuration and utility functions to use urllib.parse.urljoin instead of manual string concatenation. The Solr base URL in configuration enrichment and the document URL joining in vector search utilities are updated to leverage URL joining semantics.

Changes

Cohort / File(s)	Summary
URL Construction Refactoring `src/llama_stack_configuration.py`, `src/utils/vector_search.py`	Replaced manual string concatenation with `urllib.parse.urljoin` for standardized URL joining. Updated `enrich_solr` to derive Solr URL by combining OKP base URL with `/solr` path, and refactored `_join_okp_doc_url` to use `urljoin` for document URL construction instead of slash/protocol manipulation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fix OKP Solr RAG URL enrichment' directly and clearly summarizes the main change: fixing the URL construction logic for OKP Solr RAG by replacing manual string concatenation with urllib.parse.urljoin.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llama_stack_configuration.py`:
- Around line 472-475: The current solr_url construction uses urljoin(base_url,
"/solr") which treats "/solr" as an absolute path and discards path segments in
base_url; update the solr_url logic to ensure base_url ends with a trailing
slash (when using rhokp_raw or constants.RH_SERVER_OKP_DEFAULT_URL) and call
urljoin with the relative segment "solr" instead of "/solr" (i.e., produce
solr_url using base_url if base_url.endswith("/") else base_url + "/" and
urljoin(..., "solr")); adjust the code where base_url is defined (variables:
rhokp_raw, base_url) and where solr_url is set (solr_url, urljoin) accordingly.

In `@src/utils/vector_search.py`:
- Line 10: _join_okp_doc_url currently allows
absolute/scheme-relative/absolute-path references which can escape the
configured OKP origin; fix it in _join_okp_doc_url by parsing the incoming
reference (using urllib.parse.urlparse), stripping any scheme and netloc,
removing leading slashes from the parsed.path, and then reconstructing a
relative reference from path (+ params/query/fragment if needed) before calling
urljoin with the configured base; this ensures references derived from
parent_id/doc_id/reference_url remain origin-relative and cannot override the
OKP host.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5cdcedac-adad-4447-95b6-2f2af373f693

📥 Commits

Reviewing files that changed from the base of the PR and between ca47172 and c05c1d8.

📒 Files selected for processing (2)

src/llama_stack_configuration.py
src/utils/vector_search.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)

GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
GitHub Check: E2E: library mode / ci / group 1
GitHub Check: E2E Tests for Lightspeed Evaluation job
GitHub Check: Pylinter
GitHub Check: integration_tests (3.12)
GitHub Check: E2E: library mode / ci / group 3
GitHub Check: E2E: server mode / ci / group 3
GitHub Check: integration_tests (3.13)
GitHub Check: Pyright
GitHub Check: E2E: server mode / ci / group 1
GitHub Check: E2E: library mode / ci / group 2
GitHub Check: E2E: server mode / ci / group 2
GitHub Check: mypy
GitHub Check: unit_tests (3.12)
GitHub Check: unit_tests (3.13)
GitHub Check: build-pr

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

src/llama_stack_configuration.py
src/utils/vector_search.py

src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models extend ConfigurationBase for config, BaseModel for data models

Files:

src/llama_stack_configuration.py
src/utils/vector_search.py

coderabbitai · 2026-04-21T14:47:13Z

+    base_url = (
        str(rhokp_raw) if rhokp_raw is not None else constants.RH_SERVER_OKP_DEFAULT_URL
    )
+    solr_url = urljoin(base_url, "/solr")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash python - <<'PY' from urllib.parse import urljoin cases = [ ("${env.RH_SERVER_OKP}", "/solr"), ("${env.RH_SERVER_OKP}/", "solr"), ("https://okp.example.com/prefix", "/solr"), ("https://okp.example.com/prefix/", "solr"), ] for base, ref in cases: print(f"urljoin({base!r}, {ref!r}) -> {urljoin(base, ref)!r}") PY

Repository: lightspeed-core/lightspeed-stack

Length of output: 376

🏁 Script executed:

# Verify the file exists and check the specific lines head -500 src/llama_stack_configuration.py | tail -50

Repository: lightspeed-core/lightspeed-stack

Length of output: 2005

🏁 Script executed:

# Search for the specific code pattern and context rg -n "urljoin.*solr" -A 2 -B 2 src/

Repository: lightspeed-core/lightspeed-stack

Length of output: 455

🏁 Script executed:

# Check the constants file for RH_SERVER_OKP_DEFAULT_URL rg -n "RH_SERVER_OKP_DEFAULT_URL" src/

Repository: lightspeed-core/lightspeed-stack

Length of output: 375

🏁 Script executed:

# Get more context around lines 472-475 sed -n '460,485p' src/llama_stack_configuration.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 906

Use relative path segment to preserve base URL.

Line 475 uses an absolute path /solr in urljoin, which resets the URL to the root. For base URLs like "${env.RH_SERVER_OKP}" or "https://host/okp", this drops the base entirely or the path prefix. Ensure the base URL has a trailing slash and use "solr" as a relative segment:

solr_url = urljoin(base_url if base_url.endswith("/") else base_url + "/", "solr")

This preserves the full base URL in all cases: "${env.RH_SERVER_OKP}/solr" and "https://host/okp/solr".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/llama_stack_configuration.py` around lines 472 - 475, The current solr_url construction uses urljoin(base_url, "/solr") which treats "/solr" as an absolute path and discards path segments in base_url; update the solr_url logic to ensure base_url ends with a trailing slash (when using rhokp_raw or constants.RH_SERVER_OKP_DEFAULT_URL) and call urljoin with the relative segment "solr" instead of "/solr" (i.e., produce solr_url using base_url if base_url.endswith("/") else base_url + "/" and urljoin(..., "solr")); adjust the code where base_url is defined (variables: rhokp_raw, base_url) and where solr_url is set (solr_url, urljoin) accordingly.

RHOKP's internal Solr service is always at /solr and can't be at a subpatch like https://host/okp/solr. If that ever changes, then we can update this code.

@mwcz, thanks for the clarification! That makes sense — since the Solr service is always rooted at /solr on the RHOKP host, using an absolute path with urljoin is the correct behavior here. My concern doesn't apply. The current implementation is correct.

(∪･ω･∪)

✏️ Learnings added

Learnt from: mwcz Repo: lightspeed-core/lightspeed-stack PR: 1561 File: src/llama_stack_configuration.py:472-475 Timestamp: 2026-04-21T15:17:16.516Z Learning: In the lightspeed-stack repository (src/llama_stack_configuration.py), the RHOKP internal Solr service is always located at the absolute path `/solr` on the host (e.g., `https://host/solr`). It is never at a subpath like `/okp/solr`. Therefore, using `urljoin(base_url, "/solr")` is intentional and correct — the absolute `/solr` path should not be changed to a relative segment.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2026-04-21T14:47:13Z

 import asyncio
 import traceback
 from typing import Any, Optional, cast
+from urllib.parse import urljoin


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash python - <<'PY' from urllib.parse import urljoin base = "https://okp.example.com/prefix" references = [ "/articles/123", "articles/123", "//evil.example/articles/123", "https://evil.example/articles/123", ] for reference in references: print(f"urljoin({base!r}, {reference!r}) -> {urljoin(base, reference)!r}") PY

Repository: lightspeed-core/lightspeed-stack

Length of output: 506

🏁 Script executed:

# First, check if the file exists and examine the import and function fd vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 141

🏁 Script executed:

# Check the file size first wc -l src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 106

🏁 Script executed:

# Read the import section around line 10 head -20 src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 656

🏁 Script executed:

# Read the function around lines 568-580 sed -n '560,590p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 967

🏁 Script executed:

# Find all calls to _join_okp_doc_url rg "_join_okp_doc_url" -B 3 -A 3

Repository: lightspeed-core/lightspeed-stack

Length of output: 1787

🏁 Script executed:

# Find _get_okp_base_url and understand the base URL configuration rg "_get_okp_base_url" -B 5 -A 10

Repository: lightspeed-core/lightspeed-stack

Length of output: 7860

🏁 Script executed:

# Look at the _build_document_url function context more broadly sed -n '582,620p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 1321

🏁 Script executed:

# Check for tests of _join_okp_doc_url and related functionality rg "join_okp_doc_url\|_build_document_url" tests/ -B 2 -A 10

Repository: lightspeed-core/lightspeed-stack

Length of output: 58

🏁 Script executed:

# Look at the actual metadata sources to understand the data flow rg "parent_id\|reference_url\|chunk.metadata" src/utils/vector_search.py -B 2 -A 2

Repository: lightspeed-core/lightspeed-stack

Length of output: 58

🏁 Script executed:

# Check the test file directly for tests related to _join_okp_doc_url grep -n "join_okp_doc_url\|_build_document_url" tests/unit/utils/test_vector_search.py -B 3 -A 10

Repository: lightspeed-core/lightspeed-stack

Length of output: 2750

🏁 Script executed:

# Look at parent_id and reference_url usage in the code sed -n '595,650p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 2140

Enforce origin-relative references in _join_okp_doc_url to prevent URL hijacking.

The function's docstring promises "origin-relative document path," but the implementation accepts absolute URLs (https://...), scheme-relative URLs (//...), and absolute paths (/...), all of which override the configured base. If reference parameter values (from parent_id, doc_id, or reference_url metadata) contain such URLs—whether through API poisoning or data corruption—the function silently generates URLs outside the intended OKP host.

Strip any scheme and netloc from the reference before joining against the base URL, and remove leading slashes to ensure the reference remains relative:

Proposed fix

-from urllib.parse import urljoin +from urllib.parse import urljoin, urlsplit, urlunsplit

def _join_okp_doc_url(base_url: AnyUrl, reference: Optional[str]) -> str: """Build a well-formed document URL from base and reference path. @@ Returns: Well-formed doc_url string, or empty string if reference is empty. """ - if not reference: + if reference is None: return "" - return urljoin(str(base_url), reference) + + reference = reference.strip() + if not reference: + return "" + + parsed_reference = urlsplit(reference) + if parsed_reference.scheme or parsed_reference.netloc: + reference = urlunsplit( + ( + "", + "", + parsed_reference.path, + parsed_reference.query, + parsed_reference.fragment, + ) + ) + + reference = reference.lstrip("/") + if not reference: + return "" + + return urljoin(f"{str(base_url).rstrip('/')}/", reference)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/utils/vector_search.py` at line 10, _join_okp_doc_url currently allows absolute/scheme-relative/absolute-path references which can escape the configured OKP origin; fix it in _join_okp_doc_url by parsing the incoming reference (using urllib.parse.urlparse), stripping any scheme and netloc, removing leading slashes from the parsed.path, and then reconstructing a relative reference from path (+ params/query/fragment if needed) before calling urljoin with the configured base; this ensures references derived from parent_id/doc_id/reference_url remain origin-relative and cannot override the OKP host.

Anxhela21

/LGTM

syedriko · 2026-04-21T15:12:03Z

@are-ces PTAL, re: #1431

tisnik

LGTM

fix solr url construction from rhokp_url

c05c1d8

mwcz mentioned this pull request Apr 21, 2026

Fix Solr OKP default URL missing /solr path #1431

Draft

4 tasks

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Anxhela21 approved these changes Apr 21, 2026

View reviewed changes

tisnik approved these changes Apr 21, 2026

View reviewed changes

tisnik requested a review from are-ces April 21, 2026 15:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix OKP Solr RAG URL enrichment#1561

Fix OKP Solr RAG URL enrichment#1561
mwcz wants to merge 1 commit intolightspeed-core:mainfrom
mwcz:fix-rhokp-url-solr-url

mwcz commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 21, 2026 •

edited

Loading

Uh oh!

mwcz Apr 21, 2026

Uh oh!

coderabbitai Bot Apr 21, 2026

Uh oh!

coderabbitai Bot Apr 21, 2026

Uh oh!

Anxhela21 left a comment

Uh oh!

syedriko commented Apr 21, 2026

Uh oh!

tisnik left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

mwcz commented Apr 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mwcz Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Anxhela21 left a comment

Choose a reason for hiding this comment

Uh oh!

syedriko commented Apr 21, 2026

Uh oh!

tisnik left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mwcz commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

coderabbitai Bot Apr 21, 2026 •

edited

Loading