Skip to content

Fix OKP Solr RAG URL enrichment#1561

Open
mwcz wants to merge 1 commit intolightspeed-core:mainfrom
mwcz:fix-rhokp-url-solr-url
Open

Fix OKP Solr RAG URL enrichment#1561
mwcz wants to merge 1 commit intolightspeed-core:mainfrom
mwcz:fix-rhokp-url-solr-url

Conversation

@mwcz
Copy link
Copy Markdown
Contributor

@mwcz mwcz commented Apr 21, 2026

This is an alternative to #1431

Summary

  • Fixes some areas where the code was confused about the distinction between RH_SERVER_OKP URL and the Solr RAG URL. RH_SERVER_OKP is meant to be the base URL of the OKP instance, and reference links can be joined onto it (ex: $(RH_SERVER_OKP)/articles/1234), and the Solr URL can also be joined onto it (ex: $(RH_SERVER_OKP)/solr).
  • Replace manual string concatenation for URL construction with urllib.parse.urljoin in both llama_stack_configuration.py (solr URL from rhokp_url) and utils/vector_search.py (OKP document URLs).
  • Removes custom logic for stripping/appending slashes and handling absolute URLs, since urljoin handles all of that correctly.

Test plan

  • Verify solr URL is correctly constructed from rhokp_url config
  • Verify OKP document URLs are correctly built from base URL + reference path
  • Run uv run make test-unit to confirm no regressions

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Improved URL construction logic in internal components for better standardization and maintainability.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

Walkthrough

The changes refactor URL construction logic in two configuration and utility functions to use urllib.parse.urljoin instead of manual string concatenation. The Solr base URL in configuration enrichment and the document URL joining in vector search utilities are updated to leverage URL joining semantics.

Changes

Cohort / File(s) Summary
URL Construction Refactoring
src/llama_stack_configuration.py, src/utils/vector_search.py
Replaced manual string concatenation with urllib.parse.urljoin for standardized URL joining. Updated enrich_solr to derive Solr URL by combining OKP base URL with /solr path, and refactored _join_okp_doc_url to use urljoin for document URL construction instead of slash/protocol manipulation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix OKP Solr RAG URL enrichment' directly and clearly summarizes the main change: fixing the URL construction logic for OKP Solr RAG by replacing manual string concatenation with urllib.parse.urljoin.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llama_stack_configuration.py`:
- Around line 472-475: The current solr_url construction uses urljoin(base_url,
"/solr") which treats "/solr" as an absolute path and discards path segments in
base_url; update the solr_url logic to ensure base_url ends with a trailing
slash (when using rhokp_raw or constants.RH_SERVER_OKP_DEFAULT_URL) and call
urljoin with the relative segment "solr" instead of "/solr" (i.e., produce
solr_url using base_url if base_url.endswith("/") else base_url + "/" and
urljoin(..., "solr")); adjust the code where base_url is defined (variables:
rhokp_raw, base_url) and where solr_url is set (solr_url, urljoin) accordingly.

In `@src/utils/vector_search.py`:
- Line 10: _join_okp_doc_url currently allows
absolute/scheme-relative/absolute-path references which can escape the
configured OKP origin; fix it in _join_okp_doc_url by parsing the incoming
reference (using urllib.parse.urlparse), stripping any scheme and netloc,
removing leading slashes from the parsed.path, and then reconstructing a
relative reference from path (+ params/query/fragment if needed) before calling
urljoin with the configured base; this ensures references derived from
parent_id/doc_id/reference_url remain origin-relative and cannot override the
OKP host.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5cdcedac-adad-4447-95b6-2f2af373f693

📥 Commits

Reviewing files that changed from the base of the PR and between ca47172 and c05c1d8.

📒 Files selected for processing (2)
  • src/llama_stack_configuration.py
  • src/utils/vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: Pylinter
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: integration_tests (3.13)
  • GitHub Check: Pyright
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: mypy
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: unit_tests (3.13)
  • GitHub Check: build-pr
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

  • src/llama_stack_configuration.py
  • src/utils/vector_search.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models extend ConfigurationBase for config, BaseModel for data models

Files:

  • src/llama_stack_configuration.py
  • src/utils/vector_search.py

Comment on lines +472 to +475
base_url = (
str(rhokp_raw) if rhokp_raw is not None else constants.RH_SERVER_OKP_DEFAULT_URL
)
solr_url = urljoin(base_url, "/solr")
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from urllib.parse import urljoin

cases = [
    ("${env.RH_SERVER_OKP}", "/solr"),
    ("${env.RH_SERVER_OKP}/", "solr"),
    ("https://okp.example.com/prefix", "/solr"),
    ("https://okp.example.com/prefix/", "solr"),
]

for base, ref in cases:
    print(f"urljoin({base!r}, {ref!r}) -> {urljoin(base, ref)!r}")
PY

Repository: lightspeed-core/lightspeed-stack

Length of output: 376


🏁 Script executed:

# Verify the file exists and check the specific lines
head -500 src/llama_stack_configuration.py | tail -50

Repository: lightspeed-core/lightspeed-stack

Length of output: 2005


🏁 Script executed:

# Search for the specific code pattern and context
rg -n "urljoin.*solr" -A 2 -B 2 src/

Repository: lightspeed-core/lightspeed-stack

Length of output: 455


🏁 Script executed:

# Check the constants file for RH_SERVER_OKP_DEFAULT_URL
rg -n "RH_SERVER_OKP_DEFAULT_URL" src/

Repository: lightspeed-core/lightspeed-stack

Length of output: 375


🏁 Script executed:

# Get more context around lines 472-475
sed -n '460,485p' src/llama_stack_configuration.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 906


Use relative path segment to preserve base URL.

Line 475 uses an absolute path /solr in urljoin, which resets the URL to the root. For base URLs like "${env.RH_SERVER_OKP}" or "https://host/okp", this drops the base entirely or the path prefix. Ensure the base URL has a trailing slash and use "solr" as a relative segment:

solr_url = urljoin(base_url if base_url.endswith("/") else base_url + "/", "solr")

This preserves the full base URL in all cases: "${env.RH_SERVER_OKP}/solr" and "https://host/okp/solr".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 472 - 475, The current
solr_url construction uses urljoin(base_url, "/solr") which treats "/solr" as an
absolute path and discards path segments in base_url; update the solr_url logic
to ensure base_url ends with a trailing slash (when using rhokp_raw or
constants.RH_SERVER_OKP_DEFAULT_URL) and call urljoin with the relative segment
"solr" instead of "/solr" (i.e., produce solr_url using base_url if
base_url.endswith("/") else base_url + "/" and urljoin(..., "solr")); adjust the
code where base_url is defined (variables: rhokp_raw, base_url) and where
solr_url is set (solr_url, urljoin) accordingly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RHOKP's internal Solr service is always at /solr and can't be at a subpatch like https://host/okp/solr. If that ever changes, then we can update this code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwcz, thanks for the clarification! That makes sense — since the Solr service is always rooted at /solr on the RHOKP host, using an absolute path with urljoin is the correct behavior here. My concern doesn't apply. The current implementation is correct.

(∪・ω・∪)


✏️ Learnings added
Learnt from: mwcz
Repo: lightspeed-core/lightspeed-stack PR: 1561
File: src/llama_stack_configuration.py:472-475
Timestamp: 2026-04-21T15:17:16.516Z
Learning: In the lightspeed-stack repository (src/llama_stack_configuration.py), the RHOKP internal Solr service is always located at the absolute path `/solr` on the host (e.g., `https://host/solr`). It is never at a subpath like `/okp/solr`. Therefore, using `urljoin(base_url, "/solr")` is intentional and correct — the absolute `/solr` path should not be changed to a relative segment.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

import asyncio
import traceback
from typing import Any, Optional, cast
from urllib.parse import urljoin
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from urllib.parse import urljoin

base = "https://okp.example.com/prefix"
references = [
    "/articles/123",
    "articles/123",
    "//evil.example/articles/123",
    "https://evil.example/articles/123",
]

for reference in references:
    print(f"urljoin({base!r}, {reference!r}) -> {urljoin(base, reference)!r}")
PY

Repository: lightspeed-core/lightspeed-stack

Length of output: 506


🏁 Script executed:

# First, check if the file exists and examine the import and function
fd vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 141


🏁 Script executed:

# Check the file size first
wc -l src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 106


🏁 Script executed:

# Read the import section around line 10
head -20 src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 656


🏁 Script executed:

# Read the function around lines 568-580
sed -n '560,590p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 967


🏁 Script executed:

# Find all calls to _join_okp_doc_url
rg "_join_okp_doc_url" -B 3 -A 3

Repository: lightspeed-core/lightspeed-stack

Length of output: 1787


🏁 Script executed:

# Find _get_okp_base_url and understand the base URL configuration
rg "_get_okp_base_url" -B 5 -A 10

Repository: lightspeed-core/lightspeed-stack

Length of output: 7860


🏁 Script executed:

# Look at the _build_document_url function context more broadly
sed -n '582,620p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 1321


🏁 Script executed:

# Check for tests of _join_okp_doc_url and related functionality
rg "join_okp_doc_url\|_build_document_url" tests/ -B 2 -A 10

Repository: lightspeed-core/lightspeed-stack

Length of output: 58


🏁 Script executed:

# Look at the actual metadata sources to understand the data flow
rg "parent_id\|reference_url\|chunk.metadata" src/utils/vector_search.py -B 2 -A 2

Repository: lightspeed-core/lightspeed-stack

Length of output: 58


🏁 Script executed:

# Check the test file directly for tests related to _join_okp_doc_url
grep -n "join_okp_doc_url\|_build_document_url" tests/unit/utils/test_vector_search.py -B 3 -A 10

Repository: lightspeed-core/lightspeed-stack

Length of output: 2750


🏁 Script executed:

# Look at parent_id and reference_url usage in the code
sed -n '595,650p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 2140


Enforce origin-relative references in _join_okp_doc_url to prevent URL hijacking.

The function's docstring promises "origin-relative document path," but the implementation accepts absolute URLs (https://...), scheme-relative URLs (//...), and absolute paths (/...), all of which override the configured base. If reference parameter values (from parent_id, doc_id, or reference_url metadata) contain such URLs—whether through API poisoning or data corruption—the function silently generates URLs outside the intended OKP host.

Strip any scheme and netloc from the reference before joining against the base URL, and remove leading slashes to ensure the reference remains relative:

Proposed fix
-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlsplit, urlunsplit
 def _join_okp_doc_url(base_url: AnyUrl, reference: Optional[str]) -> str:
     """Build a well-formed document URL from base and reference path.
@@
     Returns:
         Well-formed doc_url string, or empty string if reference is empty.
     """
-    if not reference:
+    if reference is None:
         return ""
-    return urljoin(str(base_url), reference)
+
+    reference = reference.strip()
+    if not reference:
+        return ""
+
+    parsed_reference = urlsplit(reference)
+    if parsed_reference.scheme or parsed_reference.netloc:
+        reference = urlunsplit(
+            (
+                "",
+                "",
+                parsed_reference.path,
+                parsed_reference.query,
+                parsed_reference.fragment,
+            )
+        )
+
+    reference = reference.lstrip("/")
+    if not reference:
+        return ""
+
+    return urljoin(f"{str(base_url).rstrip('/')}/", reference)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` at line 10, _join_okp_doc_url currently allows
absolute/scheme-relative/absolute-path references which can escape the
configured OKP origin; fix it in _join_okp_doc_url by parsing the incoming
reference (using urllib.parse.urlparse), stripping any scheme and netloc,
removing leading slashes from the parsed.path, and then reconstructing a
relative reference from path (+ params/query/fragment if needed) before calling
urljoin with the configured base; this ensures references derived from
parent_id/doc_id/reference_url remain origin-relative and cannot override the
OKP host.

Copy link
Copy Markdown
Contributor

@Anxhela21 Anxhela21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@syedriko
Copy link
Copy Markdown
Contributor

@are-ces PTAL, re: #1431

Copy link
Copy Markdown
Contributor

@tisnik tisnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tisnik tisnik requested a review from are-ces April 21, 2026 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants