This feature aims to significantly enhance the LLM's understanding and response quality by allowing users to include external web content directly within their prompts. The system will automatically detect, fetch, parse, and condense URLs provided by the user, transforming them into an LLM-ready format.
1. Core Functionality & User Flow
The system should seamlessly integrate URL processing into the user's chat experience:
- URL Extraction (Frontend - Real-time Feedback):
- Trigger: When a user pastes text into the message input field or types a URL, the frontend should immediately detect valid URLs.
- Visual Feedback: As URLs are detected, they should visually transform into a compact, distinct "chip" or pill-shaped element within the message input area. This provides instant feedback that the URL has been recognized for parsing.
- User Experience: This real-time transformation allows the user to see which URLs will be processed before sending the message, enabling them to adjust their input.
- Validation: Basic frontend validation should ensure the URL format is plausible (e.g., http(s)://...).
- "Unparse" Functionality:
- Users should be able to "unparse" a URL via a small "X" icon on the chip itself.
- When "unparsed", the chip reverts to the original plain text URL in the message display.
- The specific URL is no longer sent to the LLM.
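A minimal sketch of the real-time detection step, assuming a TypeScript frontend; the regex, the DetectedUrl shape, and the extractUrls name are illustrative, not an existing API:

```typescript
// Matches http(s) URLs up to the next whitespace or closing bracket/quote.
const URL_PATTERN = /https?:\/\/[^\s<>"')\]]+/g;

interface DetectedUrl {
  url: string;   // the matched URL text
  start: number; // index in the input where the match begins
  end: number;   // index just past the match
}

// Scan the input text and return every plausible http(s) URL with its
// position, so the frontend can replace each match with a chip in place.
function extractUrls(text: string): DetectedUrl[] {
  const results: DetectedUrl[] = [];
  for (const match of text.matchAll(URL_PATTERN)) {
    const url = match[0];
    // Basic plausibility check: the URL constructor rejects malformed input.
    try {
      new URL(url);
    } catch {
      continue;
    }
    results.push({ url, start: match.index!, end: match.index! + url.length });
  }
  return results;
}
```

Keeping the match offsets makes the chip substitution and later "unparse" reversal straightforward, since the original text span is known.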
- URL Fetching & Parsing (Backend - firecrawl):
- Trigger: When the message is sent, the backend initiates the fetching process.
- Tooling: Utilize https://github.com/mendableai/firecrawl (or a similar robust web scraping solution) to fetch the content of each detected URL.
- Content Extraction: The primary goal is to extract the main textual content from the webpage. Considerations for firecrawl:
- How to handle different content types (e.g., articles, product pages, forum posts).
- Filtering out boilerplate (headers, footers, sidebars, ads).
- Prioritizing semantic content.
- LLM-Ready Format: The extracted content must be transformed into a concise, plain-text format suitable for LLM input. This may involve:
- Stripping HTML/CSS.
- Summarization or truncation if the content is excessively long; define a maximum character limit for parsed content per URL (e.g., 2000-4000 characters), and consider a setting to enable/disable summarization.
- Potentially prefixing with a tag like "Context from URL: [URL]" to clearly delineate it for the LLM.
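The condensing rules above can be sketched as follows; the 3000-character cap (chosen from within the suggested 2000-4000 range), the whitespace collapsing, and the toLlmContext name are assumptions for illustration:

```typescript
const MAX_CHARS_PER_URL = 3000; // within the suggested 2000-4000 range

// Turn extracted page text into the LLM-ready snippet described above:
// strip leftover whitespace, truncate long pages, and prefix the source tag.
function toLlmContext(url: string, extractedText: string): string {
  // Collapse runs of whitespace left over from HTML/CSS stripping.
  let body = extractedText.replace(/\s+/g, " ").trim();
  // Truncate overly long pages; summarization could be swapped in here
  // when the proposed summarization setting is enabled.
  if (body.length > MAX_CHARS_PER_URL) {
    body = body.slice(0, MAX_CHARS_PER_URL) + "…";
  }
  // Delineate the snippet so the LLM can attribute it to its source.
  return `Context from URL: ${url}\n${body}`;
}
```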
- Error Handling (Backend):
- Gracefully handle common web errors (404 Not Found, 500 Server Error, timeouts, network issues).
- Handle inaccessible content (paywalls, CAPTCHAs, bot blocking).
- Inform the frontend if parsing fails for a specific URL, so it can display an appropriate error state.
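A hedged sketch of the error handling at the fetch level; the ParseResult shape and fetchWithHandling helper are hypothetical and do not represent firecrawl's actual API, which would sit behind this layer:

```typescript
type ParseResult =
  | { ok: true; url: string; content: string }
  | { ok: false; url: string; error: string };

// Fetch one URL with a timeout, mapping every failure mode to a result
// the frontend can render as an error state instead of crashing the send.
async function fetchWithHandling(url: string, timeoutMs = 10_000): Promise<ParseResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) {
      // Covers 404 Not Found, 500 Server Error, and 403s from paywalls/bot blocking.
      return { ok: false, url, error: `HTTP ${res.status}` };
    }
    return { ok: true, url, content: await res.text() };
  } catch (e) {
    // Timeouts surface as AbortError; DNS and network failures as TypeError.
    return { ok: false, url, error: e instanceof Error ? e.name : "unknown error" };
  } finally {
    clearTimeout(timer);
  }
}
```

Returning a tagged result per URL (rather than throwing) lets one failed URL degrade gracefully while the rest of the message still goes through.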
- UI Representation of Parsed URLs (Frontend - Post-Send):
- Chip Display: In the displayed user message (after sending), the original URL text will be replaced by a visually distinct "chip" that represents the parsed content.
- Chip Appearance:
- Should include the domain name or a truncated URL.
- Ideally, display the website's favicon (if retrievable during parsing).
- Tooltip on hover could show the full URL or a short summary of the parsed content.
- Interactive Behavior (<a> tag): Clicking the chip should behave like a standard <a> tag, opening the original URL in a new browser tab.
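The chip's display data could be derived roughly like this; the /favicon.ico lookup is a common convention rather than a guaranteed location, and the ChipInfo shape is illustrative:

```typescript
interface ChipInfo {
  label: string;      // domain name shown on the chip
  fullUrl: string;    // shown in the hover tooltip and used as the <a> href
  faviconUrl: string; // conventional favicon location on the origin
}

// Derive the chip's label, tooltip URL, and favicon from a parsed URL.
function chipInfo(rawUrl: string, maxLabelLen = 30): ChipInfo {
  const u = new URL(rawUrl);
  // Truncate very long hostnames so the chip stays compact.
  const label =
    u.hostname.length > maxLabelLen
      ? u.hostname.slice(0, maxLabelLen - 1) + "…"
      : u.hostname;
  return { label, fullUrl: rawUrl, faviconUrl: `${u.origin}/favicon.ico` };
}
```

If firecrawl returns page metadata, a real favicon URL from the parsed page would be preferable to the /favicon.ico guess.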
- Message Footer Chip (Frontend):
- Visibility: If one or more URLs are successfully parsed and included in the prompt, a small chip should appear in the user's message footer.
- Content: This chip should display the number of parsed URLs, e.g., "3 parsed URLs".
- Interactivity: Clicking this chip could expand a small list of the parsed URLs, showing their titles or truncated content, providing a quick overview.
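The footer chip's visibility rule and label can be pinned down in a few lines; the ParsedUrlStatus shape and footerChipLabel name are illustrative:

```typescript
interface ParsedUrlStatus {
  url: string;
  parsed: boolean; // false if fetching/parsing failed for this URL
}

// Returns null when no URLs parsed successfully, so the footer chip stays
// hidden; otherwise returns a label like "3 parsed URLs".
function footerChipLabel(statuses: ParsedUrlStatus[]): string | null {
  const count = statuses.filter((s) => s.parsed).length;
  if (count === 0) return null;
  return `${count} parsed URL${count === 1 ? "" : "s"}`;
}
```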
2. User Settings & Control
- Toggle Auto-Parsing (General Settings): A global setting to enable or disable automatic URL parsing; when disabled, URLs are sent as plain text.
- URL Blacklist (Settings):
- Allow users to block specific URLs (e.g., https://example.com/sensitive-page) or entire domains (e.g., example.com to block all URLs from that domain, including subdomains like sub.example.com).
- Consider supporting wildcards (e.g., *.example.com) or regular expressions for more complex blocking patterns.
- Useful for blocking localhost, 127.0.0.1, internal network addresses, or specific sensitive company URLs.
3. Technical Considerations & Edge Cases
- LLM Integration:
- Decide how parsed content is passed to the model (e.g., via a context parameter, or via a dedicated system message).
- Security:
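One concrete security measure, connecting back to the blacklist settings in Section 2: check every URL against the blacklist (including localhost and internal addresses, which guard against server-side request forgery) before any fetch happens. The entry formats below are assumptions modeled on the examples given earlier:

```typescript
// Return true if the URL matches any blacklist entry. Supported entry
// formats (assumed): full URLs, bare domains, and *.domain wildcards.
function isBlacklisted(rawUrl: string, blacklist: string[]): boolean {
  const u = new URL(rawUrl);
  return blacklist.some((entry) => {
    if (entry.startsWith("*.")) {
      // *.example.com matches the bare domain and any subdomain.
      const base = entry.slice(2);
      return u.hostname === base || u.hostname.endsWith("." + base);
    }
    if (entry.includes("://")) {
      // Full-URL entries match by prefix, so query strings don't evade them.
      return rawUrl.startsWith(entry);
    }
    // Bare domain entries block that host and its subdomains.
    return u.hostname === entry || u.hostname.endsWith("." + entry);
  });
}
```

Running this check on the backend (not only in the frontend) is what actually prevents SSRF, since frontend checks can be bypassed.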
- Performance & Scalability:
- Error Handling (Frontend Display): If parsing fails for a URL, the chip should show a clear error state (per the backend error reporting above) rather than silently dropping the URL.
- State Management: