chore: Release release/2026-W24#6635
Open
github-actions[bot] wants to merge 468 commits into
Open
Conversation
This reverts commit 9395e1d.
…ermission-issue fix(auditor): Perm checks ignore when creating from release
Adds `Incident Settings Night Shift` child doctype to assign specific users per day-of-week for night hours. During night (outside DAY_HOURS), if a shift is defined for today, those users replace the default list. Repeat calls use round-robin ordering (next after last acknowledged) instead of calling the same person again. Falls back to default users when no shift is defined for today or shift users are absent from the main users table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds NIGHT_SHIFT_CALL_LIMIT (20). Once that many call attempts have been made (tracked via the updates child table) without resolving the incident, get_humans() skips night shift and falls back to the full default user list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test was failing with AuthenticationError because running the investigator foreground caused it to call get_prometheus_client() without credentials. The underlying issue: running investigation foreground also completed the investigator immediately, making waited_enough_for_investigator_reactions return True before the first resolve_incidents() check. Fix: mock the investigator's frappe.enqueue_doc to a no-op so investigate() is never queued, keeping the investigator in Investigating state for the first resolve_incidents() check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Also extract call-per-human logic into _attempt_call_human to stay within ruff C901 complexity limit.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…e-in-marketplace fix(marketplace): Add 'Press User' role permissions
fix(deploy-ui): Show pre-build errors
fix: Guard add_resource and sync against cross-team docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the latest site update is cancelled, fetch the press notification and display its message as a warning banner at the top of the updates list. Uses a new extraResource hook in ObjectList to load secondary data lazily. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain what happened and how to fix it in the backup restore, app changes, and login-as-admin error toasts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Resolved all merge conflicts. Here's a summary of what was done:
Commit: |
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (41.01%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## master #6635 +/- ##
==========================================
+ Coverage 49.97% 50.33% +0.36%
==========================================
Files 955 993 +38
Lines 79069 83514 +4445
Branches 374 523 +149
==========================================
+ Hits 39511 42037 +2526
- Misses 39532 41445 +1913
- Partials 26 32 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Previously, the investigation report for 500 and 502 errors would tell the support agent to manually open web.error.log via the log browser. This adds a collector that fetches and parses the log automatically. The collector: - Fetches web.error.log via the existing site.get_server_log() agent call. - Parses gunicorn-format entries with a compiled regex, grouping each ERROR/CRITICAL line with the trailing traceback lines that follow it. - Captures only the final exception message line from each traceback — not the full stack frames with local variables, which could carry personal information. - Runs all entries through redact() before storing. - Scans the last 500 lines and returns at most 10 error blocks. The report generator classifies the collected errors into three patterns: - OperationalError / "can't connect" → database connectivity failure. - ImportError / ModuleNotFoundError → broken app state after deployment. - CRITICAL level entries → worker crash or timeout. - Anything else → generic exception, with the message surfaced. The spec Non-Goals are updated to allow redacted exception message lines (the final line of a traceback) as structured data while continuing to exclude raw stack frames with local variables. (cherry picked from commit 756bccf)
Extends get_site_performance_summary with anomaly detection and custom-app identification to give 504 investigations more specific signal before falling back to a generic "use Recorder" recommendation. Changes: - Fetches up to 20 endpoints (was 5) so custom-app paths are not crowded out by core Frappe endpoints in the ranking. - Adds spike_detected per endpoint: peak >= 3x mean AND peak > 2 s. This surfaces endpoints that are occasionally very slow (e.g. a specific document type or a scheduled-job-triggered slow query) even when the 24-hour average is below the 1 s threshold. - Adds is_custom per endpoint: extracts the Python module name from /api/method/<module>.* paths and checks whether that module belongs to a non-Frappe app. App origin is determined by repository_owner on the AppSource record; anything other than "frappe" is custom. - Adds has_custom_apps to the summary so the report knows whether the bench has any non-Frappe apps at all. The bench name is now passed from collect_site_context so the app-source lookup can happen without an extra site query. (cherry picked from commit 3a54fbd)
Splits _add_performance_evidence into three focused functions: - _add_slow_endpoint_evidence: handles consistently-slow endpoints. When the slow endpoint belongs to a non-Frappe app it changes the cause to "custom app endpoints are slow — application-level" instead of the generic "web workers" cause. - _add_spiky_endpoint_evidence: handles endpoints where peak >= 3x mean and peak > 2 s, adding evidence and suggesting Recorder to capture the specific triggering request. - _add_performance_evidence: computes the slow/spiky lists and dispatches. Both conditions are checked independently so an endpoint can be both consistently slow and spiky (e.g. always 2 s but sometimes 30 s). Adds four new tests: - test_500_worker_timeout_in_web_log_flags_critical - test_504_custom_app_endpoint_flagged_as_application_level - test_504_spiky_endpoint_flagged_even_with_low_average - test_504_frappe_endpoint_slow_flags_web_workers (cherry picked from commit c94adf8)
Updates the 504 section to document the new endpoint analysis behavior: app origin detection via AppSource.repository_owner, spike detection (peak >= 3x mean and peak > 2 s), and how the report adjusts its cause and next steps based on whether the slow endpoint is from a custom app. Also updates the Collectors section to describe the enhanced site_performance_summary fields: is_custom, spike_detected, has_custom_apps, and the expanded 20-endpoint fetch window. (cherry picked from commit 2e0716d)
Replace the two external API calls in the investigation collectors with the shared HTTP clients from press/mcp: - get_server_metrics: was calling press.api.server.prometheus_query, a whitelisted API wrapper that also does timezone conversion and label alignment we don't need. Now uses prometheus_get from press.mcp.tools.telemetry.clients directly. Add _prom_params (builds the query_range param dict) and _prom_values (extracts a flat list of floats from the Prometheus matrix response). Simplify _summarise_series to take list[float] instead of the datasets dict that prometheus_query returned. - get_site_performance_summary: was calling get_request_by_ from press.api.analytics, which returns per-time-bucket datasets that we then manually averaged. Now uses elasticsearch_post with a terms aggregation that returns avg_duration_ms and max_duration_ms per path directly — same spike detection, fewer moving parts. Add _slow_endpoint_query (builds the ES body) and _parse_slow_endpoints (converts buckets to the endpoint dicts report.py expects). No changes to press/mcp tooling. The decorated @press_mcp.tool functions are not called. (cherry picked from commit ac5c04c)
Add get_bench_process_status collector, which calls Bench.supervisorctl_status() (the same agent call the MCP's get_bench_processes tool makes) and returns a list of processes not in Running or Starting state. If the gunicorn web process is Fatal or Stopped, the report now flags it as a direct cause of 502 errors rather than leaving the support agent to discover it through logs. The next-step recommendation is to check web.error.log and recent deployments before restarting — a bare restart without diagnosis will recur. Worker processes that are stopped (but the web process is fine) are surfaced as evidence only, not a cause — stopped background workers cause job failures, not 502s. Two new tests: one for a Fatal gunicorn web process, one confirming that all-running processes produce no process-level cause. (cherry picked from commit eeb8c38)
Semgrep flagged `f == f` as a useless equality check. Switch to the explicit `math` module check and add `import math`. (cherry picked from commit 1a274c8)
Tests call collect_site_context → generate_report with prometheus_get and elasticsearch_post returning controlled payloads. This verifies the full transformation pipeline — Prometheus matrix response → _prom_values → _summarise_series → report cause — rather than constructing the payload dict directly. frappe_mcp is not installed in test environments. The test file stubs it in sys.modules at import time (before any press.mcp submodule is touched by the patch machinery) so the import chain succeeds without errors. Six scenarios covered: CPU spike, flat CPU (no spike), uniformly slow endpoint, spiky endpoint, stopped gunicorn web process, database connectivity error in web.error.log. (cherry picked from commit 7e46af6)
Mypy requires explicit annotations for dicts with heterogeneous nested types that it cannot unambiguously infer. Add `: dict` to _PROM_EMPTY and _ES_EMPTY. (cherry picked from commit 35601d8)
Add an Anthropic API key field to Press Settings (Monitoring section). Add a 'Get AI Analysis' button to the completed investigation form. Clicking it sends the already-redacted payload to claude-sonnet-4-6 via the Anthropic Messages API (plain HTTP, no extra package required) and stores the response in a new ai_response field. The model only ever receives data that has already passed through the redaction pipeline — no raw payloads, no personally identifiable data. The controller validates that the investigation is Completed before allowing the call. (cherry picked from commit f94e4a2)
Add probe_success and probe_http_status_code from the blackbox Prometheus exporter so the report can surface a DOWN probe or a 5xx status code as a direct cause without waiting for log analysis. Strip site.name from the payload copy before building the model prompt. The model does not need the customer site identity to reason about platform signals. (cherry picked from commit bcfd9ad)
Document the site_uptime collector (blackbox probe_success and probe_http_status_code), and replace the forward-looking model extension point section with the actual AI Analysis implementation: what is sent to the model, the two-stage privacy boundary (redact + anonymize), how to configure the API key, and what the model is asked to produce. Also add test_investigation to the verification commands. (cherry picked from commit e673f38)
Extend _anonymise() to remove all platform identity fields from the site and bench sections of the payload, not just the site name. Stripped from site: name, bench, server, database_server, cluster, group. Stripped from bench: name, server, database_server, cluster, candidate, build. The model only needs status flags and metrics to reason about the issue. (cherry picked from commit c6d9741)
Check that anthropic_api_key is set in Press Settings at the start of run_ai_analysis(), before any payload is loaded or sent. Remove the duplicate check from analyse() — validation belongs at the boundary. The deterministic run() method never calls the model. AI analysis is always an explicit manual action via the form button. (cherry picked from commit cf62523)
The timeout message is internal — the support agent knows the context and retry path. Suppress non-actionable-error-message inline. (cherry picked from commit 4fc1085)
feat(support): Collect logs, metrics and process state in investigations (backport #6643)
Gunicorn's stderr is a bench-level file at
benches/{bench}/logs/web.error.log. Reading it via Site.get_server_log
was hitting the site-specific path (benches/{bench}/sites/{site}/logs/)
which either doesn't exist or is empty — so no errors were ever
collected.
This matters most for startup failures (syntax errors in app.py) where
gunicorn workers crash before any request is served, because those errors
land only in the gunicorn log and nowhere site-specific.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit a4b7374)
fix(support): Read web.error.log from bench, not site (backport #6646)
- Rectify mismatching instance types - Add two additional plans present in db but absent in fixture (cherry picked from commit b5ecd80)
(cherry picked from commit 14b1311)
(cherry picked from commit cb5abae)
fix(server-plan-fixtures): Sync server plans from prod (backport #6640)
fix(onboarding): Update perm issues
(cherry picked from commit 8d1e382)
fix(json): Remove trailing comma (backport #6655)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Weekly release PR