Skip to content

chore: Release release/2026-W24#6635

Open
github-actions[bot] wants to merge 468 commits into
masterfrom
release/2026-W24
Open

chore: Release release/2026-W24#6635
github-actions[bot] wants to merge 468 commits into
masterfrom
release/2026-W24

Conversation

@github-actions

@github-actions github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Weekly release PR

prathameshkurunkar7 and others added 30 commits May 29, 2026 21:53
…ermission-issue

fix(auditor): Perm checks ignore when creating from release
Adds `Incident Settings Night Shift` child doctype to assign specific
users per day-of-week for night hours. During night (outside DAY_HOURS),
if a shift is defined for today, those users replace the default list.
Repeat calls use round-robin ordering (next after last acknowledged)
instead of calling the same person again. Falls back to default users
when no shift is defined for today or shift users are absent from the
main users table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds NIGHT_SHIFT_CALL_LIMIT (20). Once that many call attempts have
been made (tracked via the updates child table) without resolving the
incident, get_humans() skips night shift and falls back to the full
default user list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test was failing with AuthenticationError because running the
investigator foreground caused it to call get_prometheus_client()
without credentials. The underlying issue: running investigation
foreground also completed the investigator immediately, making
waited_enough_for_investigator_reactions return True before the first
resolve_incidents() check.

Fix: mock the investigator's frappe.enqueue_doc to a no-op so
investigate() is never queued, keeping the investigator in
Investigating state for the first resolve_incidents() check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Also extract call-per-human logic into _attempt_call_human to stay
within ruff C901 complexity limit.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…e-in-marketplace

fix(marketplace): Add 'Press User' role permissions
fix(deploy-ui): Show pre-build errors
fix: Guard add_resource and sync against cross-team docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the latest site update is cancelled, fetch the press notification
and display its message as a warning banner at the top of the updates list.
Uses a new extraResource hook in ObjectList to load secondary data lazily.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain what happened and how to fix it in the backup restore,
app changes, and login-as-admin error toasts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@copilot resolve the merge conflicts in this pull request

Resolved all merge conflicts. Here's a summary of what was done:

  • dashboard/src/components/group/UpdateReleaseGroupDialog.vue – Kept the new patchDeploy resource (routes to Deploy Candidate) alongside the existing deploy resource (routes to Release Pipeline).
  • dashboard/src/components/navigation/sidebar/NavList.vue – Kept the activePartner-based routing for partner onboarding (redirects to /partner-onboarding for non-active partners), and preserved the condition: Boolean($team.doc.erpnext_partner) guard from master.
  • press/api/site.py – Took master's refactored function names (_validate_warranty_change, _is_free_dedicated_plan_allowed) and cleaner validate_plan structure, while combining HEAD's null plan_name guard to prevent crashes on sites without an active plan.
  • press/press/doctype/marketplace_app/test_marketplace_app.py – Kept new test methods and AppSource imports added in the release branch.
  • press/press/doctype/server/server.py – Kept the reason=reason parameter in frappe.enqueue_doc.
  • press/press/doctype/team/team.py – Kept the null check for sub.plan_type and sub.plan to avoid errors on incomplete subscriptions.

Commit: fix: resolve merge conflicts with origin/master

Copilot AI requested a review from saurabh6790 June 8, 2026 06:42
@codecov-commenter

codecov-commenter commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 41.01363% with 1385 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.33%. Comparing base (f242166) to head (e5d2e07).

Files with missing lines Patch % Lines
press/mcp/tools/documents.py 0.00% 307 Missing ⚠️
dashboard/src/objects/site.js 27.20% 205 Missing and 1 partial ⚠️
press/mcp/tools/actions/server.py 31.42% 144 Missing ⚠️
press/mcp/tools/actions/bench.py 28.77% 99 Missing ⚠️
press/mcp/tools/routes.py 0.00% 95 Missing ⚠️
...agent_investigation/support_agent_investigation.py 0.00% 94 Missing ⚠️
press/incident_management/support_agent/report.py 76.20% 64 Missing ⚠️
press/mcp/tools/codebase.py 0.00% 58 Missing ⚠️
dashboard/src/components/ObjectList.vue 54.54% 54 Missing and 1 partial ⚠️
...ss/incident_management/support_agent/collectors.py 78.18% 53 Missing ⚠️
... and 15 more

❌ Your patch status has failed because the patch coverage (41.01%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6635      +/-   ##
==========================================
+ Coverage   49.97%   50.33%   +0.36%     
==========================================
  Files         955      993      +38     
  Lines       79069    83514    +4445     
  Branches      374      523     +149     
==========================================
+ Hits        39511    42037    +2526     
- Misses      39532    41445    +1913     
- Partials       26       32       +6     
Flag Coverage Δ
dashboard 62.79% <37.69%> (+2.46%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

balamurali27 and others added 26 commits June 9, 2026 09:54
Previously, the investigation report for 500 and 502 errors would tell
the support agent to manually open web.error.log via the log browser.
This adds a collector that fetches and parses the log automatically.

The collector:
- Fetches web.error.log via the existing site.get_server_log() agent call.
- Parses gunicorn-format entries with a compiled regex, grouping each
  ERROR/CRITICAL line with the trailing traceback lines that follow it.
- Captures only the final exception message line from each traceback —
  not the full stack frames with local variables, which could carry
  personal information.
- Runs all entries through redact() before storing.
- Scans the last 500 lines and returns at most 10 error blocks.

The report generator classifies the collected errors into three patterns:
- OperationalError / "can't connect" → database connectivity failure.
- ImportError / ModuleNotFoundError → broken app state after deployment.
- CRITICAL level entries → worker crash or timeout.
- Anything else → generic exception, with the message surfaced.

The spec Non-Goals are updated to allow redacted exception message lines
(the final line of a traceback) as structured data while continuing to
exclude raw stack frames with local variables.

(cherry picked from commit 756bccf)
Extends get_site_performance_summary with anomaly detection and
custom-app identification to give 504 investigations more specific
signal before falling back to a generic "use Recorder" recommendation.

Changes:
- Fetches up to 20 endpoints (was 5) so custom-app paths are not
  crowded out by core Frappe endpoints in the ranking.
- Adds spike_detected per endpoint: peak >= 3x mean AND peak > 2 s.
  This surfaces endpoints that are occasionally very slow (e.g. a
  specific document type or a scheduled-job-triggered slow query) even
  when the 24-hour average is below the 1 s threshold.
- Adds is_custom per endpoint: extracts the Python module name from
  /api/method/<module>.* paths and checks whether that module belongs
  to a non-Frappe app. App origin is determined by repository_owner on
  the AppSource record; anything other than "frappe" is custom.
- Adds has_custom_apps to the summary so the report knows whether the
  bench has any non-Frappe apps at all.

The bench name is now passed from collect_site_context so the
app-source lookup can happen without an extra site query.

(cherry picked from commit 3a54fbd)
Splits _add_performance_evidence into three focused functions:
- _add_slow_endpoint_evidence: handles consistently-slow endpoints.
  When the slow endpoint belongs to a non-Frappe app it changes the
  cause to "custom app endpoints are slow — application-level" instead
  of the generic "web workers" cause.
- _add_spiky_endpoint_evidence: handles endpoints where peak >= 3x mean
  and peak > 2 s, adding evidence and suggesting Recorder to capture
  the specific triggering request.
- _add_performance_evidence: computes the slow/spiky lists and dispatches.

Both conditions are checked independently so an endpoint can be both
consistently slow and spiky (e.g. always 2 s but sometimes 30 s).

Adds four new tests:
- test_500_worker_timeout_in_web_log_flags_critical
- test_504_custom_app_endpoint_flagged_as_application_level
- test_504_spiky_endpoint_flagged_even_with_low_average
- test_504_frappe_endpoint_slow_flags_web_workers

(cherry picked from commit c94adf8)
Updates the 504 section to document the new endpoint analysis behavior:
app origin detection via AppSource.repository_owner, spike detection
(peak >= 3x mean and peak > 2 s), and how the report adjusts its cause
and next steps based on whether the slow endpoint is from a custom app.

Also updates the Collectors section to describe the enhanced
site_performance_summary fields: is_custom, spike_detected,
has_custom_apps, and the expanded 20-endpoint fetch window.

(cherry picked from commit 2e0716d)
Replace the two external API calls in the investigation collectors with
the shared HTTP clients from press/mcp:

- get_server_metrics: was calling press.api.server.prometheus_query, a
  whitelisted API wrapper that also does timezone conversion and label
  alignment we don't need. Now uses prometheus_get from
  press.mcp.tools.telemetry.clients directly. Add _prom_params (builds
  the query_range param dict) and _prom_values (extracts a flat list of
  floats from the Prometheus matrix response). Simplify _summarise_series
  to take list[float] instead of the datasets dict that prometheus_query
  returned.

- get_site_performance_summary: was calling get_request_by_ from
  press.api.analytics, which returns per-time-bucket datasets that we
  then manually averaged. Now uses elasticsearch_post with a terms
  aggregation that returns avg_duration_ms and max_duration_ms per path
  directly — same spike detection, fewer moving parts. Add
  _slow_endpoint_query (builds the ES body) and _parse_slow_endpoints
  (converts buckets to the endpoint dicts report.py expects).

No changes to press/mcp tooling. The decorated @press_mcp.tool functions
are not called.

(cherry picked from commit ac5c04c)
Add get_bench_process_status collector, which calls
Bench.supervisorctl_status() (the same agent call the MCP's
get_bench_processes tool makes) and returns a list of processes not in
Running or Starting state.

If the gunicorn web process is Fatal or Stopped, the report now flags it
as a direct cause of 502 errors rather than leaving the support agent
to discover it through logs. The next-step recommendation is to check
web.error.log and recent deployments before restarting — a bare restart
without diagnosis will recur.

Worker processes that are stopped (but the web process is fine) are
surfaced as evidence only, not a cause — stopped background workers
cause job failures, not 502s.

Two new tests: one for a Fatal gunicorn web process, one confirming
that all-running processes produce no process-level cause.

(cherry picked from commit eeb8c38)
Semgrep flagged `f == f` as a useless equality check. Switch to the
explicit `math` module check and add `import math`.

(cherry picked from commit 1a274c8)
Tests call collect_site_context → generate_report with prometheus_get
and elasticsearch_post returning controlled payloads. This verifies
the full transformation pipeline — Prometheus matrix response →
_prom_values → _summarise_series → report cause — rather than
constructing the payload dict directly.

frappe_mcp is not installed in test environments. The test file stubs
it in sys.modules at import time (before any press.mcp submodule is
touched by the patch machinery) so the import chain succeeds without
errors.

Six scenarios covered: CPU spike, flat CPU (no spike), uniformly slow
endpoint, spiky endpoint, stopped gunicorn web process, database
connectivity error in web.error.log.

(cherry picked from commit 7e46af6)
Mypy requires explicit annotations for dicts with heterogeneous nested
types that it cannot unambiguously infer. Add `: dict` to _PROM_EMPTY
and _ES_EMPTY.

(cherry picked from commit 35601d8)
Add an Anthropic API key field to Press Settings (Monitoring section).
Add a 'Get AI Analysis' button to the completed investigation form.
Clicking it sends the already-redacted payload to claude-sonnet-4-6
via the Anthropic Messages API (plain HTTP, no extra package required)
and stores the response in a new ai_response field.

The model only ever receives data that has already passed through the
redaction pipeline — no raw payloads, no personally identifiable data.
The controller validates that the investigation is Completed before
allowing the call.

(cherry picked from commit f94e4a2)
Add probe_success and probe_http_status_code from the blackbox
Prometheus exporter so the report can surface a DOWN probe or a 5xx
status code as a direct cause without waiting for log analysis.

Strip site.name from the payload copy before building the model prompt.
The model does not need the customer site identity to reason about
platform signals.

(cherry picked from commit bcfd9ad)
Document the site_uptime collector (blackbox probe_success and
probe_http_status_code), and replace the forward-looking model extension
point section with the actual AI Analysis implementation: what is sent
to the model, the two-stage privacy boundary (redact + anonymize), how
to configure the API key, and what the model is asked to produce.

Also add test_investigation to the verification commands.

(cherry picked from commit e673f38)
Extend _anonymise() to remove all platform identity fields from the
site and bench sections of the payload, not just the site name.

Stripped from site: name, bench, server, database_server, cluster, group.
Stripped from bench: name, server, database_server, cluster, candidate, build.

The model only needs status flags and metrics to reason about the issue.

(cherry picked from commit c6d9741)
Check that anthropic_api_key is set in Press Settings at the start of
run_ai_analysis(), before any payload is loaded or sent. Remove the
duplicate check from analyse() — validation belongs at the boundary.

The deterministic run() method never calls the model. AI analysis is
always an explicit manual action via the form button.

(cherry picked from commit cf62523)
The timeout message is internal — the support agent knows the context
and retry path. Suppress non-actionable-error-message inline.

(cherry picked from commit 4fc1085)
feat(support): Collect logs, metrics and process state in investigations (backport #6643)
Gunicorn's stderr is a bench-level file at
benches/{bench}/logs/web.error.log. Reading it via Site.get_server_log
was hitting the site-specific path (benches/{bench}/sites/{site}/logs/)
which either doesn't exist or is empty — so no errors were ever
collected.

This matters most for startup failures (syntax errors in app.py) where
gunicorn workers crash before any request is served, because those errors
land only in the gunicorn log and nowhere site-specific.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit a4b7374)
fix(support): Read web.error.log from bench, not site (backport #6646)
- Rectify mismatching instance types
- Add two additional plans present in db but absent in fixture

(cherry picked from commit b5ecd80)
fix(server-plan-fixtures): Sync server plans from prod (backport #6640)
(cherry picked from commit 8d1e382)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.