diff --git a/assets/img/monitoring-container-thread-pools.png b/assets/img/monitoring-container-thread-pools.png
index 6066cc495c..f90c5aa6c5 100644
Binary files a/assets/img/monitoring-container-thread-pools.png and b/assets/img/monitoring-container-thread-pools.png differ
diff --git a/assets/img/monitoring-dashboard.png b/assets/img/monitoring-dashboard.png
new file mode 100644
index 0000000000..94c882bedc
Binary files /dev/null and b/assets/img/monitoring-dashboard.png differ
diff --git a/assets/img/monitoring-health-indicators.png b/assets/img/monitoring-health-indicators.png
index 5186c9f67e..004fa4a87e 100644
Binary files a/assets/img/monitoring-health-indicators.png and b/assets/img/monitoring-health-indicators.png differ
diff --git a/assets/img/monitoring-jvm-memory.png b/assets/img/monitoring-jvm-memory.png
index 1390207d0e..7d8a8c9c63 100644
Binary files a/assets/img/monitoring-jvm-memory.png and b/assets/img/monitoring-jvm-memory.png differ
diff --git a/en/operations/monitoring.html b/en/operations/monitoring.html
index c6c65d9a8e..8077e067dd 100644
--- a/en/operations/monitoring.html
+++ b/en/operations/monitoring.html
@@ -6,7 +6,7 @@
- /en/cloud/monitoring.html
---
-
+
The Vespa Cloud Console has dashboards for insight into performance metrics, use the METRICS tab in the application zone view. @@ -28,43 +28,45 @@
-The dashboard is organized into seven tabs:
+The dashboard is organized into tabs for different purposes:
| Tab | What it shows | When to use it |
|---|---|---|
| Overview | +||
| Overview | Health indicators, request rates, QoS, latency summary, HTTP status codes, resource utilization | Daily health check, first stop during incidents |
| Query | +||
| Query | Container- and content-node query latency, per-rank-profile breakdown, match/docsum executors | Investigating read latency, query quality issues |
| Feed | +||
| Feed | Feed operation rates and latency at each layer, feed blocking | Investigating write latency or throughput issues |
| Nearest Neighbor Search | +||
| Nearest Neighbor Search | NNS distance computations, visit efficiency | -Tuning HNSW parameters (hidden when not in use) |
| Content Node | +Tuning HNSW parameters (hidden when not in use) | |
| Content Node | Document counts, Proton resource usage, executor utilization, maintenance jobs | Deep investigation of search engine internals |
| Resources | +||
| Resources | CPU, memory, disk, GPU, JVM, thread pools | Sizing and scaling decisions |
| Health | +||
| Health | Cluster state, data consistency, restarts, reindexing, resource limits | Stability monitoring, post-incident review |
Filters at the top apply across all tabs:
- Query, Feed, Content Node, Resources, and Health tabs group metrics per cluster — + Query, Feed, + Content Node, Resources, + and Health tabs group metrics per cluster — you see all metrics for one cluster before scrolling to the next. Container metrics are grouped per container cluster, content metrics per content cluster.
@@ -83,7 +85,7 @@
- The Overview tab opens with a dedicated Health Indicators row — - five stat panels designed to surface stability issues in a single glance. - A row of green zeros is the signal to stop; a non-zero value tells you which tab to visit next. + The Overview tab opens with a dedicated Health Indicators row, + organized into three themed sub-rows. A row of green tiles is the signal to stop; + a non-zero value (or low Headroom) tells you which tab to visit next.
+ +| Indicator | What it counts | Healthy value | |
|---|---|---|---|
| Core Dumps (1h) | Core dumps processed across all clusters in the last hour | -0 — any non-zero value is a crash to investigate | 0: any non-zero value is a crash |
| Restarts (1h) | -Vespa service restarts across all clusters in the last hour | +Vespa service restarts across all clusters in the last hour. The underlying
+ sentinel_totalRestarts metric is cumulative since the sentinel
+ started; the "1h" window is computed by the panel via
+ delta(...[1h]) > 0. The > 0 filter discards
+ negative deltas that occur when the sentinel itself restarts and the counter
+ resets (a reset implies a restart happened, but the count within the reset
+ frame is unrecoverable). The Core Dumps (1h)
+ tile uses the same expression. |
0 during steady state; brief spikes are normal during upgrades |
| Feed Blocked | Nodes currently above a feed-block resource limit | -0 — non-zero means writes are being rejected cluster-wide | 0: non-zero means writes are being rejected cluster-wide | + +
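As a rough illustration of why the `> 0` filter in the Restarts (1h) row matters, the Python sketch below mimics a windowed delta on a cumulative counter. It is a simplification: PromQL's `delta()` also extrapolates to the window edges, which is ignored here, and the function name and sample values are invented for the example.

```python
from typing import Optional, Sequence


def windowed_restarts(samples: Sequence[float]) -> Optional[float]:
    """Approximate `delta(counter[1h]) > 0` for a single series.

    `samples` holds the cumulative counter values seen inside the window,
    oldest first. Returns the increase, or None when the series would be
    filtered out (no change, or a negative delta after a counter reset).
    """
    if len(samples) < 2:
        return None  # not enough points to form a delta
    delta = samples[-1] - samples[0]
    return delta if delta > 0 else None


# Steady counter: no restarts in the window, so the series is filtered out.
print(windowed_restarts([7, 7, 7]))
# Two restarts inside the window.
print(windowed_restarts([7, 8, 9]))
# The sentinel itself restarted: the counter reset, the delta is negative and is dropped.
print(windowed_restarts([9, 0, 0]))
```

The third case shows the trade-off the row describes: the reset is discarded rather than misreported as a negative restart count.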
| Indicator | What it counts | Healthy value |
|---|---|---|
| Container: % Nodes Down | +Active container nodes where some service isn't running | +0 during steady state; brief spikes during deployments are expected |
| Content: Groups/Nodes Down | Content groups with at least one node down | 0 during steady state. 1 group down is normal during rolling restarts or maintenance; 2 or more should be investigated |
| Container: Services Down | -Active container nodes where some service isn't running | -0 during steady state; brief spikes during deployments are expected |
+ These tiles surface per-cluster saturation signals — values close to the + threshold mean the corresponding tab needs investigation now, not after the next outage. + The thread saturation tiles only render for the container configuration cases that exist + in your deployment (see Container thread pools below). +
+| Indicator | What it counts | Healthy value |
|---|---|---|
| Headroom to Feed Block (per content cluster) | +Remaining headroom before the feed-block limit, taken as the minimum across memory and disk (1 − usage ÷ limit) | +≥ 10% (green): healthy; 5–10% (orange): plan capacity; < 5% (red): act now; ≤ 0: the cluster is feed-blocked |
| Content Executor Saturation (per content cluster) | +Worst-case utilization across the Proton executors most relevant to latency: match, docsum, field-writer (utilization and saturation) | +< 80% (green); 80–95% orange = queries / feed will start queueing; ≥ 95% red = action needed |
| Container Thread Saturation — search + document-api | +Per container cluster (with both <search> and <document-api>): worst active / size ratio across all JDisc thread pools |
+ < 80% (green); 80–95% orange; ≥ 95% red: search-handler saturation directly degrades query latency |
| Container Thread Saturation — search only | +Same as above, for clusters with only <search> |
+ Same thresholds (80% / 95%): latency-critical |
| Container Thread Saturation — document-api only | +For clusters with only <document-api> |
+ < 90% (green); 90–98% orange; ≥ 98% red: later warning since feed delays don't surface as user-visible query failures |
| JVM Heap Pressure (per container cluster) | +Heap used ÷ heap capacity, averaged across hosts in the cluster | +< 70% (green); 70–85% orange; ≥ 85% red. Lights up before Core Dumps or Restarts do — the leading indicator for OOM/forced-restart risk |
+ Note: Headroom to Feed Block inverts the usual reading — higher is better. + Its underlying metric aggregates across all storage nodes including those in maintenance + or retired state, so headroom can show below 5% on a cluster that isn't actually + feed-blocked. Cross-reference the Feed Blocked tile + (which only counts in-service nodes) for ground truth. +
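The Headroom to Feed Block arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula and colour bands stated in the table; the function names are hypothetical and the usage and limit values in the example are made-up inputs, not values read from a real cluster.

```python
def headroom_to_feed_block(mem_usage: float, mem_limit: float,
                           disk_usage: float, disk_limit: float) -> float:
    """Minimum of (1 - usage / limit) across memory and disk."""
    return min(1.0 - mem_usage / mem_limit, 1.0 - disk_usage / disk_limit)


def headroom_colour(headroom: float) -> str:
    """Map headroom to the tile colour; note that higher is better."""
    if headroom >= 0.10:
        return "green"
    if headroom >= 0.05:
        return "orange"
    return "red"  # covers both < 5% and <= 0


# Illustrative inputs: memory at 0.60 of a 0.80 limit, disk at 0.70 of a 0.75 limit.
h = headroom_to_feed_block(0.60, 0.80, 0.70, 0.75)
print(round(h, 4), headroom_colour(h))
```

In this example disk is the binding resource, so the tile reflects the disk headroom even though memory still has a quarter of its budget left.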
+QoS (Quality of Service) shows the percentage of successful requests. @@ -187,9 +247,9 @@
The Query Quality row shows:
+ These panels split per-query cost across the four phases of a matching query. + An operator triaging high content-side CPU benefits from reading them in pipeline order + rather than panel-by-panel: +
++match → first-phase rank → second-phase rerank → grouping & result construction ++
Cost model per phase:
+docs_matched × per-doc first-phase rank cost. Drive it down with
+ tighter recall (weakAnd, filter operators, nearestNeighbor
+ selectivity).
+total-rerank-count × second-phase expression complexity.
+ Drive it down with cheaper rank features or a smaller
+ total-rerank-count
+ (cluster-wide cap; preferred over the per-node
+ rerank-count,
+ but both remain supported — and rerank-count is still the canonical knob
+ for global-phase).
+ Soft-doom signals are the outcome of all of the above.
+ soft_doomed_queries counts queries that exceeded their soft timeout and
+ returned partial results; soft_doom_factor is an adaptive multiplier
+ (starts at 0.5, ticks ±0.01/0.02 per query depending on whether the query
+ finished under its soft timeout) that Vespa uses to shrink the per-query
+ deadline when queries are consistently overrunning. If soft-doom is firing, drill into
+ setup / rerank / grouping time on the same profile to find the overrunning phase.
+
+ Three of the timing metrics measure something slightly different from what their names + suggest: +
+query_setup_time is computed as
+ total_matching_time − queryLatencyAvg
+ (matcher.cpp:354). It covers both pre-match work
+ (query decode, blueprint build, rank-context setup) and
+ post-match finalisation around the matching phase, not purely
+ pre-matching time.
+grouping_time is computed as
+ query_time − match_time (match_master.cpp:121).
+ Despite the name, it covers everything that happens after matching ends
+ on the content node: grouping operator execution, result construction, and
+ final packing, not just grouping.
+soft_doom_factor is an adaptive multiplier
+ (range 0.01 to ∞, starting at 0.5), not a remaining-budget fraction.
+ Higher values give queries more time budget; lower values mean Vespa has shortened
+ deadlines because queries kept overrunning.
Things to look for:
+
+ The docs_matched rate is mostly a proxy for first-phase ranking
+ work, but a few mechanisms can skip ranking for matched documents:
+ rank-score-drop-limit drops low-scoring docs (still counted as matched),
+ match-phase limiting can cap how many docs reach ranking, and threads
+ hitting soft-timeout mid-loop never rank the remainder of their range.
+
- See Latency tracking below for a worked example, - and the - rank profiles documentation for background. + Each panel's hover tooltip carries impact when high and + investigate hints, and points sideways to the next likely panel to drill + into. See Latency tracking below for a worked example, + the rank profiles documentation + for background, and the + Practical search + performance guide for tuning recipes.
- The Query tab also includes Match Executor and Docsum Executor sub-rows - (queue size + accepted rate) so you can see whether the content-node thread pools - feeding the query and summary paths are saturated. These are not attributable to a - rank profile, but often explain tail-latency spikes that aren't visible in rank-profile metrics. + The Query tab also includes Match Executor and Docsum Executor sub-rows so you can see + whether the content-node thread pools feeding the query and summary paths are saturated. These are + not attributable to a rank profile, but often explain tail-latency spikes that aren't visible in + rank-profile metrics. +
+The Docsum Executor row carries four panels per content cluster:
+| Panel | What it shows | Read it together with |
|---|---|---|
| Docsum executor queue size (max) | +Peak length of the per-node docsum thread-pool queue. Sustained non-zero means tasks + are arriving faster than they can be drained. | +Docsum latency: queue depth and latency rise together when the pool is the bottleneck. | +
| Docsum executor accepted (rate) | +Throughput at the front door: tasks scheduled per second. One task = one summary + document to render. | +Document summaries requested (rate): accepted vs. completed. | +
| Docsum latency | +Avg (steady-state) and max (per-host worst) time to render a summary. Cost grows with
+ summary class size, number of summary fields, and match-features
+ / summary-features that recompute at docsum time. |
+ Queue size: rising latency with rising queue points at executor saturation. | +
| Document summaries requested (rate) | +Throughput at the back door: renderings completed per second. Derived from + the docsum latency sample count over the snapshot interval. | +Docsum executor accepted (rate): sustained accepted > completed lines up with + growing queue depth and rising docsum latency. | +
+ Docsum cost is not attributable to a single rank profile, so investigate the overall
+ query mix — expensive summary classes, large hits counts, or
+ match-features / summary-features lists that force per-hit feature
+ recomputation.
+
+ Docsum reads summary fields from the + document store. When those reads miss + the document-store cache they become disk reads, which surface as + CPU IOWait on the Resources tab — so a high + Document summaries requested (rate) combined with a low + Document Store Cache Hit Rate is the typical cause of IOWait on a + search cluster with no active feed.
@@ -259,12 +435,15 @@Start from the top and find where latency increases. If container feed latency is normal but HTTP write latency is high, the bottleneck is network/payload. If distributor latency is high, check for node state issues in the Health tab. - If storage latency is high, check disk I/O in the Resources tab.
+ If storage latency is high, check the + Persistence Engine row to see whether the bottleneck is + the storage backend queue or the concurrency throttle, and check disk I/O in the Resources tab.
+ Latency and rate panels tell you that feed slowed down; the Persistence Engine row
+ in the Feed tab tells you which layer on the content side is capping throughput.
+ It carries two panels, both sourced from vds.filestor.* metrics on the searchnode.
+
| Panel | Metric | What it reports |
|---|---|---|
| Persistence engine input queue (avg + max) | +vds.filestor.queuesize |
+
+ Count of ops waiting in the per-stripe input queues before a persistence thread picks
+ them up. The metric is the sum across stripes, published via
+ _metrics->queueSize.addValue(getQueueSize())
+ (filestorhandlerimpl.cpp:341).
+ |
+
| Persistence engine throttle saturation | +
+ vds.filestor.active_operations.size vs.
+ vds.filestor.throttle_window_size
+ |
+
+ active is the count of ops currently in-flight (incremented on dispatch,
+ decremented on completion in active_operations_stats.cpp:92-98).
+ throttle window is the current capacity of the
+ SharedOperationThrottler, dynamically adjusted by Proton.
+ |
+
How to read the two panels together:
+
+ There is no equivalent in-flight-ops gauge on the distributor — the distributor exposes
+ per-operation success/failure counts and a memory-usage gauge for active mutating operations
+ (mutating_op_memory_usage), but not a count of pending ops. Storage-side metrics
+ are the canonical place to look for feed-throughput ceiling questions.
+
+ Cluster-level only: both panels group by clusterid and not
+ per-host, because the underlying vds.filestor.*_max metrics have
+ host stripped by adaptive-metrics. The cluster max-over-time still answers
+ the bottleneck question (if cluster-max active sits at cluster-max
+ throttle window, at least one node is throttle-bound), but it does not identify
+ which node. See A note on metric cardinality
+ below.
+
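The throttle reading described above (active operations sitting at the throttle window) reduces to a simple ratio. The Python sketch below is a hedged illustration only: the function names and the 0.95 cutoff are assumptions made for the example, not values taken from the dashboard or from Vespa.

```python
def throttle_saturation(active_ops: float, throttle_window: float) -> float:
    """Ratio of in-flight operations to the current throttle window."""
    if throttle_window <= 0:
        raise ValueError("throttle window must be positive")
    return active_ops / throttle_window


def is_throttle_bound(active_ops: float, throttle_window: float,
                      cutoff: float = 0.95) -> bool:
    """True when active ops sit at (or very near) the throttle window.

    `cutoff` is an assumed illustration value, not a dashboard threshold.
    """
    return throttle_saturation(active_ops, throttle_window) >= cutoff


# Cluster-max values, since the panels cannot be broken down per host.
print(is_throttle_bound(active_ops=98, throttle_window=100))
print(is_throttle_bound(active_ops=20, throttle_window=100))
```

Because the inputs are cluster-level maxima, a positive result says that at least one node is throttle-bound, not which one.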
+ This section applies only to the Vespa Cloud metrics tier. + Vespa Cloud applies an internal cardinality-reduction policy when storing tenant + metrics, which drops certain high-cardinality labels from selected metric families + to keep storage cost manageable. Self-hosted Vespa is not affected — Vespa + itself emits the full label set; the limitations described below are a property of + how Vespa Cloud stores the metrics downstream, not of Vespa. +
+The policy has two observable shapes that affect what dashboard panels can group by:
+host and related
+ node-identity labels) are stripped. Panels using these metrics can only group by
+ clusterid, not per-host.
+For content.proton.documentdb.feeding.commit.latency_* and
+ content.proton.documentdb.index.memory_usage.allocated_bytes_average,
+ the documenttype label is dropped while host is kept.
+ Panels using these metrics can group by host but not by document type.
+ Metrics not covered by the policy ship at full cardinality with all labels intact. The + panels below were chosen so the available labels match what the panel needs to show. + Where a metric family is partially restricted, only the variants that work are graphed, + and each panel description carries a footnote pointing to the dropped label so future + readers know why a breakdown they expect to see isn't there. +
+ +
+ On multi-doctype applications, cluster-aggregate Content: Commit Operations can hide
+ the fact that one doctype is dragging write volume. The Per-Document-Type Feed row in
+ the Feed tab carries the operations rate split by the documenttype dimension,
+ which is inherited from the documentdb metric set
+ (documentdb_tagged_metrics.cpp:253 —
+ MetricSet("documentdb", {{"documenttype", docTypeName}}, ...)).
+
| Panel | Metric | What it reports |
|---|---|---|
| Content: Commit Operations (per document type) (rate) | +content.proton.documentdb.feeding.commit.operations_rate |
+ Number of operations included in commits, per doctype, expressed as ops/s. | +
+ No commit-latency-per-doctype panel: in Vespa Cloud the
+ documenttype label is dropped from
+ content.proton.documentdb.feeding.commit.latency_{sum,count,max}
+ by the cardinality-reduction policy, so the
+ per-doctype split is not available there. The cluster-aggregate view is in
+ Content: Commit Latency (avg) in the row above.
+
+ New writes are first applied to an in-memory index per document type; the
+ memory_index_flush maintenance job periodically flushes that memory index to disk.
+ When the write rate exceeds the flush rate, the memory index grows monotonically, a leading
+ indicator that feed-rate stability is at risk.
+
| Panel | Metric | What it reports |
|---|---|---|
| Memory Index — Documents (per document type) | +content.proton.documentdb.index.docs_in_memory_max |
+ + Count of documents currently in the in-memory index, per doctype. Healthy pattern is a + sawtooth (rises during feed, drops on each flush). Monotonic rise = the flush is the bottleneck. + | +
+ Multi-group correctness: docs_in_memory is emitted per content
+ node and replicated identically across content
+ groups, so the panel uses the
+ same max by(…) (sum by(…, groupId) (…)) pattern as the
+ Content Node document-count panels: inner
+ sum by(clusterid, documenttype, groupId) gives the per-group total of
+ documents pending flush; outer max by(clusterid, documenttype) picks the leading
+ group. The inner aggregation must be sum; in Vespa Cloud this metric is
+ registered with only the sum aggregation, so direct max by(…)
+ on the stored metric is not valid.
+
+ No memory-index-bytes-per-doctype panel: in Vespa Cloud the
+ documenttype label is dropped from
+ content.proton.documentdb.index.memory_usage.allocated_bytes_average
+ by the cardinality-reduction policy. The
+ document-count panel above carries the same leading-indicator signal and is the cleaner
+ one to track anyway.
+
+ Cross-reference with the Maintenance Job Activity panel on the Resources tab: if
+ memory_index_flush activity sits at 1.0 (always flushing) while the in-memory
+ count keeps rising, flushing cannot keep up — add nodes or reduce per-doctype feed rate.
+
+ The distributor tracks per-operation failure subtypes (busy, timeout,
+ storagefailure, etc. — declared in DistributorMetrics.java and
+ populated in persistence_operation_metric_set.cpp:98-118), and the
+ busy subtype is the canonical backpressure signal: it counts operations the
+ distributor dropped because the storage node returned BUSY (typically because the persistence
+ engine's throttle window was full). However, Vespa9VespaMetricSet only exports the
+ aggregated failures.total across all subtypes — the per-subtype counters
+ (including busy) are not currently surfaced in Vespa Cloud.
+ The Distributor Operation — Failures panel therefore aggregates busy together with
+ every other failure mode. If you see total failures rising during heavy feed, the most likely
+ contributor is busy; confirm by looking at the
+ Persistence Engine row — if
+ active is sitting at throttle window, that is what is generating the BUSY
+ responses upstream.
+
@@ -301,21 +668,21 @@
Vespa supports two NNS modes:
approximate-threshold (default 0.02).Key metrics:
@@ -335,11 +702,35 @@
+ Multi-group correctness: each content
+ group holds a full replica
+ of the data, so the underlying content.proton.documentdb.documents.* metrics
+ (emitted per node, partitioned within a group) are repeated identically across groups.
+ Summing the per-node values over the whole cluster would over-count by a factor equal to
+ the number of replica groups. The dashboard panels (Documents,
+ Documents Ready, Documents Active by document type) therefore use the
+ pattern:
+
+max by(clusterid[, documenttype]) (
+ sum by(clusterid[, documenttype], groupId) (
+ content_proton_documentdb_documents_kind_max{...}
+ )
+)
+
+
+ The inner sum by(…, groupId) gives the per-group total; the outer
+ max by(…) picks the leading group. Groups should be equal once
+ converged; while one group lags behind during a deploy or rebalancing,
+ max reports the group that is furthest ahead. Single-group clusters degenerate cleanly: the inner aggregation
+ becomes one series per cluster, and the outer max is the identity.
+
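To make the max-of-sums pattern concrete, here is a small Python sketch that applies the same two aggregation steps to in-memory samples. The tuple layout, cluster name, and document counts are invented for illustration; the real panels run the PromQL expression shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str, int]  # (clusterid, groupId, per-node document count)


def cluster_document_count(samples: Iterable[Sample]) -> Dict[str, int]:
    """max by(clusterid) (sum by(clusterid, groupId) (...)) over node samples."""
    per_group: Dict[Tuple[str, str], int] = defaultdict(int)
    for cluster, group, value in samples:
        per_group[(cluster, group)] += value  # inner sum: per-group total

    per_cluster: Dict[str, int] = {}
    for (cluster, _group), total in per_group.items():
        # outer max: keep the leading group for each cluster
        per_cluster[cluster] = max(per_cluster.get(cluster, total), total)
    return per_cluster


# Two replica groups of two nodes each; group "1" is slightly behind.
samples = [
    ("music", "0", 500), ("music", "0", 500),
    ("music", "1", 500), ("music", "1", 480),
]
print(cluster_document_count(samples))
print(sum(value for _, _, value in samples))  # naive cluster-wide sum, for contrast
```

The naive sum counts every replica group again, which is the over-counting the two-step aggregation avoids.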
@@ -351,12 +742,23 @@
Proton uses several thread pools (executors):
+ Note the semantics: content.proton.executor.<name>.utilization is
+ a time-fraction — the share of the reporting interval the worker
+ threads were busy — not an instantaneous active ÷ size ratio. So
+ utilization 0.80 means "the workers were busy 80% of the time over the last interval",
+ which approximates "8 out of 10 threads busy on average" but doesn't require any
+ particular thread-count instantaneously. The metric is therefore bounded in
+ [0, 1] for single-threaded executors; pools that run multiple parallel tasks per
+ worker can expose saturation as a separate metric (only
+ field_writer does today) which can exceed 1.0 once tasks queue.
+
Typical healthy values:
- The dashboard renders avg as a solid green line and max as a dashed yellow line, + The dashboard renders avg as a solid green line and max as a dashed orange line, making it easy to spot whether the maximum tracks the average or has concerning spikes.
@@ -380,9 +782,9 @@cpu_iowait_pct — host-level
+ metric from the node-admin exporter, not Vespa-emitted)+ The threshold-coloured panels across the dashboard use a consistent + green / orange / red scheme with the breakpoints tuned per metric family. + Below is a quick cross-reference; each panel's tooltip restates the exact values + for its specific signal. +
+| Saturation type | Orange (warning) | Red (action) |
|---|---|---|
search-handler thread pool & queue util | 90% | 95% |
Other thread pools (default-handler, feedapi-handler) | 80% | 90% |
| Content executors (Match / Shared / Field Writer) | 80% | 95% |
| CPU utilization (node) | 70% | 85% |
| Memory utilization (node) | 80% | 90% |
| Memory / Disk vs. feed-block limit (content) | 70% | 80% |
| Disk utilization (node) | 70% | 80% |
| JVM GC overhead | 5% | 15% |
| JVM Heap Pressure (used / capacity, per container cluster) | 70% | 85% |
| Headroom to Feed Block (inverted: low = bad) | 5–10% headroom | < 5% or ≤ 0 |
+ The 90% / 95% thresholds for search-handler match the warning levels
+ the engine itself logs internally
+ (SearchHandler.java, see monitorThreadCount).
+ The live dashboard includes a Dashboard conventions text panel in the Overview
+ Information row that documents the same colour scheme — keep this page in sync
+ with that panel if conventions change.
+
+ IOWait is the share of CPU time spent idle while there is at least one outstanding + disk I/O request. It is a disk signal — network waits do not count. + Two paths drive IOWait on content nodes: +
++ On a search-only cluster with no active feed and persistent IOWait, the usual cause + is the query path: a high Document summaries requested (rate) combined with + a low Document Store Cache Hit Rate. Worth checking together: +
+
+ Mitigations on the query side: shrink summary classes, drop match-features
+ / summary-features that aren't consumed, or grow memory so more of the
+ document store stays cached. On the feed side: spread flushes (more, smaller flushes),
+ add nodes to reduce per-node feed pressure, or move maintenance to off-peak windows.
+
Which thread pools exist on a container depends on which elements are configured @@ -437,23 +908,27 @@
<search> but no feed API<search> but no feed APIClassification is automatic: hidden variables derive the cluster list per case, so only relevant rows render for a given deployment. Each pool gets three panels — Utilization, Work Queue Size, Work Queue Utilization — - with avg as a solid green line and max as a dashed yellow line. + with avg as a solid green line and max as a dashed orange line.
search-handler pool, core size == max size (fixed-size pool),
+ so a value approaching 1.0 directly maps to "all threads busy".
+ Other pools may grow on demand, in which case the ratio drops as the pool
+ expands.
The Resources tab's JVM row separates the three layers of container memory:
@@ -472,6 +947,38 @@
+ When Vespa's automatic estimate of inference memory is wrong (typically: under-estimated
+ for a large local LLM, leading to OOM), cap it explicitly with the
+ <inference><memory>
+ element in services.xml. The value is reserved up-front for both model
+ weights and inference requests.
+
+ The JVM row also includes two GC panels: +
+jvm_gc_overhead_max (Micrometer's JvmGcMetrics). Note this
+ is CPU time, not wall-clock; on an oversubscribed host the two can diverge.
+jvm_gc_pause_{sum, count, max}.
+ This is what directly translates to user-visible latency spikes when GC pauses
+ are long. Distinct from GC Overhead, which can be low even when individual
+ pauses are problematically long (and vice versa).
+ Average and peak HTTP requests served per TCP/HTTP connection over its lifetime, + rendered in the Resources tab's Network sub-row alongside Open Server Connections + and Network Throughput. High values indicate HTTP keep-alive is working: + clients reuse connections and avoid the TCP/TLS handshake cost on each request. + A value near 1 means connections close after every request, often due to client-side + configuration or short-lived clients. The metric is sampled when each connection closes + and cannot be split by read vs. write, because a single connection can serve mixed traffic. +
@@ -501,9 +1008,9 @@
Both signals surface in three complementary ways: as per-cluster time series on this tab