From 9352e51bcec4fe323afe37f67014a72b909f281a Mon Sep 17 00:00:00 2001
From: zanejohnson-azure
Date: Thu, 14 May 2026 15:30:56 -0700
Subject: [PATCH] Add multiline-validation skill

Codifies the procedure for validating multi-line log stitching across an ama-logs image change.

The skill drives an A/B comparison: applies a multiline-enabled configmap, deploys the OLD (production) image and captures per-language stitching metrics, deploys the NEW (test) image and re-captures the same metrics, then compares MaxLen and stitched-vs-single ratios per language and OS to detect parser regressions.

Lives next to the existing backdoor-deployment skill under .github/skills/ and is reusable for future fluent-bit upgrades, parser config edits, or output plugin changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/skills/multiline-validation/SKILL.md | 197 +++++++++++++++++++
 1 file changed, 197 insertions(+)
 create mode 100644 .github/skills/multiline-validation/SKILL.md

diff --git a/.github/skills/multiline-validation/SKILL.md b/.github/skills/multiline-validation/SKILL.md
new file mode 100644
index 000000000..6d31a8581
--- /dev/null
+++ b/.github/skills/multiline-validation/SKILL.md
@@ -0,0 +1,197 @@
+---
+name: multiline-validation
+description: "Validate multi-line log stitching behavior for an ama-logs image change. Enables multiline in the configmap, deploys the OLD (production) image, captures stitching baselines, deploys the NEW (test) image, captures the same metrics, and produces an A/B comparison per language and OS. Use when: validating a fluent-bit upgrade, validating a parser/configmap change, comparing multiline stitching between two images, multi-line A/B test, stacktrace stitching test."
+argument-hint: "Provide cluster name, OLD image tag, NEW image tag, and helm release name"
+---
+
+# Multi-line Log Stitching A/B Validation
+
+Validates that an ama-logs image change preserves (or improves) multi-line log stitching behavior across Java, Python, Go, and .NET stack traces on both Linux and Windows. Produces a per-language, per-OS A/B comparison table that shows whether the NEW image produces the same max lengths and stitched-vs-single ratios as the OLD image. Row counts are reported for context but are not required to match.
+
+This skill is **complementary to backdoor-deployment**: that skill validates aggregate data volume and resource consumption; this one validates the multi-line parser pipeline specifically. Run both when an image change can affect log parsing (fluent-bit upgrade, parser config edit, output plugin change).
+
+## Required Inputs
+
+Confirm with the user; suggest defaults from the most recent run if available.
+
+| Input | Description | Example |
+|-------|-------------|---------|
+| **Cluster name** | AKS cluster with Linux + Windows nodepools | `zane-ama-logs-helm-test` |
+| **OLD image tag** | Current production image | `ciprod:3.3.0` (Linux) / `ciprod:win-3.3.0` (Windows) |
+| **NEW image tag** | Test image from CI build | `cidev:3.3.0-6-g1d77401ab-20260506045747` |
+| **Helm release name** | Helm release for ama-logs on the cluster | `azuremonitor-containers` |
+| **Helm release namespace** | Usually `default` for the prod chart | `default` |
+
+## Derived Values
+
+Parse from `charts/azuremonitor-containerinsights-for-prod-clusters/values.yaml`; do not ask the user.
+
+| Value | Source |
+|-------|--------|
+| **Cluster Resource ID** | `OmsAgent.aksResourceID` |
+| **Log Analytics Workspace ID** | `OmsAgent.workspaceID` |
+| **Subscription ID / Resource Group** | Extracted from cluster resource ID |
+
+## General Rules
+
+- Save the output of **each step** to `MultilineValidationOutput.md` in the repo root. Always append; never clear unless explicitly asked.
+- The **configmap is the controlled variable**: apply it once, then leave it alone for the entire run. If the configmap changes between OLD and NEW snapshots, the comparison is invalid and must be redone.
+- Use the **same multiline test job set** for both snapshots. Re-deploy fresh job runs after each image swap so log windows are clean.
+- Wait **at least 12 minutes** after each image deploy before querying ContainerLogV2 (pod restart + ingestion latency).
+- Restore `values.yaml` and remove the test configmap from the cluster at the end (unless the user wants to keep them).
+
+## Procedures
+
+### Apply Multiline Configmap
+
+The skill ships its own configmap so behavior is deterministic. Source: `test/scenario/multiline/container-azm-ms-agentconfig.yaml` if present, otherwise generate inline:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: container-azm-ms-agentconfig
+  namespace: kube-system
+data:
+  log-data-collection-settings: |-
+    [log_collection_settings]
+      [log_collection_settings.stdout]
+        enabled = true
+      [log_collection_settings.stderr]
+        enabled = true
+      [log_collection_settings.enable_multiline_logs]
+        enabled = "true"
+        stacktrace_languages = ["java", "python", "dotnet", "go"]
+```
+
+Apply: `kubectl apply -f `
+
+Restart both daemonsets so the new config takes effect:
+```bash
+kubectl rollout restart ds/ama-logs ds/ama-logs-windows -n kube-system
+kubectl rollout status ds/ama-logs -n kube-system --timeout=180s
+kubectl rollout status ds/ama-logs-windows -n kube-system --timeout=180s
+```
+
+### Deploy Multiline Test Jobs
+
+The repo ships eight job manifests under `test/scenario/multiline/` covering Java, Python, Go, and .NET on both Linux and Windows. Each job emits a mix of single-line app logs and multi-line stack traces in a loop.
+
+```powershell
+kubectl create namespace tenant1 --dry-run=client -o yaml | kubectl apply -f -
+kubectl delete jobs -n tenant1 --all
+Get-ChildItem test/scenario/multiline/*.yaml | ForEach-Object { kubectl apply -f $_.FullName }
+kubectl get jobs -n tenant1
+```
+
+Re-run this block after each image swap so each snapshot has a clean log window.
+
+> **Windows nodepool note**: Windows test pods require an `ltsc2022` nodepool. The shipped yamls use `mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022` and rely on AKS image-OS scheduling; do not add a hard-coded `nodeSelector`.
+
+### Update Image Tags and Deploy
+
+1. Edit `charts/azuremonitor-containerinsights-for-prod-clusters/values.yaml`:
+   - `imageRepository: "/azuremonitor/containerinsights/"` (`ciprod` for OLD, `cidev` for NEW)
+   - `imageTagLinux: `
+   - `imageTagWindows: `
+2. Helm upgrade against the existing release name (do not use `--install` with a different release name; it will fail on owned ServiceAccounts):
+   ```bash
+   helm upgrade ./charts/azuremonitor-containerinsights-for-prod-clusters -n
+   ```
+3. Record deploy time in UTC with `(Get-Date).ToUniversalTime().ToString('yyyy-MM-ddTHH:mm:ssZ')`. Do not rely on `Get-Date -Format 'u'` alone: it appends `Z` to the local time without converting it to UTC.
+4. Wait for rollouts:
+   ```bash
+   kubectl rollout status ds/ama-logs -n kube-system --timeout=180s
+   kubectl rollout status ds/ama-logs-windows -n kube-system --timeout=180s
+   ```
+5. Verify the running image:
+   ```bash
+   kubectl get ds ama-logs -n kube-system -o jsonpath="{range .spec.template.spec.containers[*]}{.name}={.image}{'\n'}{end}"
+   kubectl get ds ama-logs-windows -n kube-system -o jsonpath="{.spec.template.spec.containers[0].image}"
+   ```
+6. **Wait 12 minutes** before querying.
+
+### Query Stitching Metrics
+
+Run the per-language stitching KQL via `az monitor log-analytics query -w `:
+
+```kusto
+ContainerLogV2
+| where TimeGenerated >= datetime('')
+| where _ResourceId =~ ''
+| where PodNamespace == 'tenant1'
+| extend Msg = tostring(LogMessage) // CRITICAL: dynamic to string
+| extend Lines = countof(Msg, '\n') + 1
+| extend OS = iif(ContainerName endswith 'win', 'Win', 'Linux')
+| extend Lang = replace_string(ContainerName, '-win', '')
+| summarize
+    Rows=count(),
+    MaxLen=max(strlen(Msg)),
+    MaxLines=max(Lines),
+    Stitched=countif(Lines>1),
+    Single=countif(Lines==1)
+    by Lang, OS
+| order by Lang asc, OS asc
+```
+
+Save the resulting 8-row table (Lang × OS) to the output file under a clearly labeled section (`### OLD image snapshot` or `### NEW image snapshot`).
+
+### Compare A/B
+
+Build a single side-by-side table with one row per (Lang, OS) and these columns:
+
+| Lang | OS | OLD Rows | OLD Stitched | OLD Single | NEW Rows | NEW Stitched | NEW Single | OLD MaxLen | NEW MaxLen | Verdict |
+
+**Pass criteria** (per row):
+1. `MaxLen` matches exactly between OLD and NEW. A change here means the longest stitched record changed → parser regression.
+2. `Stitched / (Stitched + Single)` ratio matches within ±2% between OLD and NEW. A drop means stitching is failing for some headers.
+3. Absolute `Rows` count is **not** required to match; different snapshot windows naturally produce different totals.
+
+**Failure investigation**: when a row fails, drill into the specific (Lang, OS) by sampling rows and inspecting `LogMessage`. Compare the actual stitched output between OLD and NEW for the same source app log shape. Look for header regex changes, continuation regex changes, or new fluent-bit defaults.
+
+### Cleanup
+
+1. Delete the test namespace: `kubectl delete namespace tenant1 --wait=false`
+2. (Optional) Remove the multiline configmap if the cluster shouldn't keep it: `kubectl delete configmap container-azm-ms-agentconfig -n kube-system`
+3. Restore `values.yaml` placeholders:
+   - `imageRepository: "/azuremonitor/containerinsights/ciprod"`
+   - `imageTagLinux: `
+   - `imageTagWindows: `
+   - Restore any region/cloud placeholders that were swapped during deployment.
+4. Final summary in `MultilineValidationOutput.md`: pass/fail per row, image tags compared, deploy timestamps, and any investigation findings.
+
+## Steps
+
+### Phase 1: Setup (once)
+
+1. Confirm inputs with the user (or use most recent run defaults).
+2. Set kubectl context: `kubectl config use-context `.
+3. Apply the multiline configmap and restart both daemonsets (see "Apply Multiline Configmap").
+4. Verify multiline parsers are engaged inside the Linux pod:
+   ```bash
+   kubectl exec -n kube-system -c ama-logs -- cat /etc/opt/microsoft/docker-cimprov/fluent-bit.conf | grep -i multiline
+   ```
+   Expect a `[FILTER] Name multiline` block with `multiline.parser` listing the configured languages.
+
+### Phase 2: OLD image snapshot
+
+5. Update `values.yaml` to the OLD image and helm-upgrade (see "Update Image Tags and Deploy"). Record OLD deploy time.
+6. Verify pods running and image tag matches expectation.
+7. Deploy / re-deploy the multiline test jobs (see "Deploy Multiline Test Jobs").
+8. Wait 12 minutes.
+9. Run the stitching KQL (see "Query Stitching Metrics"). Save as `### OLD image snapshot`.
+
+### Phase 3: NEW image snapshot
+
+10. Update `values.yaml` to the NEW image and helm-upgrade. Record NEW deploy time.
+11. Verify pods running and image tag matches expectation.
+12. Re-deploy the multiline test jobs to start a clean window.
+13. Wait 12 minutes.
+14. Run the stitching KQL again. Save as `### NEW image snapshot`.
+
+### Phase 4: Compare and report
+
+15. Build the side-by-side comparison table (see "Compare A/B").
+16. Apply the pass criteria. For any failing row, investigate and document.
+17. Cleanup (see "Cleanup").
+18. Write final pass/fail verdict to `MultilineValidationOutput.md`.
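
The per-row pass criteria from "Compare A/B" are mechanical enough to sketch as a small script. The Python below is not part of the skill or the repo: the `Snapshot` class, the `verdict` and `compare` helpers, and every number in the sample data are hypothetical. It also assumes that "within ±2%" means an absolute difference of at most 0.02 in the stitched ratio, which the skill text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One (Lang, OS) row of the stitching query for a single image."""
    rows: int
    max_len: int
    stitched: int
    single: int

    def stitched_ratio(self) -> float:
        # Stitched / (Stitched + Single), as written in the pass criteria.
        return self.stitched / (self.stitched + self.single)


def verdict(old, new, tolerance=0.02):
    """Apply the per-row pass criteria to one (Lang, OS) pair."""
    if old is None or new is None:
        return "FAIL: row missing from one snapshot"
    if old.stitched + old.single == 0 or new.stitched + new.single == 0:
        return "FAIL: no log rows to compare"
    if old.max_len != new.max_len:
        return "FAIL: MaxLen differs"
    # Assumption: +/-2% is read as an absolute difference of 0.02.
    if abs(old.stitched_ratio() - new.stitched_ratio()) > tolerance:
        return "FAIL: stitched ratio differs"
    # Rows is deliberately not compared; snapshot windows differ in size.
    return "PASS"


def compare(old_rows, new_rows, tolerance=0.02):
    """Return {(lang, os): verdict} over the union of both snapshots."""
    keys = sorted(set(old_rows) | set(new_rows))
    return {k: verdict(old_rows.get(k), new_rows.get(k), tolerance) for k in keys}


# Hypothetical sample data, for illustration only.
OLD = {
    ("java", "Linux"): Snapshot(rows=1200, max_len=4096, stitched=300, single=900),
    ("python", "Win"): Snapshot(rows=800, max_len=2048, stitched=200, single=600),
    ("go", "Linux"): Snapshot(rows=100, max_len=1500, stitched=50, single=50),
    ("dotnet", "Win"): Snapshot(rows=400, max_len=3000, stitched=100, single=300),
}
NEW = {
    ("java", "Linux"): Snapshot(rows=1240, max_len=4096, stitched=310, single=930),
    ("python", "Win"): Snapshot(rows=800, max_len=1024, stitched=200, single=600),
    ("go", "Linux"): Snapshot(rows=100, max_len=1500, stitched=40, single=60),
}

for key, result in compare(OLD, NEW).items():
    print(key, result)
```

A missing row is treated as a failure rather than skipped, on the assumption that a language/OS pair that stops producing data is itself worth investigating.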