
Cluster diagnostics reference

Overview

GET /api/diagnostics/cluster returns a real-time snapshot of the Agentweaver Kubernetes cluster: component health, namespace quota, active and orphaned agent-host pods, and subtasks waiting for capacity.

This endpoint is only available in AKS deployments. Non-AKS deployments return 404 Not Found.

For the user-facing guide, see Cluster page. For the API endpoint table, see API reference → Workspace, diagnostics, and metrics.

Authentication

Standard bearer-token authentication is required. See API reference → Authentication.
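
As a minimal sketch, the request could be built with Python's standard library as follows. The host name and token are placeholders, and the `Authorization: Bearer <token>` header is the conventional bearer-token form, assumed here rather than taken from the Authentication reference.

```python
import urllib.request


def build_diagnostics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the cluster diagnostics endpoint."""
    url = base_url.rstrip("/") + "/api/diagnostics/cluster"
    return urllib.request.Request(
        url,
        method="GET",
        headers={
            # Assumed header format: the conventional bearer-token scheme.
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Hypothetical host and token, for illustration only.
request = build_diagnostics_request("https://agentweaver.example.com/", "example-token")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without a live deployment.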

Response — ClusterDiagnosticsDto

200 OK (application/json)

json
{
  "component_health": [
    {
      "name": "postgres",
      "status": "pass",
      "detail": null,
      "duration_ms": 12
    },
    {
      "name": "github_installation_token",
      "status": "pass",
      "detail": null,
      "duration_ms": 230
    },
    {
      "name": "key_vault",
      "status": "pass",
      "detail": null,
      "duration_ms": 45
    },
    {
      "name": "agent_pod_quota",
      "status": "warn",
      "detail": "CPU headroom: 1.2 cores (threshold: 2 cores)",
      "duration_ms": 38
    },
    {
      "name": "warm_pool",
      "status": "pass",
      "detail": null,
      "duration_ms": 22
    },
    {
      "name": "kubernetes_api",
      "status": "pass",
      "detail": null,
      "duration_ms": 8
    }
  ],
  "namespace_quota": {
    "cpu_used": 3.8,
    "cpu_total": 5.0,
    "memory_used_gi": 6.4,
    "memory_total_gi": 10.0
  },
  "active_agent_pods": [
    {
      "pod_name": "agent-host-abc123",
      "run_id": "f36800fd-f2f8-418c-958e-aae3e4921ba6",
      "node": "katapool-vm-nodepool1-12345678-0",
      "started_at": "2026-06-27T17:55:00Z"
    }
  ],
  "orphaned_agent_pods": [],
  "pending_capacity_runs": [
    {
      "coordinator_run_id": "coord-abc123-...",
      "subtask_id": 7,
      "pending_since": "2026-06-27T17:58:30Z",
      "retry_count": 3
    }
  ],
  "warm_pools": [
    {
      "name": "agentweaver-sandbox",
      "desired_replicas": 3,
      "ready_replicas": 3,
      "available_replicas": 3,
      "status": "healthy",
      "age_seconds": 86400
    }
  ],
  "sandbox_objects": [
    {
      "name": "sandbox-abc123",
      "phase": "standby",
      "ready": true,
      "pod_name": "sandbox-abc123-pod",
      "template_ref": "agentweaver-sandbox-template",
      "warm_pool": "agentweaver-sandbox",
      "age_seconds": 3600
    }
  ],
  "sandbox_claims": [
    {
      "name": "sandboxclaim-xyz789",
      "phase": "bound",
      "ready": true,
      "run_id": "f36800fd-f2f8-418c-958e-aae3e4921ba6",
      "bound_sandbox": "sandbox-abc123",
      "warm_pool": "agentweaver-sandbox",
      "age_seconds": 120
    }
  ]
}

404 Not Found — Cluster diagnostics are not available (non-AKS deployment).

Fields

Top-level

| Field | Type | Description |
| --- | --- | --- |
| component_health | ComponentHealthDto[] | Results of 6 concurrent health checks. Each check has a 5-second timeout. |
| namespace_quota | NamespaceQuotaDto | Current CPU and memory consumption vs. the namespace limits. null if quota could not be read. |
| active_agent_pods | AgentPodInfoDto[] | Agent-host pods with a matching active run record. |
| orphaned_agent_pods | AgentPodInfoDto[] | Agent-host pods with no matching active run (candidates for the next reaper sweep). |
| pending_capacity_runs | PendingCapacityRunDto[] | Coordinator subtasks currently in PendingCapacity status. |
| warm_pools | WarmPoolStatusDto[] | All SandboxWarmPool CRD objects in the namespace. Empty when the cluster has no warm pools configured. |
| sandbox_objects | SandboxObjectDto[] | All Sandbox objects in the namespace, both warm-pool-managed and per-run ad-hoc sandboxes. |
| sandbox_claims | SandboxClaimObjectDto[] | All SandboxClaim objects in the namespace. |

ComponentHealthDto

| Field | Type | Description |
| --- | --- | --- |
| name | string | Check identifier. See table below for all check names. |
| status | string | "pass", "warn", or "fail". |
| detail | string or null | Human-readable explanation of a warn or fail; null on pass. |
| duration_ms | number | Wall-clock time the check took, in milliseconds. Capped at 5000 for timed-out checks. |
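
Because a 200 response can still carry warn or fail checks, a client usually needs to reduce the list to one overall verdict. The following is a minimal client-side sketch; the helper names and the severity ordering (pass < warn < fail) are assumptions, not part of the API.

```python
# Assumed ordering of the three documented status values, least to most severe.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def worst_status(checks):
    """Return the most severe status in the list, or "pass" for an empty list."""
    return max((c["status"] for c in checks), key=lambda s: SEVERITY[s], default="pass")


def problem_checks(checks):
    """Return (name, detail) pairs for every check that did not pass."""
    return [(c["name"], c["detail"]) for c in checks if c["status"] != "pass"]


# Subset of the example response above.
component_health = [
    {"name": "key_vault", "status": "pass", "detail": None, "duration_ms": 45},
    {
        "name": "agent_pod_quota",
        "status": "warn",
        "detail": "CPU headroom: 1.2 cores (threshold: 2 cores)",
        "duration_ms": 38,
    },
    {"name": "kubernetes_api", "status": "pass", "detail": None, "duration_ms": 8},
]

print(worst_status(component_health))
print(problem_checks(component_health))
```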

Health check names

| name | What it tests |
| --- | --- |
| postgres | Postgres connectivity |
| github_installation_token | GitHub token-store validity for the configured scope |
| key_vault | Azure Key Vault reachability and lookup of the required mcp-oauth-signing-key secret. A detail of "critical: secret 'mcp-oauth-signing-key' not found" means scripts/aks/16-provision-oauth-signing-key.sh was skipped. |
| agent_pod_quota | CPU headroom ≥ 2 cores in the sandbox namespace |
| warm_pool | Warm-pool agent-sandbox availability for both pools: generic agentweaver-sandbox (replicas: 3) and AgentHost agentweaver-agent-host (replicas: 2) |
| kubernetes_api | Kubernetes API server reachability |

NamespaceQuotaDto

| Field | Type | Description |
| --- | --- | --- |
| cpu_used | number | CPU consumed in the namespace, in cores. |
| cpu_total | number | Namespace CPU limit, in cores. |
| memory_used_gi | number | Memory consumed in the namespace, in GiB. |
| memory_total_gi | number | Namespace memory limit, in GiB. |
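
The raw values can be turned into headroom and utilization figures on the client. This is a sketch only: the function name and output keys are hypothetical, and it handles the null case that the top-level table describes.

```python
def quota_summary(quota):
    """Derive headroom and utilization from a NamespaceQuotaDto dict (or None)."""
    if quota is None:
        # namespace_quota is null when the quota could not be read.
        return None

    def utilization(used, total):
        # Guard against a zero limit rather than dividing by zero.
        return round(used / total, 3) if total else None

    return {
        "cpu_headroom": round(quota["cpu_total"] - quota["cpu_used"], 2),
        "cpu_utilization": utilization(quota["cpu_used"], quota["cpu_total"]),
        "memory_headroom_gi": round(quota["memory_total_gi"] - quota["memory_used_gi"], 2),
        "memory_utilization": utilization(quota["memory_used_gi"], quota["memory_total_gi"]),
    }


# Values from the example response above.
summary = quota_summary(
    {"cpu_used": 3.8, "cpu_total": 5.0, "memory_used_gi": 6.4, "memory_total_gi": 10.0}
)
print(summary)
```

With the example values the CPU headroom works out to 1.2 cores, which matches the detail string reported by the agent_pod_quota check in the same response.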

AgentPodInfoDto

Appears in both active_agent_pods and orphaned_agent_pods.

| Field | Type | Description |
| --- | --- | --- |
| pod_name | string | Kubernetes pod name. |
| run_id | string or null | The run ID the pod is serving. null for orphaned pods whose run cannot be identified. |
| node | string | Kubernetes node the pod is running on. |
| started_at | string (ISO 8601) | Pod creation timestamp. |
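
A pod's age can be derived from started_at. The sketch below assumes timestamps always carry a trailing "Z" (UTC), as in the example response; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as "2026-06-27T17:55:00Z"."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pod_age_seconds(pod: dict, now: datetime) -> float:
    """Seconds elapsed between the pod's creation and `now`."""
    return (now - parse_timestamp(pod["started_at"])).total_seconds()


# Pod from the example response; `now` is fixed so the result is reproducible.
pod = {
    "pod_name": "agent-host-abc123",
    "run_id": "f36800fd-f2f8-418c-958e-aae3e4921ba6",
    "node": "katapool-vm-nodepool1-12345678-0",
    "started_at": "2026-06-27T17:55:00Z",
}
now = datetime(2026, 6, 27, 18, 0, 0, tzinfo=timezone.utc)
print(pod_age_seconds(pod, now))  # 300.0
```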

PendingCapacityRunDto

| Field | Type | Description |
| --- | --- | --- |
| coordinator_run_id | string | The coordinator run whose subtask is waiting. |
| subtask_id | number | The subtask identifier within the work plan. |
| pending_since | string (ISO 8601) | When the subtask first entered PendingCapacity status. |
| retry_count | number | How many dispatch retries have been attempted. Max is 10; the subtask fails with capacity_unavailable after 10 retries. |
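
Since a subtask fails after 10 retries, a monitoring script might flag subtasks that are close to the limit. In this sketch the function names, the `margin` parameter, and the second sample entry are hypothetical; only the limit of 10 comes from the table above.

```python
MAX_DISPATCH_RETRIES = 10  # documented maximum before capacity_unavailable


def retries_remaining(run: dict) -> int:
    """Dispatch retries left before the subtask fails."""
    return max(0, MAX_DISPATCH_RETRIES - run["retry_count"])


def near_retry_limit(runs, margin=2):
    """Return the runs with `margin` or fewer retries left (margin is an arbitrary choice)."""
    return [r for r in runs if retries_remaining(r) <= margin]


pending_capacity_runs = [
    # Entry from the example response.
    {"coordinator_run_id": "coord-abc123-...", "subtask_id": 7,
     "pending_since": "2026-06-27T17:58:30Z", "retry_count": 3},
    # Hypothetical entry, added to show the filter.
    {"coordinator_run_id": "coord-example", "subtask_id": 2,
     "pending_since": "2026-06-27T17:40:00Z", "retry_count": 9},
]

print([retries_remaining(r) for r in pending_capacity_runs])
print([r["subtask_id"] for r in near_retry_limit(pending_capacity_runs)])
```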

WarmPoolStatusDto

One entry per SandboxWarmPool CRD object in the namespace.

| Field | Type | Description |
| --- | --- | --- |
| name | string | Kubernetes name of the SandboxWarmPool object. |
| desired_replicas | number | Target number of pre-warmed sandbox pods declared in the CRD spec. |
| ready_replicas | number | Sandbox pods that are ready to accept a claim. |
| available_replicas | number | Sandbox pods that are available (ready and not currently claimed). |
| status | string | "healthy" when ready_replicas == desired_replicas; "warning" when some replicas are ready but below desired; "critical" when no replicas are ready. |
| age_seconds | number or null | Age of the CRD object in seconds. Omitted if unavailable. |
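
The status rule can be restated as a small function, for example to cross-check a response in a test. This is an illustration of the documented rule, not the service's implementation. Two cases the rule does not cover are resolved by assumption here: more ready than desired replicas is treated as "healthy", and zero ready replicas is always "critical", even when zero are desired.

```python
def warm_pool_status(desired_replicas: int, ready_replicas: int) -> str:
    """Restate the documented status rule for a warm pool."""
    if ready_replicas == 0:
        # Assumption: "critical" even if desired_replicas is also 0.
        return "critical"
    if ready_replicas < desired_replicas:
        return "warning"
    # Covers ready == desired; ready > desired is assumed to be healthy too.
    return "healthy"


for desired, ready in [(3, 3), (3, 1), (3, 0)]:
    print(desired, ready, warm_pool_status(desired, ready))
```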

SandboxObjectDto

One entry per Sandbox object in the namespace. Covers both warm-pool-managed sandboxes and ad-hoc per-run sandboxes.

| Field | Type | Description |
| --- | --- | --- |
| name | string | Kubernetes name of the Sandbox object. |
| phase | string | "running", "pending", "standby", or "unknown". Standby means the sandbox is pre-warmed and waiting for a claim. |
| ready | boolean | Whether the sandbox pod is ready. |
| pod_name | string or null | Name of the underlying pod. Omitted if not yet scheduled. |
| template_ref | string or null | Name of the SandboxTemplate used to create this sandbox. Omitted if not available. |
| warm_pool | string or null | Name of the SandboxWarmPool that owns this sandbox. null for ad-hoc per-run sandboxes. |
| age_seconds | number or null | Age of the Sandbox object in seconds. Omitted if unavailable. |

SandboxClaimObjectDto

One entry per SandboxClaim object in the namespace.

| Field | Type | Description |
| --- | --- | --- |
| name | string | Kubernetes name of the SandboxClaim object. |
| phase | string | "bound" when assigned to a sandbox, "pending" when waiting for a matching sandbox, or "unknown". |
| ready | boolean | Whether the claimed sandbox is ready. |
| run_id | string or null | The run that created this claim. Omitted if not traceable. |
| bound_sandbox | string or null | Name of the Sandbox object this claim is bound to. null when still pending. |
| sandbox_template_ref | string or null | SandboxTemplate requested by this claim. Omitted if not specified. |
| warm_pool | string or null | Name of the SandboxWarmPool the bound sandbox belongs to. null for ad-hoc claims. |
| age_seconds | number or null | Age of the SandboxClaim object in seconds. Omitted if unavailable. |
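
Claims reference sandboxes by name through bound_sandbox, so the two lists can be joined on the client to trace a run to its pod. In this sketch the function name and output shape are hypothetical. It reads optional fields with `.get()` because several are documented as omitted when unavailable.

```python
def resolve_claims(diagnostics: dict) -> list:
    """Join each SandboxClaim to the Sandbox object it is bound to, if any."""
    sandboxes = {s["name"]: s for s in diagnostics.get("sandbox_objects", [])}
    resolved = []
    for claim in diagnostics.get("sandbox_claims", []):
        # bound_sandbox is null for pending claims, so the lookup may miss.
        sandbox = sandboxes.get(claim.get("bound_sandbox"))
        resolved.append({
            "claim": claim["name"],
            "run_id": claim.get("run_id"),
            "sandbox": sandbox["name"] if sandbox else None,
            "pod_name": sandbox.get("pod_name") if sandbox else None,
        })
    return resolved


# Trimmed copy of the example response above.
diagnostics = {
    "sandbox_objects": [
        {"name": "sandbox-abc123", "phase": "standby", "ready": True,
         "pod_name": "sandbox-abc123-pod", "warm_pool": "agentweaver-sandbox"},
    ],
    "sandbox_claims": [
        {"name": "sandboxclaim-xyz789", "phase": "bound", "ready": True,
         "run_id": "f36800fd-f2f8-418c-958e-aae3e4921ba6",
         "bound_sandbox": "sandbox-abc123", "warm_pool": "agentweaver-sandbox"},
    ],
}

resolved = resolve_claims(diagnostics)
print(resolved)
```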

Status codes

| Status | Condition |
| --- | --- |
| 200 OK | Cluster diagnostics returned successfully. Individual checks may still be warn or fail. |
| 401 Unauthorized | Missing or invalid bearer token. |
| 404 Not Found | Cluster diagnostics endpoint not available (non-AKS deployment). |
| 500 Internal Server Error | Unexpected error reading cluster state. |
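
One way a client might map these codes to behavior is sketched below. The exception class and the choice to return None for 404 are assumptions about client design, not something the API prescribes.

```python
import json


class DiagnosticsError(Exception):
    """Raised when diagnostics cannot be retrieved (hypothetical client-side type)."""


def handle_response(status_code: int, body: str):
    """Translate an HTTP status and body into a parsed payload, None, or an error."""
    if status_code == 200:
        # Success only means the snapshot was produced; checks may still warn or fail.
        return json.loads(body)
    if status_code == 404:
        # Non-AKS deployment: the endpoint is not available.
        return None
    if status_code == 401:
        raise DiagnosticsError("missing or invalid bearer token")
    raise DiagnosticsError(f"unexpected status {status_code}")


payload = handle_response(200, '{"component_health": [], "orphaned_agent_pods": []}')
print(payload["component_health"])
print(handle_response(404, ""))
```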

Notes

  • All 6 component health checks run concurrently. The total response time is bounded by the slowest single check (5-second timeout), not the sum.
  • The agent_pod_quota check and the namespace_quota DTO are computed separately: the check reports a pass/warn/fail threshold judgment; the DTO reports the raw values for the quota bars in the UI.
  • The warm_pool check covers both the generic command sandbox pool and the AgentHost warm pool; an AgentHost pool below its intended two standby pods indicates slower run starts or capacity pressure.
  • Orphaned pods in orphaned_agent_pods are not terminated by this endpoint; they will be reaped on the next AgentHostReaperService sweep (default: every ~2 minutes via Coordinator:ReaperIntervalTicks).
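
The first note, that latency is bounded by the slowest check rather than the sum, follows from running the checks concurrently with a per-check timeout. The pattern is sketched here with asyncio. It is not the service's code (the Source table lists C# files); the check functions are stand-ins, the timeout is scaled down from 5 seconds so the demo finishes quickly, and reporting a timed-out check as "fail" is an assumption.

```python
import asyncio
import time


async def run_check(name, check, timeout_s):
    """Run one check with a timeout and report it in ComponentHealthDto shape."""
    started = time.monotonic()
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        status, detail = "pass", None
    except asyncio.TimeoutError:
        # Assumption: a timed-out check is reported as a failure.
        status, detail = "fail", "timed out"
    elapsed_ms = int((time.monotonic() - started) * 1000)
    # Cap the reported duration at the timeout, as duration_ms is documented to be.
    duration_ms = min(elapsed_ms, int(timeout_s * 1000))
    return {"name": name, "status": status, "detail": detail, "duration_ms": duration_ms}


async def run_all(checks, timeout_s):
    """Run every check concurrently; total time tracks the slowest, not the sum."""
    return await asyncio.gather(*(run_check(n, c, timeout_s) for n, c in checks.items()))


# Stand-in checks: two that finish quickly and one that hangs.
async def fast():
    await asyncio.sleep(0.01)


async def slow():
    await asyncio.sleep(0.05)


async def hung():
    await asyncio.sleep(10)


started = time.monotonic()
results = asyncio.run(run_all({"fast": fast, "slow": slow, "hung": hung}, timeout_s=0.2))
elapsed = time.monotonic() - started
print([(r["name"], r["status"]) for r in results], round(elapsed, 2))
```

The hung check is cut off at the timeout, so the whole batch finishes in roughly the timeout period rather than ten seconds.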

Source

| Concern | File |
| --- | --- |
| Endpoint definition | apps/Agentweaver.Api/Diagnostics/DiagnosticsEndpoints.cs |
| Business logic | apps/Agentweaver.Api/Diagnostics/DiagnosticsService.cs (GetClusterDiagnosticsAsync) |
| DTO definitions | apps/Agentweaver.Api/Diagnostics/SystemDiagnosticsDto.cs |