
Cluster page

The Cluster page gives operators a real-time view of the Kubernetes cluster backing the Agentweaver AKS deployment: pod activity, quota health, component checks, and any subtasks waiting for capacity.

It is available under the Cluster nav item in the SYSTEM section of the project left rail. Route: /projects/:projectId/cluster.

Cluster page with KPI cards, quota bars, and pod tables

📸 Screenshot: cluster-page.png. Shows: the Cluster page with KPI cards, quota bars (CPU, memory), the component health table with 6 checks, and the Active / Orphaned / Pending pods tables. Path: open a project → click Cluster in the SYSTEM section of the left rail → /projects/:projectId/cluster.

When to use the Cluster page

Open the Cluster page when:

  • a coordinator run shows subtasks in ⏳ Waiting for capacity (amber badge in the topology graph);
  • runs are dispatching slowly and you suspect pod scheduling or quota issues;
  • you want to confirm that all Kubernetes API components are reachable;
  • orphaned pods are accumulating (the reaper has not swept them yet);
  • after a deployment or scaling event, to confirm the cluster is healthy.

KPI cards

The five KPI cards at the top of the page give a quick cluster-health summary:

| Card | What it shows |
| --- | --- |
| Active pods | Number of agent-host pods currently serving a live run. |
| Orphaned pods | Pods running with no matching active run. These are candidates for the next reaper sweep (roughly every 2 minutes). A non-zero count here is a leading indicator of quota pressure. |
| CPU used / total | Current CPU consumption vs. the namespace limit, in cores. |
| Pending runs | Subtasks that are waiting for CPU headroom to become available. Each one retries every 60 seconds for up to 10 attempts before failing with capacity_unavailable. |
| Warm pool | Ready vs. desired warm sandbox replicas across all SandboxWarmPool objects, shown as N/M. A count below M means run starts lose the fast path of a pre-warmed sandbox. |

Quota bars

The two quota bars show namespace resource consumption as a percentage of the configured limit:

  • CPU — consumed core count vs. the namespace CPU limit.
  • Memory — consumed GiB vs. the namespace memory limit.

Color coding:

  • Green — below 60 % of limit
  • Amber — 60–85 % of limit
  • Red — above 85 % of limit
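The thresholds above can be sketched as a small function. This is only an illustration, not the page's actual implementation: the function name `quota_color` is hypothetical, and treating exactly 60% and exactly 85% as amber is an assumption read from the "60–85 %" range.

```python
def quota_color(used: float, limit: float) -> str:
    """Map a usage/limit pair to the bar color described above.

    Assumption: the boundary values 60% and 85% both count as amber.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    pct = 100.0 * used / limit
    if pct < 60:
        return "green"
    if pct <= 85:
        return "amber"
    return "red"


# Example with a hypothetical 16-core namespace limit.
print(quota_color(7.0, 16.0))   # green (43.75%)
print(quota_color(12.0, 16.0))  # amber (75%)
print(quota_color(15.0, 16.0))  # red (93.75%)
```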

Cluster page with quota near-limit (red bar)

📸 Screenshot: cluster-page-quota-warning.png. Shows: the Cluster page with the CPU quota bar showing a red near-limit state, and one or more subtasks in the Pending-capacity runs table. Path: open a project → click Cluster → observe a red CPU bar.

A red CPU bar combined with entries in the Pending-capacity runs table means new pods cannot be scheduled. Separately, if the Warm pool row warns, AgentHost may still work, but run launch loses its fastest path because fewer than two pods are pre-warmed. To recover CPU capacity:

  1. Wait for running pods to finish.
  2. Check the Orphaned pods table. Any pods listed there are reaped on the next sweep, roughly every 2 minutes.
  3. Scale up the katapool node pool if a persistent capacity shortage is expected.

Component health table

Six checks run concurrently each time the page loads:

| Check | What it tests | Typical failure cause |
| --- | --- | --- |
| Postgres | Connectivity to the Postgres database | Network policy, password rotation |
| GitHub token store | Configured GitHub token store validity for the current scope | Token expiry, missing per-user token, GitHub API outage |
| Azure Key Vault | Key Vault reachability and required mcp-oauth-signing-key lookup | Managed identity misconfiguration, network policy, or skipped scripts/aks/16-provision-oauth-signing-key.sh |
| Agent pod quota | CPU headroom ≥ 2 cores | Too many active pods, under-provisioned node pool |
| Warm pool | Warm-pool agent-sandbox availability for generic sandboxes (replicas: 3) and AgentHost (replicas: 2) | Warm-pool replica count below target, SandboxTemplate CRD issue |
| Kubernetes API | Kubernetes API server reachability | In-cluster network policy, apiserver overload |

Each check shows:

  • A status badge: pass (green), warn (amber), or fail (red).
  • A detail message (visible on warn/fail) explaining the specific failure.
  • The duration the check took in milliseconds.

All six checks have a 5-second individual timeout. A timed-out check appears as fail with the detail "timed out".
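As a rough illustration of the behavior described above (concurrent checks, a per-check timeout, and a recorded duration), here is a minimal asyncio sketch. It is not the Agentweaver implementation: `CheckResult`, `run_check`, `run_all`, and the two stub probes are hypothetical names, and the demo uses a shortened timeout so it finishes quickly.

```python
import asyncio
import time
from dataclasses import dataclass

CHECK_TIMEOUT_SECONDS = 5.0  # per-check timeout stated in the text


@dataclass
class CheckResult:
    name: str
    status: str       # "pass", "warn", or "fail"
    detail: str
    duration_ms: int


async def run_check(name, probe, timeout=CHECK_TIMEOUT_SECONDS):
    """Run one probe under its own timeout and record how long it took."""
    started = time.monotonic()
    try:
        status, detail = await asyncio.wait_for(probe(), timeout)
    except asyncio.TimeoutError:
        status, detail = "fail", "timed out"
    except Exception as exc:  # any other probe error is reported as a failure
        status, detail = "fail", str(exc)
    duration_ms = int((time.monotonic() - started) * 1000)
    return CheckResult(name, status, detail, duration_ms)


async def run_all(probes, timeout=CHECK_TIMEOUT_SECONDS):
    """Run every probe concurrently; results keep the order of `probes`."""
    return await asyncio.gather(
        *(run_check(name, probe, timeout) for name, probe in probes.items())
    )


# Hypothetical stub probes standing in for the real checks.
async def fast_probe():
    return "pass", ""


async def slow_probe():
    await asyncio.sleep(1.0)  # longer than the demo timeout below
    return "pass", ""


demo = asyncio.run(
    run_all({"Postgres": fast_probe, "Kubernetes API": slow_probe}, timeout=0.1)
)
for result in demo:
    print(result.name, result.status, result.detail)
```

Because each probe is wrapped in its own `asyncio.wait_for`, one slow dependency cannot delay the others by more than the timeout.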

If the Key Vault row shows critical: secret 'mcp-oauth-signing-key' not found, the required OAuth signing-key provisioning step was skipped. Run scripts/aks/16-provision-oauth-signing-key.sh before redeploying; do not use the installer --skip-oauth-key flag for a production first deploy.

Active agent pods table

Lists pods currently running that have a matching active run record:

| Column | Meaning |
| --- | --- |
| Pod name | Kubernetes pod name |
| Run ID | The run the pod is serving (links to the run page) |
| Node | Kubernetes node the pod is scheduled on |
| Started at | When the pod was created |

In a healthy system, every running agent pod appears in this table and the Orphaned agent pods table is empty.

Orphaned agent pods table

Lists pods that are running but have no matching active run. These will be terminated on the next reaper sweep (default: every ~2 minutes).
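The split between the Active and Orphaned tables can be pictured as a partition of running pods by whether their run is still active. The sketch below assumes each pod carries the ID of the run it was started for; the `Pod` type, its `run_id` field, and `split_pods` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Pod:
    name: str
    run_id: Optional[str]  # hypothetical: the run this pod was started for


def split_pods(
    pods: Iterable[Pod], active_run_ids: Set[str]
) -> Tuple[List[Pod], List[Pod]]:
    """Partition running pods into (active, orphaned)."""
    active: List[Pod] = []
    orphaned: List[Pod] = []
    for pod in pods:
        # A pod with no run ID, or whose run is no longer active, is orphaned.
        if pod.run_id in active_run_ids:
            active.append(pod)
        else:
            orphaned.append(pod)
    return active, orphaned


pods = [Pod("agent-a", "run-1"), Pod("agent-b", "run-9"), Pod("agent-c", None)]
active, orphaned = split_pods(pods, {"run-1"})
print([p.name for p in active])    # ['agent-a']
print([p.name for p in orphaned])  # ['agent-b', 'agent-c']
```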

If orphaned pods are not being cleaned up, check:

  • That the heartbeat is enabled and ticking (see the Heartbeat page).
  • That Coordinator:ReaperIntervalTicks is not set to an unusually large value.

Pending-capacity runs table

Lists coordinator subtasks currently in PendingCapacity status:

| Column | Meaning |
| --- | --- |
| Coordinator run ID | The parent coordinator run |
| Subtask | The subtask waiting for capacity |
| Pending since | When the subtask entered PendingCapacity |
| Retry count | How many dispatch attempts have been made (max 10) |

Each subtask retries every 60 seconds. After 10 retries, the subtask fails with detail code capacity_unavailable and the OutcomeSpec panel shows a human-readable explanation.
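Given the 60-second interval and the 10-retry limit, the Retry count column gives a rough estimate of how long a subtask has left before it fails. The helper names below are hypothetical, and the estimate assumes evenly spaced retries with the failure occurring at the tenth unsuccessful one.

```python
RETRY_INTERVAL_SECONDS = 60  # from the text
MAX_RETRIES = 10             # from the text


def retries_left(retry_count: int) -> int:
    """Retries remaining before the subtask fails with capacity_unavailable."""
    return max(MAX_RETRIES - retry_count, 0)


def seconds_until_failure(retry_count: int) -> int:
    """Approximate upper bound on remaining wait, assuming evenly spaced retries."""
    return retries_left(retry_count) * RETRY_INTERVAL_SECONDS


print(seconds_until_failure(0))  # 600, about 10 minutes in total
print(seconds_until_failure(7))  # 180, about 3 minutes left
```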

Warm pools table

Lists every SandboxWarmPool CRD object in the namespace. Each row represents one pool:

| Column | Meaning |
| --- | --- |
| Name | Kubernetes name of the SandboxWarmPool object |
| Desired | Target number of pre-warmed sandboxes declared in the pool spec |
| Ready | Sandboxes currently ready to accept a claim |
| Available | Sandboxes that are ready and not yet claimed by a run |
| Status | healthy when ready equals desired; warning when below desired; critical when none are ready |

A pool in warning or critical means new run dispatches fall back to creating an ad-hoc sandbox, which adds latency to run startup.
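The Status rule and the N/M figure on the Warm pool KPI card can be sketched as follows. This is an illustrative reading of the rules, not the actual implementation: `WarmPool`, `pool_status`, `warm_pool_kpi`, and the pool names are hypothetical, and treating a ready count above desired as healthy is an assumption the text does not cover.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WarmPool:
    name: str
    desired: int
    ready: int


def pool_status(pool: WarmPool) -> str:
    """Derive the Status column from the ready and desired counts."""
    # Assumption: ready > desired (for example during scale-down) is healthy.
    if pool.ready >= pool.desired:
        return "healthy"
    if pool.ready == 0:
        return "critical"
    return "warning"


def warm_pool_kpi(pools: Iterable[WarmPool]) -> str:
    """Aggregate ready vs. desired across all pools as 'N/M'."""
    pools = list(pools)
    ready = sum(p.ready for p in pools)
    desired = sum(p.desired for p in pools)
    return f"{ready}/{desired}"


# Hypothetical pool names; the replica targets 3 and 2 come from the text.
pools = [WarmPool("generic-sandbox", 3, 3), WarmPool("agenthost", 2, 1)]
for p in pools:
    print(p.name, pool_status(p))  # generic-sandbox healthy, agenthost warning
print(warm_pool_kpi(pools))        # 4/5
```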

Sandbox objects table

Lists all Sandbox CRD objects in the namespace, both warm-pool-managed and ad-hoc per-run sandboxes:

| Column | Meaning |
| --- | --- |
| Name | Kubernetes name of the Sandbox object |
| Phase | standby (warm, waiting for a claim), running, pending, or unknown |
| Ready | Whether the sandbox pod is ready |
| Pod | Underlying pod name, if scheduled |
| Warm pool | The SandboxWarmPool that owns this sandbox; blank for ad-hoc sandboxes |
| Age | How long the object has existed |

Sandbox claims table

Lists all SandboxClaim CRD objects in the namespace:

| Column | Meaning |
| --- | --- |
| Name | Kubernetes name of the SandboxClaim object |
| Phase | bound (assigned to a sandbox), pending (waiting for one), or unknown |
| Ready | Whether the claimed sandbox is ready |
| Run | The run ID that created this claim, linking to the run page when present |
| Bound sandbox | The Sandbox object this claim is bound to; blank when still pending |
| Warm pool | The pool the bound sandbox came from; blank for ad-hoc claims |
| Age | How long the claim has existed |

404 fallback

When the API is not deployed on AKS, or the cluster diagnostics endpoint is unavailable, the page displays a message indicating that cluster diagnostics are not available for this deployment. No other page functionality is affected.