Deploy to AKS
This guide covers deploying Agentweaver to Azure Kubernetes Service (AKS).
One-liner deploy (recommended)
Run the full AKS provisioning — ACR, cluster, identity, Postgres, image builds, mTLS certs, and deployment — with a single command:
# macOS / Linux / WSL2
curl -fsSL https://raw.githubusercontent.com/sabbour/agentweaver/main/install.sh | bash -s -- --aks
# Windows PowerShell (delegates to install.sh via WSL2)
& ([scriptblock]::Create((irm 'https://raw.githubusercontent.com/sabbour/agentweaver/main/install.ps1'))) -Aks
Optional flags:
| Flag (bash) | Flag (PowerShell) | Effect |
|---|---|---|
| --skip-postgres | -SkipPostgres | Skip Postgres provisioning (step 17) if it already exists |
| --skip-oauth-key | -SkipOauthKey | Skip OAuth signing key provisioning (step 16) only for a verified existing secret. Do not use this in production first-deploys. |
| --image-tag <tag> | -ImageTag <tag> | Use this image tag instead of the short git SHA (see Redeploy) |
The installer will clone the repo to ~/agentweaver if you don't already have a local checkout, then run all provisioning steps in order.
Production warning:
scripts/aks/16-provision-oauth-signing-key.sh is required before the first deploy. Skipping it leaves Key Vault without mcp-oauth-signing-key; cluster diagnostics report key_vault: critical: secret 'mcp-oauth-signing-key' not found, and OAuth/JWT validation will not be production-stable.
Prerequisites before running:
az login, kubectl, envsubst, openssl, and the aks-preview Azure CLI extension. See Prerequisites below for install links.
Redeploy / update
Re-running the installer with --image-tag builds new images, pushes them, and redeploys — this is the standard update path:
# From a cloned checkout
bash install.sh --aks --image-tag <new-git-sha>
.\install.ps1 -Aks -ImageTag <new-git-sha>
Or via one-liner (no local checkout required):
curl -fsSL https://raw.githubusercontent.com/sabbour/agentweaver/main/install.sh | bash -s -- --aks --image-tag <new-git-sha>
Never use :latest. Image tags are immutable per build. The default is git rev-parse --short HEAD. Always pin to a specific SHA for reproducible, rollback-safe deployments.
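The tagging rule above can be checked before the installer is invoked. The sketch below is an illustration, not part of install.sh: the function name validate_image_tag is hypothetical, and it assumes that any lowercase hex string of 7 to 40 characters is an acceptable commit SHA.

```shell
# Hypothetical helper (not part of the Agentweaver scripts): accept only
# tags that look like a git commit SHA and reject mutable tags like "latest".
validate_image_tag() {
  tag="$1"
  case "$tag" in
    ""|latest) return 1 ;;      # empty or mutable tag
    *[!0-9a-f]*) return 1 ;;    # contains a non-hex character
  esac
  len=${#tag}
  # By default git short SHAs are at least 7 characters; a full SHA-1 is 40.
  [ "$len" -ge 7 ] && [ "$len" -le 40 ]
}

# Usage: stop early instead of deploying an unpinned tag.
if validate_image_tag "abc1234"; then
  echo "tag ok"
else
  echo "refusing to deploy: tag is not a commit SHA" >&2
fi
```

Running a check like this before passing --image-tag keeps an accidental "latest" or branch name from reaching the registry.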
Advanced: manual step-by-step installation
The one-liner above calls these 11 scripts internally, in order. Run them manually if you need to customise or resume a partial install:
- scripts/aks/00-variables.sh — shared environment variables (source, don't run directly)
- scripts/aks/10-create-cluster.sh — ACR + AKS cluster
- scripts/aks/15-setup-identity.sh — managed identity + Key Vault secrets
- scripts/aks/16-provision-oauth-signing-key.sh — OAuth signing key (required before first deploy)
- scripts/aks/17-provision-postgres.sh — Azure Database for PostgreSQL
- scripts/aks/20-build-push-images.sh — build and push container images (no local Docker required)
- scripts/aks/gen-a2a-mtls-certs.sh — A2A mTLS certificates (must run before 30-deploy.sh)
- scripts/aks/30-deploy.sh — apply all Kubernetes manifests
- scripts/aks/40-verify.sh — verify deployment health
- scripts/aks/c3-scale-zero.sh — scale to zero (optional cost-saving)
- scripts/aks/c4-flip-postgres.sh — flip to external Postgres (optional)
The detailed walkthrough for each step follows below.
Prerequisites
Tools
| Tool | Minimum version | Install |
|---|---|---|
| Azure CLI | 2.80.0+ | Install guide |
| aks-preview extension | latest | az extension add --upgrade --name aks-preview |
| kubectl | 1.29+ | az aks install-cli |
| envsubst | any | apt install gettext / brew install gettext |
Log in before running any script:
az login
az account set --subscription <YOUR_SUBSCRIPTION_ID>
GitHub OAuth App
Agentweaver uses GitHub OAuth for user authentication. Create a GitHub OAuth App before provisioning:
- Go to GitHub → Settings → Developer settings → OAuth Apps → New OAuth App
- Set Authorization callback URL to https://<your-host>/auth/github/callback (you can update this later once the managed domain is assigned)
- Note the Client ID and generate a Client secret
Required secret values
Gather these before running scripts/aks/15-setup-identity.sh:
| Variable | Description |
|---|---|
| MCP_API_KEY | Internal API loopback key for the API's Scribe/coordinator self-calls (Auth__ApiKey) |
| GITHUB_CLIENT_ID | GitHub OAuth App client ID |
| GITHUB_CLIENT_SECRET | GitHub OAuth App client secret |
Step 1 — Set variables
All scripts source scripts/aks/00-variables.sh, which exports shared environment variables. Override defaults before sourcing:
export RESOURCE_GROUP=agentweaver-rg
export CLUSTER_NAME=agentweaver-aks
export ACR_NAME=agentweaverregistry # globally unique, alphanumeric only
export LOCATION=westus2
export KEYVAULT_NAME=agentweaver-kv # globally unique
# IMAGE_TAG defaults to the short git SHA (recommended)
# export IMAGE_TAG=$(git rev-parse --short HEAD)
source scripts/aks/00-variables.sh
Image tagging:
IMAGE_TAG defaults to git rev-parse --short HEAD (the short commit SHA). Always use a commit SHA — never :latest. Every deploy script reads this variable to tag and reference the exact images in use.
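The naming constraints noted in the variable comments above can be checked locally before any Azure call is made. The following is a minimal sketch, not part of 00-variables.sh: the function names are hypothetical, and the length and character rules are the Azure naming rules for container registries and key vaults as understood at the time of writing. Global uniqueness cannot be verified offline.

```shell
# Hypothetical pre-flight checks (not part of 00-variables.sh).

# ACR names: 5-50 characters, letters and digits only.
valid_acr_name() {
  name="$1"
  case "$name" in
    *[!A-Za-z0-9]*) return 1 ;;
  esac
  [ "${#name}" -ge 5 ] && [ "${#name}" -le 50 ]
}

# Key Vault names: 3-24 characters; letters, digits, and hyphens;
# must start with a letter, end with a letter or digit, and contain no "--".
valid_keyvault_name() {
  name="$1"
  case "$name" in
    *[!A-Za-z0-9-]*|*--*) return 1 ;;
    [A-Za-z]*[A-Za-z0-9]) ;;
    *) return 1 ;;
  esac
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 24 ]
}

# Usage with the defaults shown above.
valid_acr_name "agentweaverregistry" && echo "ACR name ok"
valid_keyvault_name "agentweaver-kv" && echo "Key Vault name ok"
```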
Step 2 — Create the cluster and ACR
bash scripts/aks/10-create-cluster.sh
This script provisions:
- Resource group in $LOCATION
- Azure Container Registry ($ACR_NAME) with admin auth disabled
- AKS cluster ($CLUSTER_NAME) with:
| Feature | Flag | Purpose |
|---|---|---|
| App Routing (Istio variant) | --enable-app-routing-istio | approuting-istio GatewayClass + managed LoadBalancer |
| Gateway API | --enable-gateway-api | Required for HTTPRoute resources |
| Managed default domain | --enable-default-domain | Provisions a *.aksapp.io domain + TLS cert |
| Azure CNI Overlay | --network-plugin azure --network-plugin-mode overlay | Pod networking |
| Cilium dataplane | --network-dataplane cilium | NetworkPolicy enforcement |
| ACNS | --enable-acns | CiliumNetworkPolicy FQDN-based egress + network observability |
| AzureLinux nodes | --os-sku AzureLinux | Hardened node OS (all pools) |
| System pool taint | --node-taints CriticalAddonsOnly=true:NoSchedule | Restricts nodepool1 to critical-addon pods; app workloads use apppool |
| Cluster Autoscaler (system pool) | --enable-cluster-autoscaler --min-count 1 --max-count 3 | Scales system node pool automatically (1–3 nodes) |
| Key Vault CSI driver | --enable-addons azure-keyvault-secrets-provider | Mounts Key Vault secrets as volumes |
| Workload Identity | --enable-oidc-issuer --enable-workload-identity | Federated credentials for pods |
| ACR pull-through | --attach-acr $ACR_ID | No imagePullSecret required |
After cluster creation the script adds two user node pools via az aks nodepool add:
| Pool | Mode | workloadRuntime | Autoscaler | Taint | Label | Receives |
|---|---|---|---|---|---|---|
| nodepool1 | System | (standard) | 1–3 nodes | CriticalAddonsOnly=true:NoSchedule | — | kube-system / critical addons only |
| apppool | User | (standard) | 1–5 nodes | (none) | — | api, worker, mcp, frontend, jobs |
| katapool | User | KataVmIsolation | 1–5 nodes | sandbox=kata:NoSchedule | agentweaver.io/kata=true | Sandbox / AgentHost pods |
Why a dedicated app pool?
CriticalAddonsOnly=true:NoSchedule on the system pool is the AKS-recommended way to reserve it for cluster-critical components. No tolerations are needed in any application deployment YAML — app workloads schedule onto apppool by default, which has no taint. Sandbox and AgentHost pods land on katapool via their existing SandboxTemplate toleration (sandbox=kata:NoSchedule) and preferred nodeAffinity (agentweaver.io/kata=true).
NAP vs cluster-autoscaler:
--node-provisioning-mode Auto (Node Auto Provisioning) is not used because NAP and cluster-autoscaler are mutually exclusive. Kata VM isolation requires --workload-runtime KataVmIsolation on a fixed user pool with --enable-cluster-autoscaler.
After cluster creation the script installs the agent-sandbox CRD controller:
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.6/release.yaml
This provides the SandboxClaim, SandboxTemplate, and SandboxWarmPool CRDs used by the sandbox executor.
Feature registration: Some flags are in preview. If az aks create fails with "feature not registered":
az feature register --namespace Microsoft.ContainerService --name AKSAppRoutingGatewayAPI
az feature register --namespace Microsoft.ContainerService --name AKS-KataVMIsolation
az provider register --namespace Microsoft.ContainerService
# Wait until Registered:
az feature show --namespace Microsoft.ContainerService --name AKSAppRoutingGatewayAPI \
  --query properties.state -o tsv
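The "wait until Registered" step in the note above can be automated with a small polling loop. The helper below is a sketch under stated assumptions: wait_for_value is a hypothetical name, it is not part of the Agentweaver scripts, and the attempt count and delay in the commented example are arbitrary.

```shell
# Hypothetical helper: run a command repeatedly until its output equals
# the expected value, or give up after max_attempts tries.
# Usage: wait_for_value <expected> <max_attempts> <sleep_seconds> <command...>
wait_for_value() {
  expected="$1"; max_attempts="$2"; delay="$3"
  shift 3
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    actual="$("$@" 2>/dev/null)"
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example (requires az login): poll every 30 s, up to 40 times.
# wait_for_value Registered 40 30 \
#   az feature show --namespace Microsoft.ContainerService \
#     --name AKSAppRoutingGatewayAPI --query properties.state -o tsv
```

Taking the command as trailing arguments keeps the loop reusable for other status checks in this guide, such as waiting for a Gateway or certificate condition.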
Step 3 — Set up identity and secrets
export TENANT_ID=$(az account show --query tenantId -o tsv)
export GITHUB_CLIENT_ID=<from-github-oauth-app>
export GITHUB_CLIENT_SECRET=<from-github-oauth-app>
bash scripts/aks/15-setup-identity.sh
The script:
- Creates a user-assigned managed identity (agentweaver-api-identity)
- Creates an Azure Key Vault ($KEYVAULT_NAME) with RBAC authorization enabled
- Stores two secrets in Key Vault:
  - github-client-id — GitHub OAuth App client ID
  - github-client-secret — GitHub OAuth App client secret
- Grants the managed identity Key Vault Secrets User on the vault
- Enables OIDC issuer + workload identity on the cluster (if not already enabled)
- Creates a federated credential (agentweaver-api-fedcred) linking serviceaccount/agentweaver-api in namespace agentweaver to the managed identity
- Creates a second federated credential (agentweaver-agenthost-fedcred) linking serviceaccount/agentweaver-agent-host to the same managed identity — required for warm agent-host pods to fetch the configured user token from Key Vault at /configure time
Required before first deploy: run
bash scripts/aks/16-provision-oauth-signing-key.sh to provision the mcp-oauth-signing-key secret in Key Vault. Do not pass --skip-oauth-key / -SkipOauthKey for a production first deploy; use it only when you have verified the secret already exists. Missing this step shows up in diagnostics as key_vault: critical: secret 'mcp-oauth-signing-key' not found.
At completion, export the identity client ID for use in Step 5:
export IDENTITY_CLIENT_ID=$(az identity show \
--name agentweaver-api-identity \
--resource-group "${RESOURCE_GROUP}" \
--query clientId -o tsv)
How secrets flow from Key Vault to pods
Azure Key Vault
└── github-client-id ─┐
└── github-client-secret ├─ SecretProviderClass: agentweaver-secrets
└── mcp-oauth-signing-key ─┘ (k8s/secret-provider-class.yaml)
│
│ CSI driver fetches via pod workload identity
▼
Pod volume: /mnt/secrets-store/
github-client-id (file)
github-client-secret (file)
mcp-oauth-signing-key (file)
│
│ API startup script reads files:
▼
env vars injected at runtime (not in YAML):
GitHub__ClientId
GitHub__ClientSecret
Auth__OAuth__SigningKey
The MCP pod mounts no secrets; its auth relies only on the OAuth paths.
The CSI volume mount is what triggers SecretProviderClass synchronization — without the volume attached, secrets are never fetched.
AgentHost user tokens do not use a shared fan-out SPC or per-run SPC. Each authenticated user's GitHub OAuth token is stored in Key Vault under a per-user key (ghtok-user--{base32(userId)}). The AgentHost template receives AgentHost__KeyVaultUri as static config, and warm AgentHost pods run in standby at replicas: 2. When a run launches, KubernetesSandboxExecutor claims one warm pod and calls POST /configure with the run owner's Key Vault secret name. The pod fetches only that secret via workload identity (SecretClient + DefaultAzureCredential) and caches it in memory. There is no /mnt/user-tokens/ mount and no per-run SPC/template/warm-pool cleanup.
Step 4 — Build and push container images
bash scripts/aks/20-build-push-images.sh
Builds five images via az acr build (no local Docker daemon required). The build runs remotely in ACR:
| Image | Dockerfile | Build context |
|---|---|---|
| agentweaver-api:${IMAGE_TAG} | apps/Agentweaver.Api/Dockerfile | repo root |
| agentweaver-frontend:${IMAGE_TAG} | apps/web/Dockerfile | repo root |
| agentweaver-mcp:${IMAGE_TAG} | apps/Agentweaver.Mcp/Dockerfile | repo root |
| agentweaver-sandbox:${IMAGE_TAG} | apps/agentweaver-sandbox/Dockerfile | apps/agentweaver-sandbox/ |
| agentweaver-agent-host:${AGENTHOST_IMAGE_TAG} | apps/Agentweaver.AgentHost/Dockerfile | repo root |
The API, frontend, MCP, and AgentHost images use the repo root as the build context because their Dockerfiles reference multiple project subdirectories. The sandbox base image is self-contained.
The AgentHost Dockerfile publishes with dotnet publish --runtime linux-x64 --self-contained false. The RID is required for GitHub.Copilot.SDK to extract its native copilot CLI into the publish output. Without it, AgentHost pods crash with Copilot runtime not found at '/app/runtimes/linux-x64/native/copilot'.
Example output after a successful build:
agentweaverregistry.azurecr.io/agentweaver-api:abc1234
agentweaverregistry.azurecr.io/agentweaver-frontend:abc1234
agentweaverregistry.azurecr.io/agentweaver-mcp:abc1234
agentweaverregistry.azurecr.io/agentweaver-sandbox:abc1234
agentweaverregistry.azurecr.io/agentweaver-agent-host:abc1234
Step 5 — Deploy
IDENTITY_CLIENT_ID=<from-step-3> \
KEYVAULT_NAME=agentweaver-kv \
TENANT_ID=$(az account show --query tenantId -o tsv) \
bash scripts/aks/30-deploy.sh
The script aborts if IDENTITY_CLIENT_ID, KEYVAULT_NAME, or TENANT_ID are unset.
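A guard of this kind is straightforward to reproduce in wrapper scripts of your own. The sketch below shows one way to write it; require_vars is a hypothetical name, the actual check inside 30-deploy.sh may differ, and this version also treats empty values as missing.

```shell
# Hypothetical guard (an assumption, not copied from 30-deploy.sh):
# fail if any named variable is unset or empty, reporting every missing name.
require_vars() {
  _rv_missing=0
  for _rv_name in "$@"; do
    # Indirect lookup via eval; the names are fixed identifiers, not user input.
    eval "_rv_value=\${$_rv_name:-}"
    if [ -z "$_rv_value" ]; then
      echo "error: $_rv_name must be set" >&2
      _rv_missing=1
    fi
  done
  return "$_rv_missing"
}

# Usage at the top of a wrapper script:
# require_vars IDENTITY_CLIENT_ID KEYVAULT_NAME TENANT_ID || exit 1
```

Reporting all missing names in one pass avoids the fix-one-rerun-fix-another loop that a chain of individual checks produces.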
What 30-deploy.sh does
- Applies k8s/namespace.yaml (creates the agentweaver namespace)
- Creates a DefaultDomainCertificate resource named cert and waits until it becomes Available
- Derives the managed hostname: agentweaver.<wildcard-domain> (e.g. agentweaver.abc123.westus2.aksapp.io)
- Renders all k8s/*.yaml manifests using envsubst, substituting:
  - ${HOST} — managed hostname
  - ${ACR_LOGIN_SERVER} — ACR login server
  - ${IMAGE_TAG} — commit SHA
  - ${IDENTITY_CLIENT_ID} — managed identity client ID
  - ${KEYVAULT_NAME} — Key Vault name
  - ${TENANT_ID} — Azure tenant ID
- Applies resources in this order:
  - serviceaccount-api.yaml → secret-provider-class.yaml → rbac-api.yaml → quota.yaml → pvc-data.yaml → pvc-workspace.yaml
  - Network policies and Cilium egress policies
  - Services, Gateway, HTTPRoutes, backup CronJob
  - Sandbox template and warm pool (skipped if CRDs not installed)
- Waits for Gateway Programmed=True
- Applies deployments (api, frontend, mcp)
- Waits for all three rollouts to complete
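The hostname derivation and rendering steps in the list above can be sketched as follows. This is an illustration under assumptions, not an excerpt from 30-deploy.sh: managed_host is a hypothetical helper, and the envsubst invocation in the comments is one plausible way to render a manifest.

```shell
# Hypothetical helper: turn the wildcard domain reported by the
# DefaultDomainCertificate (e.g. "*.abc123.westus2.aksapp.io") into
# the application hostname "agentweaver.<domain>".
managed_host() {
  wildcard="$1"
  domain="${wildcard#\*.}"    # strip a leading "*." if present
  printf 'agentweaver.%s\n' "$domain"
}

HOST="$(managed_host '*.abc123.westus2.aksapp.io')"
export HOST
echo "$HOST"

# With HOST and the other variables exported, a manifest could be rendered
# and applied like this (assumed invocation):
#   envsubst '${HOST} ${IMAGE_TAG}' < k8s/gateway.yaml | kubectl apply -f -
```

Passing an explicit variable list to envsubst, as in the commented line, limits substitution to the named variables, so any other dollar-sign expressions in a manifest are left untouched.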
At completion:
Frontend URL: https://agentweaver.<domain>/
API URL: https://agentweaver.<domain>/api/
MCP URL: https://agentweaver.<domain>/mcp/
Kubernetes manifests overview
All manifests live in k8s/. The deploy script applies them in dependency order.
Core infrastructure
| File | Kind | Purpose |
|---|---|---|
| namespace.yaml | Namespace | agentweaver namespace with app.kubernetes.io/part-of: agentweaver label |
| serviceaccount-api.yaml | ServiceAccount | agentweaver-api SA, annotated with managed identity client ID for workload identity |
| rbac-api.yaml | Role + RoleBinding | Grants agentweaver-api SA permission to create/delete SandboxClaim, read/update sandbox resources, get/create pods, and create pods/exec. Per-run SPC/template/warm-pool creation is no longer part of AgentHost launch. |
| quota.yaml | ResourceQuota + LimitRange + PodDisruptionBudget | Namespace quota (25 pods, 8 CPU req, 16Gi mem req, 20 sandbox claims); default container limits; PDBs for api/mcp/frontend |
Secrets
| File | Kind | Purpose |
|---|---|---|
| secret-provider-class.yaml | SecretProviderClass | Defines agentweaver-secrets for API app secrets. AgentHost user tokens are fetched from Key Vault at /configure time; no per-run user-token SPC is created. |
Storage
| File | Kind | Purpose |
|---|---|---|
| pvc-workspace.yaml | PersistentVolumeClaim | agentweaver-workspace — 50 Gi Azure Files Premium (RWX), mounted at /workspace for agent worktrees |
Network policies
| File | Kind | Purpose |
|---|---|---|
| networkpolicy-default-deny.yaml | NetworkPolicy (×6) | Default-deny ingress/egress; allows gateway→api, gateway→frontend, DNS, internal pod traffic, external HTTPS for api+mcp |
| networkpolicy-mcp.yaml | NetworkPolicy | Allows gateway ingress to MCP pod on port 8080 |
| networkpolicy-sandbox.yaml | NetworkPolicy (×2) | Deny-all ingress to sandbox pods; egress allow-list: DNS + HTTPS to GitHub IP range |
| cilium-network-policy-sandbox.yaml | CiliumNetworkPolicy | FQDN-based egress for sandbox pods: api.github.com, registry.npmjs.org, Azure AI services |
| serviceentry-telemetry.yaml | CiliumNetworkPolicy | FQDN-based egress for app pods: GitHub, Azure AI services, OpenTelemetry collector |
Networking (Gateway API)
| File | Kind | Purpose |
|---|---|---|
| gateway.yaml | Gateway | agentweaver-gateway — HTTPS listener on port 443, TLS terminate with managed cert, gatewayClassName: approuting-istio |
| gateway-preview.yaml | Gateway | agentweaver-preview-gateway — dedicated HTTPS gateway for sandbox browser preview; shares the managed *.aksapp.io wildcard cert. Applied by 30-deploy.sh. |
| httproute-api.yaml | HTTPRoute | Routes PathPrefix: /api and /auth → agentweaver-api:8080 |
| httproute-frontend.yaml | HTTPRoute | Routes PathPrefix: / (catch-all) → agentweaver-frontend:80 |
| mcp-httproute.yaml | HTTPRoute | Routes PathPrefix: /mcp → agentweaver-mcp:8080; rewrites /mcp/health → /healthz |
Workloads
| File | Kind | Purpose |
|---|---|---|
| api-deployment.yaml | Deployment | API pod — 2 replicas, PostgreSQL-backed, init container runs EF migrations |
| api-service.yaml | Service | agentweaver-api ClusterIP :8080 |
| frontend-deployment.yaml | Deployment | Frontend pods — 2 replicas, serves React SPA |
| frontend-service.yaml | Service | agentweaver-frontend ClusterIP :80 → :8080 |
| mcp-deployment.yaml | Deployment | MCP server — 1 replica, forwards caller Bearer token to API |
| mcp-service.yaml | Service | agentweaver-mcp ClusterIP :8080 |
Sandbox
| File | Kind | Purpose |
|---|---|---|
| sandbox-template.yaml | SandboxTemplate | Template for isolated pods — kata-vm-isolation runtime, non-root, read-only rootfs, workspace PVC |
| sandbox-template-agenthost.yaml | SandboxTemplate | AgentHost template — image ${AGENTHOST_IMAGE_TAG}, workload identity, AgentHost__KeyVaultUri, no user-token CSI mount. Per-run context is delivered by /configure. |
| sandbox-warmpool.yaml | SandboxWarmPool | Keeps 3 pre-warmed generic sandbox pods ready (agentweaver-sandbox) |
| sandbox-warmpool-agenthost.yaml | SandboxWarmPool | agentweaver-agent-host pool (replicas: 2) — keeps two AgentHost pods pre-warmed in standby; ops can tune this file for startup/capacity trade-offs |
| sandbox-claim-template.yaml | (template) | Reference v1beta1 SandboxClaim shape — spec.warmPoolRef.name + spec.lifecycle |
AgentHost launch no longer creates per-run SPCs, cloned templates, or per-run warm pools. It binds a pod from the shared warm pool, calls /configure with run/user/token/KV-secret context, and releases only the claim when the run completes or suspends.
TLS and HTTPS
TLS is handled by the AKS App Routing add-on with a managed DefaultDomainCertificate.
How it works
The deploy script creates a DefaultDomainCertificate resource named cert in the agentweaver namespace:
apiVersion: approuting.kubernetes.azure.com/v1alpha1
kind: DefaultDomainCertificate
metadata:
  name: cert
  namespace: agentweaver
spec:
  target:
    secret: agentweaver-tls
The App Routing controller provisions a wildcard certificate for the cluster's managed domain (e.g. *.abc123.westus2.aksapp.io) and stores it in Secret/agentweaver-tls.
k8s/gateway.yaml references the secret in its TLS listener:
tls:
  mode: Terminate
  certificateRefs:
    - kind: Secret
      name: agentweaver-tls
The gateway terminates TLS and forwards plain HTTP to backend services inside the cluster.
Check certificate status
kubectl get defaultdomaincertificate cert -n agentweaver -o yaml
# status.conditions should include Available=True
# status.domain contains the wildcard domain
Update the GitHub OAuth callback URL
Once the managed domain is assigned, update the GitHub OAuth App's callback URL:
HOST=$(kubectl get defaultdomaincertificate cert -n agentweaver \
-o jsonpath='{.status.domain}' | sed 's/^\*\.//')
echo "Callback URL: https://agentweaver.${HOST}/auth/github/callback"
Network policies
The agentweaver namespace enforces default-deny with explicit allow rules. All policies are in k8s/networkpolicy-default-deny.yaml (plus sandbox-specific files).
Ingress rules
| Policy | Selector | Allows |
|---|---|---|
| default-deny-ingress | all app.kubernetes.io/part-of: agentweaver pods (gateway excluded) | Denies all ingress by default |
| allow-gateway-to-api (unnamed in YAML) | app: agentweaver-api | Ingress on :8080 from gateway pods or aks-istio-ingress namespace |
| allow-gateway-to-frontend | app: agentweaver-frontend | Ingress on :8080 from gateway pods or aks-istio-ingress namespace |
| allow-gateway-to-mcp | app: agentweaver-mcp | Ingress on :8080 from gateway pods or aks-istio-ingress namespace |
| sandbox-deny-ingress | app: agentweaver-sandbox | Denies all ingress by default; preview ingress on TCP 3000–9000 from the preview gateway is explicitly allowed by sandbox-allow-preview-ingress (see Sandbox browser preview) |
| allow-worker-to-agenthost-a2a | app: agentweaver-sandbox | Allows only worker/API pods to reach AgentHost on TCP :8088; message:stream also requires the per-run bearer token |
Gateway pods are identified by gateway.networking.k8s.io/gateway-name: agentweaver-gateway, which the approuting-istio controller sets automatically.
Egress rules
| Policy | Selector | Allows |
|---|---|---|
| default-deny-egress-apps | api, mcp, frontend pods | Denies all egress by default |
| allow-app-dns-egress | api, mcp, frontend pods | UDP/TCP :53 to kube-dns in kube-system |
| allow-app-internal-egress | api, mcp, frontend pods | TCP :8080 to other app.kubernetes.io/part-of: agentweaver pods |
| allow-app-external-https-egress | api, mcp pods only | TCP :443 to any external host |
| allow-api-agenthost-egress | api pods | TCP :8088 to AgentHost pods for A2A turns |
| allow-worker-agenthost-egress | worker pods | TCP :8088 to AgentHost pods for A2A turns |
| sandbox-egress-allowlist | app: agentweaver-sandbox | DNS + TCP :443 to GitHub IP range 140.82.112.0/20 |
Cilium FQDN egress (sandbox)
k8s/cilium-network-policy-sandbox.yaml narrows sandbox internet egress to specific hostnames:
- api.github.com — GitHub REST API
- registry.npmjs.org, *.npmjs.org — npm packages
- *.services.ai.azure.com, *.openai.azure.com, *.cognitiveservices.azure.com, *.models.ai.azure.com — Azure AI services
Cilium FQDN egress (app pods)
k8s/serviceentry-telemetry.yaml controls app pod (api, mcp, frontend) external egress:
- api.github.com, github.com, *.github.com — GitHub API (auth token validation, org membership)
- Azure AI service domains
- otel-collector.observability.svc.cluster.local:4317 — OpenTelemetry collector
Sandbox setup
The API uses a Kubernetes-native sandbox executor when running in-cluster. Each agent run claims a pre-warmed pod from the SandboxWarmPool, executes commands via pods/exec, then releases the claim.
Prerequisites
The agent-sandbox controller is installed by scripts/aks/10-create-cluster.sh. Verify:
kubectl api-resources --api-group=extensions.agents.x-k8s.io
# Should list: sandboxclaims, sandboxtemplates, sandboxwarmpools
Sandbox pod characteristics
From k8s/sandbox-template.yaml:
- Runtime: kata-vm-isolation — hardware VM-grade isolation (not just container isolation)
- Security: non-root (UID 1000), read-only rootfs, no capabilities, seccomp RuntimeDefault
- Volumes: workspace PVC at /workspace, emptyDir at /tmp
- Resources: 256Mi–4Gi RAM, 250m–1000m CPU
- Env injection: Disallowed (sandbox cannot inherit API pod env vars)
Verify warm pool
# Should show 3 pods in Running state
kubectl get pods -n agentweaver -l app=agentweaver-sandbox
# Check warm pool status
kubectl get sandboxwarmpool agentweaver-sandbox -n agentweaver \
-o jsonpath='{.status}' | jq
# status.readyReplicas should equal 3
Claim lifecycle
- API creates a SandboxClaim (e.g. run-a1b2c3d4e5f6g7h8)
- Controller binds it to a warm pod (sub-second when pool is warm)
- API runs commands via Kubernetes pods/exec WebSocket
- On completion, claim is deleted; controller terminates the pod and refills the pool
Sandbox browser preview
The sandbox browser preview lets an agent (or an operator) expose a server running inside its sandbox pod at a public HTTPS URL, scoped to that one run. It is enabled by default in the AKS deployment.
How it works
When an agent calls the start_preview MCP tool (or an operator clicks Preview in the UI), the API:
- Resolves the run's bound sandbox pod from the SandboxClaim status (replica-safe — no in-process registry).
- Mints an unguessable 128-bit capability token and derives the host {token}-preview.{ZoneSuffix}.
- Creates a ClusterIP Service (preview-{token}) whose selector targets that specific run's pod at the requested port.
- Creates an HTTPRoute (preview-{token}) that attaches to agentweaver-preview-gateway, matches the {token}-preview.{ZoneSuffix} hostname, and uses the Service as its backend.
- Returns the public preview_url (e.g. https://{token}-preview.6a41f26c75d5cf00019ef7d7.westus2.staging.aksapp.io).
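The token and hostname derivation in the steps above can be illustrated with a short sketch. The function names are hypothetical, and the hex encoding of the token is an assumption; the source specifies only that the token is 128 bits and that the host has the form {token}-preview.{ZoneSuffix}.

```shell
# Hypothetical sketch: generate a 128-bit random token as 32 hex characters.
# The API's real token encoding is not documented here; hex is an assumption.
new_preview_token() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Compose the public preview URL from a token and the zone suffix.
preview_url() {
  token="$1"; zone_suffix="$2"
  printf 'https://%s-preview.%s\n' "$token" "$zone_suffix"
}

token="$(new_preview_token)"
preview_url "$token" "6a41f26c75d5cf00019ef7d7.westus2.staging.aksapp.io"
```

With this encoding the first DNS label is 40 characters (32 for the token plus "-preview"), which stays under the 63-character limit for a single label.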
The request is routed through a human-in-the-loop approval gate (AgentPreviewGate) before any provisioning happens. By default (SANDBOX_PREVIEW_AUTO_APPROVE=false) an operator must grant the request via POST /api/runs/{runId}/tool-approvals; the agent call suspends until approved or the 5-minute window times out.
Prerequisites
Two resources must be in place before the first preview can be served:
| Requirement | Detail |
|---|---|
| Sandbox__Preview__Enabled=true | Set in the API pod's environment (injected by 30-deploy.sh). Activates the Gateway-direct path and the reaper service. |
| k8s/gateway-preview.yaml applied | Creates the agentweaver-preview-gateway Gateway; applied by 30-deploy.sh. |
The deploy script sets the three required env vars automatically from the cluster's managed domain:
| Env var | ASP.NET config key | Production value |
|---|---|---|
| Sandbox__Preview__Enabled | Sandbox:Preview:Enabled | true |
| Sandbox__Preview__ZoneSuffix | Sandbox:Preview:ZoneSuffix | 6a41f26c75d5cf00019ef7d7.westus2.staging.aksapp.io |
| Sandbox__Preview__GatewayName | Sandbox:Preview:GatewayName | agentweaver-preview-gateway |
Approval gate
The start_preview tool routes through AgentPreviewGate before provisioning:
- Default (human-gated): SANDBOX_PREVIEW_AUTO_APPROVE=false — the run timeline receives a tool.approval_required event and the agent waits up to 5 minutes for an operator to call POST /api/runs/{runId}/tool-approvals.
- Auto-approve: Set SANDBOX_PREVIEW_AUTO_APPROVE=true on the API pod to skip the gate (useful for automated demos). Do not enable in production without understanding the security implications.
Preview lifecycle
| Mechanism | Default | Behaviour |
|---|---|---|
| Sliding idle TTL | 30 min | SandboxPreviewReaperService sweeps every ~60 s; the frontend pings keepalive every 60 s to extend the window. Stop pinging and the preview lapses within the idle window. |
| Hard cap | 8 h | A preview is always reaped after this, regardless of keepalive. |
| Pod gone | — | If the backing sandbox pod no longer exists, the reaper removes the preview as an orphan. |
| Explicit stop | — | DELETE /api/runs/{runId}/sandbox/port-forward/{token} immediately removes the HTTPRoute and Service. |
The reaper reads all decision inputs from cluster state (HTTPRoute annotations + pod existence), so both API replicas reconcile identically — no leader election, no in-memory timers.
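The lifecycle rules in the table above reduce to a pure function of timestamps and pod existence, which is why every replica reaches the same verdict. The sketch below is illustrative only: should_reap is a hypothetical name, the order of the checks and the inclusive boundaries are assumptions, and the real SandboxPreviewReaperService reads its inputs from HTTPRoute annotations rather than arguments.

```shell
# Hypothetical sketch of the reap decision; all times are epoch seconds.
IDLE_TTL_SECONDS=$((30 * 60))        # sliding idle TTL: 30 min
HARD_CAP_SECONDS=$((8 * 60 * 60))    # hard cap: 8 h

# should_reap <created_at> <last_keepalive_at> <now> <pod_exists: yes|no>
# Prints the reason and returns 0 if the preview should be removed;
# returns 1 if the preview should be kept.
should_reap() {
  created="$1"; last_seen="$2"; now="$3"; pod_exists="$4"
  if [ "$pod_exists" != "yes" ]; then
    echo "orphan"; return 0
  fi
  if [ $((now - created)) -ge "$HARD_CAP_SECONDS" ]; then
    echo "hard-cap"; return 0
  fi
  if [ $((now - last_seen)) -ge "$IDLE_TTL_SECONDS" ]; then
    echo "idle"; return 0
  fi
  return 1
}

# Usage: a preview created at t=0, last pinged 10 minutes before now.
should_reap 0 3000 3600 yes || echo "keep"
```

Because the function holds no state between calls, running it on two replicas with the same cluster snapshot cannot produce conflicting decisions.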
Verify preview gateway
# Gateway should be Programmed=True with an assigned address
kubectl get gateway agentweaver-preview-gateway -n agentweaver
# NetworkPolicy admitting preview gateway → sandbox ports 3000-9000
kubectl get networkpolicy sandbox-allow-preview-ingress -n agentweaver
# List active preview HTTPRoutes (empty when no previews are running)
kubectl get httproute -n agentweaver -l agentweaver.dev/preview=true
See the Sandbox browser preview deep dive for the full wiring and security model, and the reference for all configuration knobs.
bash scripts/aks/40-verify.sh
Checks performed:
- Pod running counts (api ≥1, frontend ≥1, mcp ≥1, sandbox warm pods ≥1)
- Gateway Programmed=True with an assigned address
- All three HTTPRoutes Accepted=True and ResolvedRefs=True
- Static agentweaver-secrets SecretProviderClass exists and at least one mounted API SPC has synced; AgentHost user tokens are fetched at runtime from Key Vault, not mounted via SPC
- API RBAC Role and RoleBinding present; SA can create SandboxClaims and pods/exec
- kata-vm-isolation RuntimeClass present
- SandboxTemplate and SandboxWarmPool exist
- HTTP smoke tests: GET /, GET /api/health, GET /mcp/health all return 200
Updating a deployment
To deploy a new version:
# Rebuild images with the new commit SHA
export IMAGE_TAG=$(git rev-parse --short HEAD)
bash scripts/aks/20-build-push-images.sh
# Re-deploy (re-renders manifests with new IMAGE_TAG)
IDENTITY_CLIENT_ID=<value> \
KEYVAULT_NAME=agentweaver-kv \
TENANT_ID=$(az account show --query tenantId -o tsv) \
bash scripts/aks/30-deploy.sh
The API Deployment uses strategy: Recreate — the old pod terminates before the new one starts, ensuring the RWO Azure Disk is not multi-attached.
Troubleshooting
Gateway not Programmed
kubectl describe gateway agentweaver-gateway -n agentweaver
Common causes:
- DefaultDomainCertificate not yet Available — wait a few minutes; check with kubectl get defaultdomaincertificate cert -n agentweaver
- Preview feature not registered — see Step 2 feature registration note
- gatewayClassName: approuting-istio not available — confirm the cluster was created with --enable-app-routing-istio
Pods in ImagePullBackOff
kubectl describe pod <pod-name> -n agentweaver | grep -A10 Events
Common causes:
- ACR not attached: az aks show --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP --query addonProfiles.acrProfile
- Wrong image tag: $IMAGE_TAG must match what was pushed in Step 4
- Image not pushed: re-run scripts/aks/20-build-push-images.sh
API pod in CrashLoopBackOff
kubectl logs -n agentweaver -l app=agentweaver-api --previous
Common causes:
- PVC not bound: kubectl get pvc -n agentweaver — wait for STATUS=Bound
- Secrets not synced: check CSI driver logs: kubectl logs -n kube-system -l app=secrets-store-csi-driver
- Missing IDENTITY_CLIENT_ID / KEYVAULT_NAME / TENANT_ID at deploy time — the SecretProviderClass will have wrong values and the CSI driver will fail to authenticate
- EF migration failed: check init container logs: kubectl logs -n agentweaver <api-pod-name> -c migrate-memory-db
502 / 503 from frontend or API
The NetworkPolicy may be blocking gateway traffic. Verify the gateway pods carry the expected label:
kubectl get pods -n agentweaver \
-l gateway.networking.k8s.io/gateway-name=agentweaver-gateway \
--show-labels
The networkpolicy-default-deny.yaml allows ingress from pods with label gateway.networking.k8s.io/gateway-name: agentweaver-gateway. If those pods are missing or carry a different label, the policy will silently drop traffic.
Secrets not mounted in pod
# Check SecretProviderClassPodStatus objects
kubectl get secretproviderclasspodstatus -n agentweaver
# Check CSI driver on the node
kubectl logs -n kube-system -l app=secrets-store-csi-driver | tail -50
If the managed identity federation is misconfigured, the CSI driver will log 403 errors from Key Vault.
If cluster diagnostics show key_vault: critical: secret 'mcp-oauth-signing-key' not found, the required signing-key provisioning step was skipped. Run bash scripts/aks/16-provision-oauth-signing-key.sh and redeploy; do not work around this with --skip-oauth-key in production.
For AgentHost pods, there is no run-scoped SPC to check. Verify the warm pool and pod configuration instead:
kubectl describe sandboxwarmpool agentweaver-agent-host -n agentweaver
kubectl describe pod <agenthost-pod-name> -n agentweaver
The pod should have AgentHost__KeyVaultUri, workload identity labels/env, no /mnt/user-tokens mount, and logs showing standby followed by /configure and agent ready.
Sandbox pods not appearing in warm pool
kubectl describe sandboxwarmpool agentweaver-sandbox -n agentweaver
kubectl get events -n agentweaver --sort-by='.lastTimestamp' | tail -20
Common causes:
- kata-vm-isolation RuntimeClass missing — re-run scripts/aks/10-create-cluster.sh or check kubectl get runtimeclass
- Agent-sandbox controller not running: kubectl get pods -n agent-sandbox-system
- ACR pull failure for sandbox image — check pod events on a sandbox pod
AgentHost pod crashes with missing Copilot runtime
If an AgentHost pod logs Copilot runtime not found at '/app/runtimes/linux-x64/native/copilot', rebuild agentweaver-agent-host from apps/Agentweaver.AgentHost/Dockerfile. Its dotnet publish step must include --runtime linux-x64 --self-contained false; publishing without a RID omits the GitHub.Copilot.SDK native binary from the image.
Istio / approuting-istio clarification
The approuting-istio GatewayClass is used for gateway routing only — provisioning the public LoadBalancer and TLS termination. It does not enroll workload pods in an Istio service mesh. No sidecars, no ambient mode, no ztunnel runs on agentweaver workload pods.
Inter-pod security is enforced exclusively by Cilium NetworkPolicy. If you see unexpected traffic drops between pods, inspect Cilium resources — not Istio:
kubectl get networkpolicies -n agentweaver
kubectl get ciliumnetworkpolicies -n agentweaver