This post is the Azure installment of the Well-Architected series, which walks the same pattern across AWS, Azure, and GCP. Here we're going deep on Azure — specifically, how to deploy a private AKS cluster inside an Azure Landing Zone that satisfies all five pillars of the Well-Architected Framework, fully automated with Terraform.
This isn't a "hello world" AKS deployment. This is the pattern you'd use for a regulated enterprise workload — private API server, hub-spoke networking, Azure AD–integrated RBAC, Defender for Containers, and Azure Policy guardrails. The kind of setup that passes a Well-Architected Review on the first try.
What Is an Azure Landing Zone?
An Azure Landing Zone is a pre-configured, governed environment into which you deploy workloads. Think of it as the "plot of land" your application lives on — with the networking, identity, governance, and security foundations already in place.
Microsoft's Cloud Adoption Framework defines landing zones with:
- Management Group hierarchy for policy inheritance and governance at scale
- Hub-spoke network topology with centralized egress via Azure Firewall
- Subscription-level isolation between workloads (dev, staging, production)
- Azure Policy for guardrails — enforcing tagging, encryption, allowed regions, SKUs
- Centralized logging via Log Analytics Workspace and Azure Monitor
Our AKS cluster will land inside this structure, inheriting governance and connectivity from the landing zone rather than re-inventing it.
Architecture Overview
Here's the high-level architecture we're building:
flowchart TB
subgraph MG["Management Group"]
direction LR
subgraph HUB["Hub Subscription"]
direction TB
subgraph HVNET["Hub VNet — 10.0.0.0/16"]
FW["Azure Firewall\n+ UDR"]
BASTION["Azure Bastion"]
DNS["Private DNS Zones"]
end
end
subgraph SPOKE["Spoke Subscription — AKS Workload"]
direction TB
subgraph SVNET["Spoke VNet — 10.1.0.0/16"]
subgraph AKSSUB["AKS Subnet — 10.1.0.0/22"]
AKS["Private AKS Cluster\nAzure AD RBAC\nCalico Network Policy"]
end
subgraph PESUB["Private Endpoint Subnet — 10.1.4.0/24"]
ACR["ACR\nPrivate Link"]
KV["Key Vault\nPrivate Link"]
end
end
end
end
HVNET <-->|"VNet Peering"| SVNET
AKS -->|"Egress via UDR"| FW
AKS -->|"Image Pull"| ACR
AKS -->|"Secrets"| KV
BASTION -->|"kubectl Access"| AKS
FW -->|"Filtered Egress"| INTERNET(("Internet"))
classDef hubStyle fill:#0d2137,stroke:#FF9900,stroke-width:2px,color:#e0e0e0
classDef spokeStyle fill:#0d2137,stroke:#00A4EF,stroke-width:2px,color:#e0e0e0
classDef serviceStyle fill:#1a2332,stroke:#00f2ff,stroke-width:1px,color:#e0e0e0
classDef subnetStyle fill:#112240,stroke:#00f2ff,stroke-width:1px,color:#e0e0e0,stroke-dasharray:5
classDef internetStyle fill:#2d1b37,stroke:#ff6b9d,stroke-width:2px,color:#e0e0e0
class HUB hubStyle
class SPOKE spokeStyle
class FW,BASTION,DNS,AKS,ACR,KV serviceStyle
class AKSSUB,PESUB subnetStyle
class INTERNET internetStyle
Pillar 1: Security
Security is where private AKS really earns its keep. Here's what our Terraform modules configure:
Private API Server
The AKS API server is exposed only on a private IP within the spoke VNet. No public
endpoint. kubectl access happens through Azure Bastion or a jump box in the hub,
or via an Azure DevOps self-hosted agent running inside the VNet.
resource "azurerm_kubernetes_cluster" "aks" {
name = "aks-${var.environment}-${var.region}"
location = azurerm_resource_group.spoke.location
resource_group_name = azurerm_resource_group.spoke.name
dns_prefix = "aks-${var.environment}"
private_cluster_enabled = true
private_dns_zone_id = azurerm_private_dns_zone.aks.id
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
outbound_type = "userDefinedRouting"
}
identity {
type = "UserAssigned"
identity_ids = [azurerm_user_assigned_identity.aks.id]
}
azure_active_directory_role_based_access_control {
azure_rbac_enabled = true
managed = true
admin_group_object_ids = [var.aks_admin_group_id]
}
}
Azure AD Integration & Managed Identity
No service principal secrets. The cluster uses a User-Assigned Managed Identity
for control plane operations and Workload Identity for pod-level Azure
resource access. Azure AD RBAC maps Kubernetes RBAC roles to Azure AD groups — so
kubectl authentication flows through your organization's identity provider.
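In Terraform, that means enabling the OIDC issuer and workload identity on the cluster, creating a user-assigned identity for the application, and federating it with a Kubernetes service account. A minimal sketch — the application identity name, namespace, and service account are assumptions:

# On azurerm_kubernetes_cluster.aks, additionally:
#   oidc_issuer_enabled       = true
#   workload_identity_enabled = true

resource "azurerm_user_assigned_identity" "app" {
  name                = "id-app-${var.environment}"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name
}

# Federation lets pods running as this service account exchange their
# projected token for Azure AD tokens, with no stored secrets.
resource "azurerm_federated_identity_credential" "app" {
  name                = "fic-app-${var.environment}"
  resource_group_name = azurerm_resource_group.spoke.name
  parent_id           = azurerm_user_assigned_identity.app.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = azurerm_kubernetes_cluster.aks.oidc_issuer_url
  subject             = "system:serviceaccount:app-namespace:app-sa" # placeholders
}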
Network Policies & NSGs
Calico network policies enforce pod-to-pod traffic rules. NSGs on every subnet restrict inbound/outbound traffic. The AKS subnet's NSG only allows traffic from the hub's Azure Firewall subnet and internal load balancers.
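As an illustration, here's an inbound rule that admits only the hub firewall's subnet into the AKS subnet. The 10.0.1.0/26 prefix for AzureFirewallSubnet and the NSG resource name are assumptions of this sketch:

resource "azurerm_network_security_rule" "allow_from_hub_firewall" {
  name                        = "Allow-Hub-Firewall-Inbound"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefix       = "10.0.1.0/26" # AzureFirewallSubnet in the hub (assumed)
  destination_address_prefix  = "10.1.0.0/22" # AKS subnet
  resource_group_name         = azurerm_resource_group.spoke.name
  network_security_group_name = azurerm_network_security_group.aks.name
}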
Azure Policy for AKS
The Azure Policy add-on for AKS enforces OPA Gatekeeper constraints directly from Azure Policy definitions. We assign built-in policy initiatives like:
- Kubernetes cluster should not allow privileged containers
- Kubernetes cluster containers should only use allowed images
- Kubernetes cluster should not allow container privilege escalation
- Kubernetes clusters should use internal load balancers
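Assigning a built-in from Terraform can reference it by display name rather than a hard-coded GUID. The initiative below, the Linux pod security baseline, is one example:

data "azurerm_policy_set_definition" "pod_security_baseline" {
  display_name = "Kubernetes cluster pod security baseline standards for Linux-based workloads"
}

resource "azurerm_resource_group_policy_assignment" "aks_baseline" {
  name                 = "aks-pod-security-baseline"
  resource_group_id    = azurerm_resource_group.spoke.id
  policy_definition_id = data.azurerm_policy_set_definition.pod_security_baseline.id
}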
Defender for Containers
Microsoft Defender for Containers provides runtime threat detection, vulnerability scanning for ACR images, and security posture management. Enabled at the subscription level:
resource "azurerm_security_center_subscription_pricing" "containers" {
tier = "Standard"
resource_type = "ContainerRegistry"
}
resource "azurerm_security_center_subscription_pricing" "kubernetes" {
tier = "Standard"
resource_type = "KubernetesService"
}
Pillar 2: Reliability
Availability Zones
The AKS node pools span the region's three availability zones. Every pool sets zones = ["1", "2", "3"], and the system node pool runs at least 3 nodes so that each zone holds one node.
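A zone-spread user node pool with autoscaling looks roughly like this (assuming azurerm provider 3.x; the max_count ceiling is an illustrative value):

resource "azurerm_kubernetes_cluster_node_pool" "general" {
  name                  = "general"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D8s_v5" # matches the node pool table in Pillar 3
  zones                 = ["1", "2", "3"]
  vnet_subnet_id        = azurerm_subnet.aks.id

  enable_auto_scaling = true
  min_count           = 3  # one node per zone at minimum
  max_count           = 12 # illustrative ceiling
}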
Pod Disruption Budgets
Every workload deployed to the cluster must declare a PodDisruptionBudget. Azure Policy enforces this — deployments without a PDB are rejected at admission.
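Those budgets normally ship with the workload manifests through GitOps (see Pillar 5), but for illustration, the equivalent expressed with the Terraform kubernetes provider would look roughly like this (app name and namespace are placeholders):

resource "kubernetes_pod_disruption_budget_v1" "app" {
  metadata {
    name      = "app-pdb"
    namespace = "app-namespace" # placeholder
  }

  spec {
    min_available = "2" # keep at least two replicas up during voluntary disruptions

    selector {
      match_labels = {
        app = "my-app" # placeholder
      }
    }
  }
}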
Multi-Region Readiness
The Terraform modules are parameterized by region. Spinning up a secondary cluster in a paired region is a variable change, not an architecture change. Azure Front Door or Traffic Manager handles cross-region routing.
Cluster Autoscaler + Node Auto-Provisioning
The cluster autoscaler dynamically adjusts node count based on pending pod demand. Node Auto-Provisioning (NAP), AKS's managed Karpenter integration (in preview at the time of writing), goes further by also selecting VM sizes on the fly, so the cluster right-sizes itself continuously.
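Autoscaler behavior is tuned on the cluster resource through the auto_scaler_profile block. A conservative starting point — the values here are illustrative, not prescriptive:

# Inside the azurerm_kubernetes_cluster.aks resource:
auto_scaler_profile {
  balance_similar_node_groups = true          # keep zone-spread pools balanced on scale-up
  expander                    = "least-waste" # prefer the pool that strands the least capacity
  scale_down_unneeded         = "10m"         # wait 10 minutes before reclaiming idle nodes
}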
Pillar 3: Performance Efficiency
Azure CNI Overlay
We use Azure CNI Overlay instead of traditional Azure CNI (the network_plugin_mode = "overlay" setting in the cluster resource under Pillar 1). Pods get their own CIDR range overlaid on the node network, which means:
- No VNet IP exhaustion — pods don't consume subnet IPs
- Larger node pools without massive subnet pre-allocation
- Better IP planning for hub-spoke topologies
Dedicated Node Pools
Workloads are segregated across purpose-built node pools:
| Node Pool | VM SKU | Purpose | Taints |
|---|---|---|---|
| system | Standard_D4s_v5 | CoreDNS, kube-proxy, metrics | CriticalAddonsOnly=true |
| general | Standard_D8s_v5 | Stateless application workloads | — |
| memory | Standard_E8s_v5 | In-memory caches, data-heavy pods | workload=memory:NoSchedule |
| spot | Standard_D8s_v5 | Batch jobs, non-critical processing | kubernetes.azure.com/scalesetpriority=spot:NoSchedule |
Azure Container Registry with Private Link
ACR is connected to the spoke VNet via Private Endpoint. Image pulls happen over the Azure backbone — no public internet exposure, no internet egress charges, and significantly faster pulls for large images.
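The Terraform wiring is a private endpoint in the spoke's PE subnet plus a private DNS zone group. Resource names below follow this post's module layout but are otherwise assumptions:

resource "azurerm_private_endpoint" "acr" {
  name                = "pe-acr-${var.environment}"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "acr-connection"
    private_connection_resource_id = azurerm_container_registry.acr.id
    is_manual_connection           = false
    subresource_names              = ["registry"]
  }

  private_dns_zone_group {
    name                 = "acr-dns"
    private_dns_zone_ids = [azurerm_private_dns_zone.acr.id] # privatelink.azurecr.io
  }
}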
Pillar 4: Cost Optimization
Spot Node Pools
Non-critical workloads (batch processing, dev/test environments, CI runners) land on Spot VM node pools at up to 90% discount. The cluster autoscaler handles eviction gracefully by rescheduling to on-demand pools when spot capacity is reclaimed.
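In Terraform, the spot pool from the Pillar 3 table looks like this. A spot_max_price of -1 means "pay up to the on-demand rate"; such VMs are evicted only for capacity, never price:

resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D8s_v5"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # cap at the current on-demand price

  enable_auto_scaling = true
  min_count           = 0
  max_count           = 10 # illustrative ceiling

  # AKS applies this taint to spot pools; workloads must tolerate it explicitly.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
}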
Start/Stop for Non-Production
Dev and staging clusters use AKS's built-in start/stop capability. An Azure Automation runbook stops clusters outside business hours — saving ~65% on non-production compute:
resource "azurerm_automation_schedule" "stop_aks" {
name = "stop-aks-${var.environment}"
resource_group_name = azurerm_resource_group.spoke.name
automation_account_name = azurerm_automation_account.ops.name
frequency = "Day"
start_time = "2026-04-05T19:00:00+05:30"
timezone = "Asia/Kolkata"
}
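A schedule on its own stops nothing: it has to be linked to a runbook that performs the actual stop (az aks stop or Stop-AzAksCluster). A sketch of that link, assuming a runbook named stop-aks-runbook exists in the same Automation account:

resource "azurerm_automation_job_schedule" "stop_aks" {
  resource_group_name     = azurerm_resource_group.spoke.name
  automation_account_name = azurerm_automation_account.ops.name
  schedule_name           = azurerm_automation_schedule.stop_aks.name
  runbook_name            = "stop-aks-runbook" # hypothetical runbook wrapping az aks stop
}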
Right-Sizing with Cost Analysis
Azure Advisor recommendations are exported to Log Analytics and surfaced in a Grafana
dashboard. Over-provisioned node pools are flagged weekly, and the Terraform
min_count/max_count values are adjusted accordingly.
Reserved Instances for Baseline
For the system and general node pools that run 24/7 in production, we use 1-year Azure Reservations on the underlying VM SKUs — locking in ~35% savings on baseline compute.
Pillar 5: Operational Excellence
Hub-Spoke Networking
All egress traffic from the AKS spoke flows through the hub's Azure Firewall via UDR (User-Defined Routes). This provides:
- Centralized egress filtering and logging
- FQDN-based rules for allowed outbound targets
- Network-level audit trail for compliance
- Single pane of glass for all spoke traffic inspection
resource "azurerm_route_table" "aks_to_firewall" {
name = "rt-aks-to-firewall"
location = azurerm_resource_group.spoke.location
resource_group_name = azurerm_resource_group.spoke.name
route {
name = "default-to-firewall"
address_prefix = "0.0.0.0/0"
next_hop_type = "VirtualAppliance"
next_hop_in_ip_address = azurerm_firewall.hub.ip_configuration[0].private_ip_address
}
}
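The route table only takes effect once it's associated with the AKS subnet — and with outbound_type = "userDefinedRouting", this association must exist before the cluster is created:

resource "azurerm_subnet_route_table_association" "aks" {
  subnet_id      = azurerm_subnet.aks.id
  route_table_id = azurerm_route_table.aks_to_firewall.id
}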
Observability Stack
The monitoring setup uses Azure-native tooling:
- Container Insights for node/pod metrics, stdout/stderr logs, and Prometheus metric scraping
- Log Analytics Workspace as the central sink — ingesting AKS diagnostic logs, NSG flow logs, Azure Firewall logs, and Defender alerts
- Azure Managed Grafana for dashboards, connected to Azure Monitor and Log Analytics as data sources
- Azure Alerts for node pressure, pod restart loops, and failed deployments — routed to Teams/PagerDuty via Action Groups
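Much of the list above reduces to azurerm_monitor_diagnostic_setting resources pointed at the central workspace. As one example, shipping AKS control-plane logs (the categories shown are a subset; enable what your compliance scope requires):

resource "azurerm_monitor_diagnostic_setting" "aks" {
  name                       = "diag-aks"
  target_resource_id         = azurerm_kubernetes_cluster.aks.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id

  enabled_log {
    category = "kube-audit"
  }

  enabled_log {
    category = "kube-apiserver"
  }

  metric {
    category = "AllMetrics"
  }
}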
GitOps with Flux v2
The AKS cluster bootstraps Flux v2 via the azurerm_kubernetes_flux_configuration
resource. All workload manifests, Helm charts, and Kustomize overlays are stored
in a Git repository. Changes to the cluster state are pull-based — no kubectl
apply from CI/CD pipelines, no credentials leaked in pipeline logs.
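In Terraform, that bootstrap is two resources: the Flux cluster extension and the configuration itself. A sketch — repository URL, branch, and path are placeholders:

resource "azurerm_kubernetes_cluster_extension" "flux" {
  name           = "flux"
  cluster_id     = azurerm_kubernetes_cluster.aks.id
  extension_type = "microsoft.flux"
}

resource "azurerm_kubernetes_flux_configuration" "workloads" {
  name       = "workloads"
  cluster_id = azurerm_kubernetes_cluster.aks.id
  namespace  = "flux-system"

  git_repository {
    url             = "https://github.com/example-org/aks-workloads" # placeholder
    reference_type  = "branch"
    reference_value = "main"
  }

  kustomizations {
    name = "apps"
    path = "./apps/production" # placeholder path
  }

  depends_on = [azurerm_kubernetes_cluster_extension.flux]
}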
Infrastructure as Code — Everything
Every resource in this architecture is defined in Terraform. No portal clicks, no imperative scripts. The module structure:
terraform/
├── modules/
│ ├── hub-networking/ # Hub VNet, Firewall, Bastion, DNS
│ ├── spoke-networking/ # Spoke VNet, Peering, UDRs, NSGs
│ ├── aks-cluster/ # AKS + node pools + identity
│ ├── acr/ # Container Registry + Private Endpoint
│ ├── key-vault/ # Key Vault + RBAC + PE
│ ├── monitoring/ # Log Analytics, Container Insights, Alerts
│ └── governance/ # Azure Policy assignments
├── environments/
│ ├── dev.tfvars
│ ├── staging.tfvars
│ └── prod.tfvars
└── main.tf # Composition root
Mapping to the Five Pillars
Here's how the architecture maps to each Well-Architected pillar:
| Pillar | Key Implementations |
|---|---|
| Security | Private API server, Azure AD RBAC, Workload Identity, Managed Identity, Calico network policies, NSGs, Defender for Containers, Azure Policy (OPA) |
| Reliability | 3-zone node pools, PodDisruptionBudgets, cluster autoscaler, multi-region-ready modules, health probes |
| Performance | Azure CNI Overlay, dedicated node pools with taints, ACR Private Endpoint, ephemeral OS disks |
| Cost Optimization | Spot node pools, start/stop automation, Reserved Instances, right-sizing dashboards, Azure Advisor |
| Ops Excellence | Hub-spoke with Azure Firewall, Container Insights + Managed Grafana, Flux v2 GitOps, 100% Terraform IaC |
Getting Started
The complete Terraform modules for this architecture are open source on GitHub: amartyaa/azure-waf-aks-landing-zone. Clone the repo, configure your .tfvars, and deploy:
git clone https://github.com/amartyaa/azure-waf-aks-landing-zone.git
cd azure-waf-aks-landing-zone
# Configure your environment
cp environments/dev.tfvars.example environments/dev.tfvars
# Edit dev.tfvars with your subscription, region, and AD group IDs
terraform init
terraform plan -var-file=environments/dev.tfvars
terraform apply -var-file=environments/dev.tfvars
What's Next
This post covered the Azure track of the Well-Architected series. Up next:
- AWS Well-Architected with Terraform — EKS in a landing zone with VPC, Transit Gateway, and IAM Roles for Service Accounts
- GCP Architecture Framework — GKE with VPC Service Controls, Workload Identity Federation, and Organization Policies
Each post in the series comes with its own GitHub repository containing the complete, production-ready Terraform modules.
Building a private AKS deployment? Hit a snag with hub-spoke networking or Workload Identity? Get in touch or connect with me on LinkedIn.