
Azure Well-Architected: Private AKS in a Landing Zone with Terraform


This is part of the Azure Well-Architected series (itself part of the broader Well-Architected series). In this post, we're going deep on Azure — specifically, how to deploy a private AKS cluster inside an Azure Landing Zone that satisfies all five pillars of the Well-Architected Framework, fully automated with Terraform.

This isn't a "hello world" AKS deployment. This is the pattern you'd use for a regulated enterprise workload — private API server, hub-spoke networking, Azure AD–integrated RBAC, Defender for Containers, and Azure Policy guardrails. The kind of setup that passes a Well-Architected Review on the first try.

What Is an Azure Landing Zone?

An Azure Landing Zone is a pre-configured, governed environment into which you deploy workloads. Think of it as the "plot of land" your application lives on — with the networking, identity, governance, and security foundations already in place.

Microsoft's Cloud Adoption Framework defines landing zones with:

  • Management Group hierarchy for policy inheritance and governance at scale
  • Hub-spoke network topology with centralized egress via Azure Firewall
  • Subscription-level isolation between workloads (dev, staging, production)
  • Azure Policy for guardrails — enforcing tagging, encryption, allowed regions, SKUs
  • Centralized logging via Log Analytics Workspace and Azure Monitor

Our AKS cluster will land inside this structure, inheriting governance and connectivity from the landing zone rather than re-inventing it.

Architecture Overview

Here's the high-level architecture we're building:

flowchart TB
    subgraph MG["Management Group"]
        direction LR
        subgraph HUB["Hub Subscription"]
            direction TB
            subgraph HVNET["Hub VNet — 10.0.0.0/16"]
                FW["Azure Firewall\n+ UDR"]
                BASTION["Azure Bastion"]
                DNS["Private DNS Zones"]
            end
        end
        subgraph SPOKE["Spoke Subscription — AKS Workload"]
            direction TB
            subgraph SVNET["Spoke VNet — 10.1.0.0/16"]
                subgraph AKSSUB["AKS Subnet — 10.1.0.0/22"]
                    AKS["Private AKS Cluster\nAzure AD RBAC\nCalico Network Policy"]
                end
                subgraph PESUB["Private Endpoint Subnet — 10.1.4.0/24"]
                    ACR["ACR\nPrivate Link"]
                    KV["Key Vault\nPrivate Link"]
                end
            end
        end
    end

    HVNET <-->|"VNet Peering"| SVNET
    AKS -->|"Egress via UDR"| FW
    AKS -->|"Image Pull"| ACR
    AKS -->|"Secrets"| KV
    BASTION -->|"kubectl Access"| AKS
    FW -->|"Filtered Egress"| INTERNET(("Internet"))

    classDef hubStyle fill:#0d2137,stroke:#FF9900,stroke-width:2px,color:#e0e0e0
    classDef spokeStyle fill:#0d2137,stroke:#00A4EF,stroke-width:2px,color:#e0e0e0
    classDef serviceStyle fill:#1a2332,stroke:#00f2ff,stroke-width:1px,color:#e0e0e0
    classDef subnetStyle fill:#112240,stroke:#00f2ff,stroke-width:1px,color:#e0e0e0,stroke-dasharray:5
    classDef internetStyle fill:#2d1b37,stroke:#ff6b9d,stroke-width:2px,color:#e0e0e0

    class HUB hubStyle
    class SPOKE spokeStyle
    class FW,BASTION,DNS,AKS,ACR,KV serviceStyle
    class AKSSUB,PESUB subnetStyle
    class INTERNET internetStyle
                    

Pillar 1: Security

Security is where private AKS really earns its keep. Here's what our Terraform modules configure:

Private API Server

The AKS API server is exposed only on a private IP within the spoke VNet. No public endpoint. kubectl access happens through Azure Bastion or a jump box in the hub, or via an Azure DevOps self-hosted agent running inside the VNet.

resource "azurerm_kubernetes_cluster" "aks" {
  name                       = "aks-${var.environment}-${var.region}"
  location                   = azurerm_resource_group.spoke.location
  resource_group_name        = azurerm_resource_group.spoke.name
  dns_prefix_private_cluster = "aks-${var.environment}" # used in place of dns_prefix when bringing your own private DNS zone
  private_cluster_enabled    = true
  private_dns_zone_id        = azurerm_private_dns_zone.aks.id
  oidc_issuer_enabled        = true
  workload_identity_enabled  = true
  azure_policy_enabled       = true

  default_node_pool {
    name                         = "system"
    vm_size                      = "Standard_D4s_v5"
    node_count                   = 3
    zones                        = ["1", "2", "3"]
    only_critical_addons_enabled = true
    vnet_subnet_id               = azurerm_subnet.aks.id
  }

  network_profile {
    network_plugin      = "azure"
    network_plugin_mode = "overlay" # Azure CNI Overlay (see the Performance Efficiency pillar)
    network_policy      = "calico"
    load_balancer_sku   = "standard"
    outbound_type       = "userDefinedRouting"
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.aks.id]
  }

  azure_active_directory_role_based_access_control {
    azure_rbac_enabled     = true
    managed                = true
    admin_group_object_ids = [var.aks_admin_group_id]
  }
}

Azure AD Integration & Managed Identity

No service principal secrets. The cluster uses a User-Assigned Managed Identity for control plane operations and Workload Identity for pod-level Azure resource access. Azure AD RBAC maps Kubernetes RBAC roles to Azure AD groups — so kubectl authentication flows through your organization's identity provider.
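
To enable this, the cluster exposes an OIDC issuer (the oidc_issuer_enabled and workload_identity_enabled flags in the cluster resource above), and each application identity is federated against a Kubernetes service account. A minimal sketch, where the "app" identity, namespace, and service account names are illustrative:

resource "azurerm_user_assigned_identity" "app" {
  name                = "uai-app-${var.environment}"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name
}

resource "azurerm_federated_identity_credential" "app" {
  name                = "fic-app-${var.environment}"
  resource_group_name = azurerm_resource_group.spoke.name
  parent_id           = azurerm_user_assigned_identity.app.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = azurerm_kubernetes_cluster.aks.oidc_issuer_url
  # Illustrative namespace/service-account pair the pods run under
  subject             = "system:serviceaccount:app-namespace:app-service-account"
}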

Network Policies & NSGs

Calico network policies enforce pod-to-pod traffic rules. NSGs on every subnet restrict inbound/outbound traffic. The AKS subnet's NSG only allows traffic from the hub's Azure Firewall subnet and internal load balancers.
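
As a sketch, an NSG rule admitting traffic from the hub firewall subnet into the AKS subnet might look like the following (the firewall subnet CIDR and the azurerm_network_security_group.aks reference are assumptions):

resource "azurerm_network_security_rule" "allow_from_firewall" {
  name                        = "allow-inbound-from-hub-firewall"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefix       = "10.0.1.0/26" # hub AzureFirewallSubnet (assumed CIDR)
  destination_address_prefix  = "10.1.0.0/22" # AKS subnet
  resource_group_name         = azurerm_resource_group.spoke.name
  network_security_group_name = azurerm_network_security_group.aks.name # assumed NSG resource
}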

Azure Policy for AKS

The Azure Policy add-on for AKS enforces OPA Gatekeeper constraints directly from Azure Policy definitions. We assign built-in policy initiatives such as the following (assignment sketch after the list):

  • Kubernetes cluster should not allow privileged containers
  • Kubernetes cluster containers should only use allowed images
  • Kubernetes cluster should not allow container privilege escalation
  • Kubernetes clusters should use internal load balancers
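
With the add-on enabled on the cluster (azure_policy_enabled = true), assigning a built-in initiative at the spoke resource group might look like this sketch, using the pod security baseline initiative as the example:

data "azurerm_policy_set_definition" "aks_pod_security" {
  display_name = "Kubernetes cluster pod security baseline standards for Linux-based workloads"
}

resource "azurerm_resource_group_policy_assignment" "aks_pod_security" {
  name                 = "aks-pod-security-baseline"
  resource_group_id    = azurerm_resource_group.spoke.id
  policy_definition_id = data.azurerm_policy_set_definition.aks_pod_security.id
}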

Defender for Containers

Microsoft Defender for Containers provides runtime threat detection, vulnerability scanning for ACR images, and security posture management. Enabled at the subscription level:

resource "azurerm_security_center_subscription_pricing" "containers" {
  tier          = "Standard"
  resource_type = "ContainerRegistry"
}

resource "azurerm_security_center_subscription_pricing" "kubernetes" {
  tier          = "Standard"
  resource_type = "KubernetesService"
}

Pillar 2: Reliability

Availability Zones

The AKS node pools span all three availability zones in the region. The node pool configuration uses zones = ["1", "2", "3"], and the system node pool has a minimum of 3 nodes to guarantee one node per zone.

Pod Disruption Budgets

Every workload deployed to the cluster must declare a PodDisruptionBudget. Azure Policy enforces this — deployments without a PDB are rejected at admission.
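
For context, a minimal PDB, expressed here with the Terraform kubernetes provider to stay in one language; in this architecture such manifests would normally live in the Flux-managed Git repository (names and selector are illustrative):

resource "kubernetes_pod_disruption_budget_v1" "api" {
  metadata {
    name      = "api-pdb"
    namespace = "app-namespace" # illustrative namespace
  }

  spec {
    min_available = "50%" # keep at least half the replicas during voluntary disruptions

    selector {
      match_labels = {
        app = "api"
      }
    }
  }
}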

Multi-Region Readiness

The Terraform modules are parameterized by region. Spinning up a secondary cluster in a paired region is a variable change, not an architecture change. Azure Front Door or Traffic Manager handles cross-region routing.

Cluster Autoscaler + Node Auto-Provisioning

The cluster autoscaler dynamically adjusts node count based on pending pod demand. Combined with Node Auto-Provisioning (NAP, AKS's managed implementation of Karpenter, currently in preview), the cluster right-sizes itself continuously.
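
A user node pool with autoscaling bounds might look like the following sketch (pool name, SKU, and bounds are illustrative; with NAP enabled, explicit min/max tuning largely goes away):

resource "azurerm_kubernetes_cluster_node_pool" "general" {
  name                  = "general"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D8s_v5"
  zones                 = ["1", "2", "3"]
  enable_auto_scaling   = true
  min_count             = 3
  max_count             = 12
  vnet_subnet_id        = azurerm_subnet.aks.id # assumed AKS subnet resource
}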

Pillar 3: Performance Efficiency

Azure CNI Overlay

We use Azure CNI Overlay instead of traditional Azure CNI. This gives pods their own CIDR range overlaid on the node network, which means:

  • No VNet IP exhaustion — pods don't consume subnet IPs
  • Larger node pools without massive subnet pre-allocation
  • Better IP planning for hub-spoke topologies

Dedicated Node Pools

Workloads are segregated across purpose-built node pools (an example pool definition follows the table):

| Node Pool | VM SKU          | Purpose                              | Taints                                      |
|-----------|-----------------|--------------------------------------|---------------------------------------------|
| system    | Standard_D4s_v5 | CoreDNS, kube-proxy, metrics         | CriticalAddonsOnly=true                     |
| general   | Standard_D8s_v5 | Stateless application workloads      | (none)                                      |
| memory    | Standard_E8s_v5 | In-memory caches, data-heavy pods    | workload=memory:NoSchedule                  |
| spot      | Standard_D8s_v5 | Batch jobs, non-critical processing  | kubernetes.azure.com/scalesetpriority=spot  |
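
As an illustration, the memory pool from the table could be declared like this (autoscaling bounds and the subnet reference are assumptions):

resource "azurerm_kubernetes_cluster_node_pool" "memory" {
  name                  = "memory"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_E8s_v5"
  zones                 = ["1", "2", "3"]
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 6
  node_labels           = { workload = "memory" }
  node_taints           = ["workload=memory:NoSchedule"]
  vnet_subnet_id        = azurerm_subnet.aks.id # assumed AKS subnet resource
}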

Azure Container Registry with Private Link

ACR is connected to the spoke VNet via Private Endpoint. Image pulls happen over the Azure backbone — no public internet, no egress charges, and significantly faster pull times for large images.
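
A sketch of the ACR private endpoint wiring, assuming azurerm_container_registry.acr, the private-endpoint subnet, and a privatelink.azurecr.io private DNS zone are defined elsewhere in the acr module:

resource "azurerm_private_endpoint" "acr" {
  name                = "pe-acr-${var.environment}"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name
  subnet_id           = azurerm_subnet.private_endpoints.id # 10.1.4.0/24 subnet

  private_service_connection {
    name                           = "psc-acr"
    private_connection_resource_id = azurerm_container_registry.acr.id
    is_manual_connection           = false
    subresource_names              = ["registry"]
  }

  private_dns_zone_group {
    name                 = "acr-dns"
    private_dns_zone_ids = [azurerm_private_dns_zone.acr.id] # privatelink.azurecr.io
  }
}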

Pillar 4: Cost Optimization

Spot Node Pools

Non-critical workloads (batch processing, dev/test environments, CI runners) land on Spot VM node pools at up to 90% discount. The cluster autoscaler handles eviction gracefully by rescheduling to on-demand pools when spot capacity is reclaimed.
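
A spot pool sketch, reusing the SKU and taint from the node-pool table; a spot_max_price of -1 means "pay up to the current on-demand price":

resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D8s_v5"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 10
  node_taints           = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
  vnet_subnet_id        = azurerm_subnet.aks.id # assumed AKS subnet resource
}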

Start/Stop for Non-Production

Dev and staging clusters use AKS's built-in start/stop capability. An Azure Automation runbook stops clusters outside business hours — saving ~65% on non-production compute:

resource "azurerm_automation_schedule" "stop_aks" {
  name                    = "stop-aks-${var.environment}"
  resource_group_name     = azurerm_resource_group.spoke.name
  automation_account_name = azurerm_automation_account.ops.name
  frequency               = "Day"
  start_time              = "2026-04-05T19:00:00+05:30"
  timezone                = "Asia/Kolkata"
}

Right-Sizing with Cost Analysis

Azure Advisor recommendations are exported to Log Analytics and surfaced in a Grafana dashboard. Over-provisioned node pools are flagged weekly, and the Terraform min_count/max_count values are adjusted accordingly.

Reserved Instances for Baseline

For the system and general node pools that run 24/7 in production, we use 1-year Azure Reservations on the underlying VM SKUs — locking in ~35% savings on baseline compute.

Pillar 5: Operational Excellence

Hub-Spoke Networking

All egress traffic from the AKS spoke flows through the hub's Azure Firewall via UDR (User-Defined Routes). This provides:

  • Centralized egress filtering and logging
  • FQDN-based rules for allowed outbound targets
  • Network-level audit trail for compliance
  • Single pane of glass for all spoke traffic inspection

resource "azurerm_route_table" "aks_to_firewall" {
  name                = "rt-aks-to-firewall"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name

  route {
    name                   = "default-to-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = azurerm_firewall.hub.ip_configuration[0].private_ip_address
  }
}
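
The route table only takes effect once it is associated with the AKS subnet; a minimal association, assuming the subnet resource is named azurerm_subnet.aks:

resource "azurerm_subnet_route_table_association" "aks" {
  subnet_id      = azurerm_subnet.aks.id
  route_table_id = azurerm_route_table.aks_to_firewall.id
}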

Observability Stack

The monitoring setup uses Azure-native tooling (diagnostic-setting sketch after the list):

  • Container Insights for node/pod metrics, stdout/stderr logs, and Prometheus metric scraping
  • Log Analytics Workspace as the central sink — ingesting AKS diagnostic logs, NSG flow logs, Azure Firewall logs, and Defender alerts
  • Azure Managed Grafana for dashboards, connected to Azure Monitor and Log Analytics as data sources
  • Azure Alerts for node pressure, pod restart loops, and failed deployments — routed to Teams/PagerDuty via Action Groups
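
As a sketch, routing AKS control-plane logs into the workspace is a single diagnostic-setting resource; the workspace reference is an assumption and the log categories are trimmed for brevity:

resource "azurerm_monitor_diagnostic_setting" "aks" {
  name                       = "diag-aks-${var.environment}"
  target_resource_id         = azurerm_kubernetes_cluster.aks.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id # assumed central workspace

  enabled_log {
    category = "kube-audit"
  }

  enabled_log {
    category = "kube-apiserver"
  }

  metric {
    category = "AllMetrics"
  }
}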

GitOps with Flux v2

The AKS cluster bootstraps Flux v2 via the azurerm_kubernetes_flux_configuration resource. All workload manifests, Helm charts, and Kustomize overlays are stored in a Git repository. Changes to the cluster state are pull-based — no kubectl apply from CI/CD pipelines, no credentials leaked in pipeline logs.
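
A sketch of the bootstrap, assuming the Flux cluster extension is installed first and using a placeholder workload repository URL:

resource "azurerm_kubernetes_cluster_extension" "flux" {
  name           = "flux"
  cluster_id     = azurerm_kubernetes_cluster.aks.id
  extension_type = "microsoft.flux"
}

resource "azurerm_kubernetes_flux_configuration" "workloads" {
  name       = "workloads"
  cluster_id = azurerm_kubernetes_cluster.aks.id
  namespace  = "flux-system"

  git_repository {
    url             = "https://github.com/amartyaa/aks-workloads" # placeholder repo
    reference_type  = "branch"
    reference_value = "main"
  }

  kustomizations {
    name = "apps"
    path = "./apps/overlays/${var.environment}"
  }

  depends_on = [azurerm_kubernetes_cluster_extension.flux]
}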

Infrastructure as Code — Everything

Every resource in this architecture is defined in Terraform. No portal clicks, no imperative scripts. The module structure:

terraform/
├── modules/
│   ├── hub-networking/      # Hub VNet, Firewall, Bastion, DNS
│   ├── spoke-networking/    # Spoke VNet, Peering, UDRs, NSGs
│   ├── aks-cluster/         # AKS + node pools + identity
│   ├── acr/                 # Container Registry + Private Endpoint
│   ├── key-vault/           # Key Vault + RBAC + PE
│   ├── monitoring/          # Log Analytics, Container Insights, Alerts
│   └── governance/          # Azure Policy assignments
├── environments/
│   ├── dev.tfvars
│   ├── staging.tfvars
│   └── prod.tfvars
└── main.tf                  # Composition root

Mapping to the Five Pillars

Here's how the architecture maps to each Well-Architected pillar:

| Pillar            | Key Implementations                                                                                                                              |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| Security          | Private API server, Azure AD RBAC, Workload Identity, Managed Identity, Calico network policies, NSGs, Defender for Containers, Azure Policy (OPA) |
| Reliability       | 3-zone node pools, PodDisruptionBudgets, cluster autoscaler, multi-region-ready modules, health probes                                             |
| Performance       | Azure CNI Overlay, dedicated node pools with taints, ACR Private Endpoint, ephemeral OS disks                                                      |
| Cost Optimization | Spot node pools, start/stop automation, Reserved Instances, right-sizing dashboards, Azure Advisor                                                 |
| Ops Excellence    | Hub-spoke with Azure Firewall, Container Insights + Managed Grafana, Flux v2 GitOps, 100% Terraform IaC                                            |

Getting Started

The complete Terraform modules for this architecture are open-source:


GitHub Repository

amartyaa/azure-waf-aks-landing-zone — Clone, configure your .tfvars, and deploy.

To deploy:

git clone https://github.com/amartyaa/azure-waf-aks-landing-zone.git
cd azure-waf-aks-landing-zone

# Configure your environment
cp environments/dev.tfvars.example environments/dev.tfvars
# Edit dev.tfvars with your subscription, region, and AD group IDs

terraform init
terraform plan -var-file=environments/dev.tfvars
terraform apply -var-file=environments/dev.tfvars

What's Next

This post covered the Azure track of the Well-Architected series. Up next:

  • AWS Well-Architected with Terraform — EKS in a landing zone with VPC, Transit Gateway, and IAM Roles for Service Accounts
  • GCP Architecture Framework — GKE with VPC Service Controls, Workload Identity Federation, and Organization Policies

Each post in the series comes with its own GitHub repository containing the complete, production-ready Terraform modules.


Building a private AKS deployment? Hit a snag with hub-spoke networking or Workload Identity? Get in touch or connect with me on LinkedIn.