Kubernetes Cluster Upgrades: Production-Ready Guide

2025-09-15

Kubernetes Cluster Upgrades: No BS Guide

Managed Kubernetes clusters need regular upgrades for security patches, bug fixes, and new features. This is a technical guide covering GKE, EKS, and AKS upgrades with real-world procedures and rollback strategies.

Prerequisites: Do This First

  1. Deprecated API Detection Using kubent (Universal):
# install kubent
sh -c "$(curl -sSL https://git.io/install-kubent)"

# scan cluster
kubent --target-version=1.31.0
kubent --output=json --target-version=1.31.0

GKE Log Explorer Query:

resource.type="k8s_cluster"
labels."k8s.io.removed-release"="1.31"
protoPayload.authenticationInfo.principalEmail:("system:serviceaccount" OR "@")
protoPayload.authenticationInfo.principalEmail!~("system:serviceaccount:kube-system")

EKS Audit Logs (CloudWatch):

# search control plane audit logs for deprecated API calls (audit logging must be enabled on the cluster)
aws logs filter-log-events --log-group-name /aws/eks/your-cluster/cluster

  2. Compatibility Matrix Check
  • GKE: Verify Anthos Service Mesh/Istio compatibility
  • EKS: Check AWS Load Balancer Controller, EBS CSI driver versions
  • AKS: Validate Azure CNI, Application Gateway Ingress Controller
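
A quick way to see which versions you are actually running (assumes the add-ons live in kube-system; adjust the namespace to match your cluster):

# list controller/daemon images in kube-system to compare against each vendor's support matrix
kubectl get deploy,ds -n kube-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'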

  3. Resource Assessment
# check cluster capacity
kubectl top nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# review PDBs
kubectl get pdb --all-namespaces

# look for critical workloads
kubectl get deployments,statefulsets --all-namespaces

Upgrade Strategy

Environment Progression

Dev/Test clusters → Staging → Production (least critical → most critical)

Timing

  • Off-peak hours
  • Team availability for monitoring
  • Consider maintenance windows for automatic upgrades
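
For GKE, a recurring maintenance window keeps automatic upgrades inside your off-peak hours (times and recurrence below are placeholder values):

# GKE: constrain automatic upgrades to a weekend window
gcloud container clusters update CLUSTER_NAME \
  --zone=us-central1-a \
  --maintenance-window-start="2025-09-20T01:00:00Z" \
  --maintenance-window-end="2025-09-20T05:00:00Z" \
  --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA,SU"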

Platform-Specific Procedures

GKE Upgrade
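
Before picking a target version, confirm what the control plane actually offers in your location:

# list valid control plane and node versions for this zone
gcloud container get-server-config \
  --zone=us-central1-a \
  --format="yaml(validMasterVersions,validNodeVersions)"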

Control Plane:

# via gcloud
gcloud container clusters upgrade CLUSTER_NAME \
  --zone=us-central1-a \
  --master \
  --cluster-version=1.31.0-gke.1234567

# monitor
gcloud container operations list --filter="operationType=UPGRADE_MASTER"

Node Pools (configure surge settings first):
# set surge parameters
gcloud container node-pools update NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a \
  --max-surge-upgrade=20 \
  --max-unavailable-upgrade=0

# upgrade nodes
gcloud container node-pools upgrade NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a

EKS Upgrade

Control Plane:

# update cluster version
aws eks update-cluster-version \
  --region us-west-2 \
  --name my-cluster \
  --kubernetes-version 1.31

# monitor status
aws eks describe-cluster --region us-west-2 --name my-cluster

Node Groups:

# update managed node group
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --region us-west-2 \
  --kubernetes-version 1.31

# for self-managed: update launch template, then rolling update
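
For managed node groups you can block until the rollout settles; the waiter polls until the node group reports ACTIVE again:

# wait for the managed node group to finish the version update
aws eks wait nodegroup-active \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --region us-west-2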

Add-ons:
# update critical add-ons
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.1-eksbuild.1 \
  --region us-west-2
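
If you are unsure which add-on build matches the new cluster version, list the versions published for it:

# show vpc-cni versions compatible with Kubernetes 1.31
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.31 \
  --query 'addons[].addonVersions[].addonVersion' \
  --output text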

AKS Upgrade

Control Plane:

# get available versions
az aks get-versions --location eastus --output table

# upgrade cluster
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.31.0
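
The CLI can also upgrade just the control plane, which pairs well with the blue-green node pool approach covered later:

# upgrade only the control plane; node pools keep their current version
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.31.0 \
  --control-plane-only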

Node Pools:

# upgrade specific node pool
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name mynodepool \
  --kubernetes-version 1.31.0

Post-Upgrade Validation

Health Checks

# verify pods (exclude completed Jobs)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# check nodes
kubectl get nodes
kubectl describe nodes | grep -E "Conditions|Taints"

# component status (componentstatuses is deprecated; /readyz is the modern check)
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'

Application Testing

  • Critical endpoint validation (see the sketch after this list)
  • Database connectivity
  • Ingress/LoadBalancer functionality
  • Monitor metrics and logs
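
A minimal endpoint smoke test, using hypothetical hostnames; substitute your real health endpoints:

# check that critical endpoints answer with a 2xx after the upgrade
for url in https://app.example.com/healthz https://api.example.com/readyz; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> $code"
done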

Node Issues

# check for common upgrade problems
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl logs --selector=app=node-problem-detector -n kube-system

# look for: "task hung", blocked processes, resource pressure

Rollback Procedures

Control Plane Rollback

GKE:

# only within same minor version
gcloud container clusters upgrade CLUSTER_NAME \
  --master \
  --cluster-version=1.30.5-gke.previous

EKS/AKS: Control plane rollback not supported. Node pool rollback only.

Node Pool Rollback Strategy

Method 1: New Node Pool (Recommended)

# GKE
gcloud container node-pools create rollback-pool \
  --cluster=CLUSTER_NAME \
  --node-version=1.30.5-gke.previous \
  --num-nodes=3

# scale up new pool
gcloud container node-pools resize rollback-pool \
  --cluster=CLUSTER_NAME \
  --num-nodes=10

Method 2: Workload Migration

# cordon old nodes
kubectl cordon NODE_NAME

# force workload restart to migrate
for ns in $(kubectl get ns -o name | cut -d'/' -f2); do
  if [[ "$ns" != "kube-system" ]]; then
    echo "Restarting $ns"
    kubectl -n "$ns" rollout restart deployment
    kubectl -n "$ns" rollout restart statefulset
    kubectl -n "$ns" rollout restart daemonset
  fi
done

# verify distribution
kubectl get pods -o wide --all-namespaces

# remove old pool when stable

Advanced Strategies

Blue-Green Node Pool Upgrade

  1. Create new node pool with the target version
  2. Migrate workloads using node selectors/taints (see the sketch after this list)
  3. Validate functionality
  4. Complete migration
  5. Remove old pool
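
A sketch of step 2 on GKE, assuming the old pool is named old-pool (GKE labels nodes with cloud.google.com/gke-nodepool; use the equivalent node pool label on EKS/AKS):

# stop new pods landing on the old pool
kubectl cordon -l cloud.google.com/gke-nodepool=old-pool

# evict pods from the old pool while honoring PDBs
kubectl drain -l cloud.google.com/gke-nodepool=old-pool \
  --ignore-daemonsets --delete-emptydir-data --timeout=10m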

Surge Configuration Best Practices

  • Small clusters (<10 nodes): maxSurge=1, maxUnavailable=0
  • Large clusters (>50 nodes): maxSurge=20, maxUnavailable=0
  • Resource-constrained: maxSurge=0, maxUnavailable=1

Infrastructure as Code Updates

Terraform:

# GKE 
resource "google_container_cluster" "primary" {
  min_master_version = "1.31.0-gke.1234567"
}

# EKS
resource "aws_eks_cluster" "cluster" {
  version = "1.31"
}

# AKS
resource "azurerm_kubernetes_cluster" "cluster" {
  kubernetes_version = "1.31.0"
}

Apply with no-op verification:

# should show no changes post-upgrade
terraform plan 
terraform apply

Monitoring During Upgrades

Key Metrics

  • Pod scheduling latency
  • Node resource utilization
  • API server response times
  • Application error rates

Critical Events

  • Node cordoning/draining
  • Pod eviction failures
  • PDB violations
  • Failed scheduling
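
To watch for these events live while nodes roll (the field selector limits output to node-scoped events; drop it to also see pod evictions):

# stream node-related events during the upgrade
kubectl get events --all-namespaces --watch \
  --field-selector involvedObject.kind=Node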

Common Issues & Solutions

Stuck Node Upgrades:

  • Check resource quotas
  • Verify image pull capacity
  • Review PodDisruptionBudgets
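
Drains stall when a PDB has no disruptions left to give; this view surfaces those quickly:

# PDBs showing ALLOWED=0 will block node drains
kubectl get pdb --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed'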

Application Failures:

  • Validate deprecated API usage
  • Check resource requests/limits
  • Review network policies

Performance Degradation:

  • Monitor resource pressure
  • Check for node resource fragmentation
  • Validate autoscaling configuration

Platform-Specific Gotchas

GKE:

  • Automatic node auto-upgrades can conflict with manual upgrades (they can be paused per pool; see below)
  • Regional clusters take roughly 2x longer to upgrade
  • Surge upgrades require additional Compute Engine quota
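
If auto-upgrade keeps fighting a manual rollout, it can be paused per node pool and re-enabled afterwards (not possible on clusters enrolled in a release channel):

# pause auto-upgrade on a pool during a manual rollout
gcloud container node-pools update NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a \
  --no-enable-autoupgrade

# re-enable once the manual upgrade is done
gcloud container node-pools update NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a \
  --enable-autoupgrade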

EKS:

  • Add-on compatibility critical (CNI, CSI drivers)
  • Self-managed nodes require separate upgrade process
  • IAM roles may need updates

AKS:

  • Azure CNI version compatibility
  • System node pools upgrade differently
  • Virtual node pools have separate lifecycle

Reality Check: Upgrades rarely go perfectly. Plan for 2x the estimated time, have rollback procedures tested, and monitor everything.

This guide reflects real production experience. Test everything in non-prod/staging/dev first, document your specific procedures, and build confidence through repetition.