Kubernetes Cluster Upgrades: Production-Ready Guide
2025-09-15
Kubernetes Cluster Upgrades: No BS Guide
Managed Kubernetes clusters need regular upgrades for security patches, bug fixes, and new features. This is a technical guide covering GKE, EKS, and AKS upgrades with real-world procedures and rollback strategies.
Prerequisites: Do This First
- Deprecated API Detection Using kubent (Universal):
# install kubent
sh -c "$(curl -sSL https://git.io/install-kubent)"
# scan cluster
kubent --target-version=1.31.0
kubent --output=json --target-version=1.31.0
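To gate a pipeline on the scan, kubent can return a non-zero exit code when it finds deprecated APIs; a minimal sketch (verify the --exit-error flag against your installed kubent version):
# fail a CI step when deprecated APIs are detected
kubent --exit-error --target-version=1.31.0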
GKE Log Explorer Query:
resource.type="k8s_cluster"
labels."k8s.io/removed-release"="1.31"
protoPayload.authenticationInfo.principalEmail:("system:serviceaccount" OR "@")
protoPayload.authenticationInfo.principalEmail!~("system:serviceaccount:kube-system")
AWS CloudWatch Logs (EKS audit logs):
# search audit logs for deprecated API calls (audit logging must be enabled on the cluster)
aws logs filter-log-events \
--log-group-name /aws/eks/your-cluster/cluster \
--filter-pattern '"k8s.io/deprecated"'
- Compatibility Matrix Check (see the version-listing sketch after this list)
- GKE: Verify Anthos Service Mesh/Istio compatibility
- EKS: Check AWS Load Balancer Controller, EBS CSI driver versions
- AKS: Validate Azure CNI, Application Gateway Ingress Controller
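A quick way to build the matrix is to list the controller and driver versions actually running before checking vendor support pages; a generic sketch (deployment names and namespaces vary by install method):
# list controller/driver image versions currently deployed
kubectl get deploy,ds -n kube-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# if add-ons were installed with Helm
helm list --all-namespaces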
- Resource Assessment
# check cluster capacity
kubectl top nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# review PDBs
kubectl get pdb --all-namespaces
# look for critical workloads
kubectl get deployments,statefulsets --all-namespaces
Upgrade Strategy
Environment Progression
1. Dev/Test clusters → 2. Staging → 3. Production (least critical → most critical)
Timing
- Off-peak hours
- Team availability for monitoring
- Consider maintenance windows for automatic upgrades (example configuration below)
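If automatic upgrades are enabled, pin the maintenance window so they only run when the team is around; a GKE-flavored sketch (times and recurrence are illustrative; AKS covers the same need with az aks maintenanceconfiguration):
# recurring weekend maintenance window (GKE)
gcloud container clusters update CLUSTER_NAME \
--zone=us-central1-a \
--maintenance-window-start=2025-09-20T02:00:00Z \
--maintenance-window-end=2025-09-20T06:00:00Z \
--maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA,SU"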
Platform-Specific Procedures
GKE Upgrade
Control Plane:
# via gcloud
gcloud container clusters upgrade CLUSTER_NAME \
--zone=us-central1-a \
--master \
--cluster-version=1.31.0-gke.1234567
# monitor
gcloud container operations list --filter="operationType:UPGRADE_MASTER"
Node Pools (Configure surge settings first):
# set surge parameters
gcloud container node-pools update NODE_POOL \
--cluster=CLUSTER_NAME \
--zone=us-central1-a \
--max-surge-upgrade=20 \
--max-unavailable-upgrade=0
# upgrade nodes
gcloud container clusters upgrade CLUSTER_NAME \
--node-pool=NODE_POOL \
--zone=us-central1-a
EKS Upgrade
Control Plane:
# update cluster version
aws eks update-cluster-version \
--region us-west-2 \
--name my-cluster \
--kubernetes-version 1.31
# monitor status
aws eks describe-cluster --region us-west-2 --name my-cluster
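The update-cluster-version call returns an update ID, which gives more targeted progress than describe-cluster; a sketch assuming you captured that ID:
# list in-flight and past updates, then inspect one
aws eks list-updates --region us-west-2 --name my-cluster
aws eks describe-update --region us-west-2 --name my-cluster --update-id UPDATE_ID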
Node Groups:
# update managed node group
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name my-nodes \
--region us-west-2 \
--kubernetes-version 1.31
# for self-managed: update launch template, then rolling update
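For self-managed nodes that means: bump the AMI in the launch template, then roll the Auto Scaling group; a sketch with a hypothetical ASG name and warmup values:
# roll self-managed nodes onto the new launch template version
aws autoscaling start-instance-refresh \
--auto-scaling-group-name my-node-asg \
--preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'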
Add-ons:
# update critical add-ons
aws eks update-addon \
--cluster-name my-cluster \
--addon-name vpc-cni \
--addon-version v1.18.1-eksbuild.1 \
--region us-west-2
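Compatible add-on versions differ per Kubernetes minor, so list them before picking one (vpc-cni shown; same pattern for coredns, kube-proxy, aws-ebs-csi-driver):
# list add-on versions compatible with the target Kubernetes version
aws eks describe-addon-versions \
--addon-name vpc-cni \
--kubernetes-version 1.31 \
--query 'addons[].addonVersions[].addonVersion'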
AKS Upgrade
Control Plane:
# get available versions
az aks get-versions --location eastus --output table
# upgrade control plane only (drop --control-plane-only to also upgrade all node pools)
az aks upgrade \
--resource-group myResourceGroup \
--name myAKSCluster \
--kubernetes-version 1.31.0 \
--control-plane-only
Node Pools:
# upgrade specific node pool
az aks nodepool upgrade \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name mynodepool \
--kubernetes-version 1.31.0
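Upgrade progress for a pool can be polled via its provisioning state; a minimal check:
# watch the node pool's provisioning state during the upgrade
az aks nodepool show \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name mynodepool \
--query provisioningState -o tsv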
Post-Upgrade Validation
Health Checks
# pods not Running (completed Job pods excluded)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
# check nodes
kubectl get nodes
kubectl describe nodes | grep -A5 -E "Conditions:|Taints:"
# API server health (componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
Application Testing
- Critical endpoint validation (see the smoke-test sketch after this list)
- Database connectivity
- Ingress/LoadBalancer functionality
- Monitor metrics and logs
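Endpoint validation is easiest to repeat if scripted; a minimal smoke-test sketch with hypothetical URLs, substitute your own critical endpoints:
# hypothetical endpoints - replace with your services' health/readiness URLs
ENDPOINTS="https://api.example.com/healthz https://app.example.com/readyz"
for url in $ENDPOINTS; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")
  echo "$url -> $code"
  [ "$code" = "200" ] || echo "WARNING: $url is not healthy"
done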
Node Issues
# check for common upgrade problems
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl logs --selector=app=node-problem-detector -n kube-system
# look for: "task hung", blocked processes, resource pressure
Rollback Procedures
Control Plane Rollback
GKE:
# only within same minor version
gcloud container clusters upgrade CLUSTER_NAME \
--master \
--cluster-version=1.30.5-gke.previous
EKS/AKS: Control plane rollback not supported. Node pool rollback only.
Node Pool Rollback Strategy
Method 1: New Node Pool (Recommended)
# GKE
gcloud container node-pools create rollback-pool \
--cluster=CLUSTER_NAME \
--node-version=1.30.5-gke.previous \
--num-nodes=3
# scale up new pool
gcloud container clusters resize CLUSTER_NAME \
--node-pool=rollback-pool \
--num-nodes=10
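GKE also keeps the previous node pool configuration after a failed or cancelled upgrade, so an in-place rollback of the pool is sometimes possible before resorting to a new pool:
# roll back a node pool whose upgrade failed or was cancelled (GKE)
gcloud container node-pools rollback NODE_POOL \
--cluster=CLUSTER_NAME \
--zone=us-central1-a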
Method 2: Workload Migration
# cordon every node in the old pool (repeat per node or use a label selector)
kubectl cordon NODE_NAME
# force workload restart so pods reschedule onto the new pool
for ns in $(kubectl get ns -o name | cut -d'/' -f2); do
  if [[ "$ns" != "kube-system" ]]; then
    echo "Restarting $ns"
    kubectl -n "$ns" rollout restart deployment
    kubectl -n "$ns" rollout restart statefulset
    kubectl -n "$ns" rollout restart daemonset
  fi
done
# verify distribution
kubectl get pods -o wide --all-namespaces
# remove old pool when stable
Advanced Strategies
Blue-Green Node Pool Upgrade
- Create new node pool with target version
- Migrate workloads using node selectors/taints (see the drain sketch after this list)
- Validate functionality
- Complete migration
- Remove old pool
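A sketch of the migration step, cordoning and draining the old pool by its node pool label (GKE label shown; EKS uses eks.amazonaws.com/nodegroup and AKS uses agentpool; "old-pool" is a placeholder):
# cordon the whole old pool first so evicted pods land on the new pool
OLD_POOL_SELECTOR="cloud.google.com/gke-nodepool=old-pool"
for node in $(kubectl get nodes -l "$OLD_POOL_SELECTOR" -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
done
# then drain node by node
for node in $(kubectl get nodes -l "$OLD_POOL_SELECTOR" -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=5m
done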
Surge Configuration Best Practices
- Small clusters (<10 nodes): maxSurge=1, maxUnavailable=0
- Large clusters (>50 nodes): maxSurge=20, maxUnavailable=0
- Resource-constrained: maxSurge=0, maxUnavailable=1
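The GKE flags were shown earlier; on the other platforms the equivalent knobs look roughly like this (values illustrative):
# EKS managed node group: control roll parallelism via maxUnavailable
aws eks update-nodegroup-config \
--cluster-name my-cluster \
--nodegroup-name my-nodes \
--update-config maxUnavailable=1
# AKS node pool: surge as a count or percentage
az aks nodepool update \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name mynodepool \
--max-surge 33%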
Infrastructure as Code Updates
Terraform:
# GKE
resource "google_container_cluster" "primary" {
min_master_version = "1.31.0-gke.1234567"
}
# EKS
resource "aws_eks_cluster" "cluster" {
version = "1.31"
}
# AKS
resource "azurerm_kubernetes_cluster" "cluster" {
kubernetes_version = "1.31.0"
}
Apply with no-op verification:
# should show no changes post-upgrade
terraform plan
terraform apply
Monitoring During Upgrades
Key Metrics
- Pod scheduling latency
- Node resource utilization
- API server response times
- Application error rates
Critical Events
- Node cordoning/draining
- Pod eviction failures
- PDB violations
- Failed scheduling
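Most of these surface as cluster events, so a couple of watches during the node roll catch problems early (Warning covers evictions and PDB-related failures):
# stream warning events cluster-wide during the upgrade
kubectl get events --all-namespaces --field-selector type=Warning --watch
# scheduling failures specifically
kubectl get events --all-namespaces --field-selector reason=FailedScheduling --watch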
Common Issues & Solutions
Stuck Node Upgrades:
- Check resource quotas
- Verify image pull capacity
- Review PodDisruptionBudgets (quick check below)
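A frequent culprit is a PDB with zero allowed disruptions; a quick check, assuming jq is installed:
# list PDBs that currently block all evictions
kubectl get pdb --all-namespaces -o json | \
jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'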
Application Failures:
- Validate deprecated API usage
- Check resource requests/limits
- Review network policies
Performance Degradation:
- Monitor resource pressure
- Check for node resource fragmentation
- Validate autoscaling configuration
Platform-Specific Gotchas
GKE:
- Automatic node auto-upgrades can conflict with manual upgrades
- Regional clusters take roughly 2x longer to upgrade
- Surge upgrades require additional quota
EKS:
- Add-on compatibility critical (CNI, CSI drivers)
- Self-managed nodes require separate upgrade process
- IAM roles may need updates
AKS:
- Azure CNI version compatibility
- System node pools upgrade differently
- Virtual node pools have separate lifecycle
Reality Check: Upgrades rarely go perfectly. Plan for twice the estimated time, test your rollback procedures in advance, and monitor everything.
This guide reflects real production experience. Test everything in dev and staging first, document your specific procedures, and build confidence through repetition.