Kubernetes Cluster Upgrades: Production-Ready Guide
2025-09-15
Managed Kubernetes clusters need regular upgrades for security patches, bug fixes, and new features. This is a technical guide covering GKE, EKS, and AKS upgrades with real-world procedures and rollback strategies.
Prerequisites: Do This First
- Deprecated API Detection Using kubent (Universal):
# install kubent
sh -c "$(curl -sSL https://git.io/install-kubent)"
# scan cluster
kubent --target-version=1.31.0
kubent --output=json --target-version=1.31.0
GKE Log Explorer Query:
resource.type="k8s_cluster"
labels."k8s.io/removed-release"="1.31"
protoPayload.authenticationInfo.principalEmail:("system:serviceaccount" OR "@")
protoPayload.authenticationInfo.principalEmail!~("system:serviceaccount:kube-system")
EKS audit logs (CloudWatch Logs):
# search control plane audit logs for deprecated API calls (requires audit logging enabled)
aws logs filter-log-events \
  --log-group-name /aws/eks/your-cluster/cluster \
  --log-stream-name-prefix kube-apiserver-audit \
  --filter-pattern '"k8s.io/deprecated"'
- Compatibility Matrix Check (version checks sketched below)
- GKE: Verify Anthos Service Mesh/Istio compatibility
- EKS: Check AWS Load Balancer Controller, EBS CSI driver versions
- AKS: Validate Azure CNI, Application Gateway Ingress Controller
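A quick way to fill in that matrix is to read the image tags of the controllers you already run and compare them against each vendor's support tables. The object names below (aws-node, ebs-csi-controller) assume default installs; adjust for your cluster:
# EKS defaults: current VPC CNI and EBS CSI driver versions from their image tags
kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl -n kube-system get deployment ebs-csi-controller -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
# any platform: dump every kube-system image and compare against the support matrices
kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | sort -u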
- Resource Assessment
# check cluster capacity
kubectl top nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# review PDBs
kubectl get pdb --all-namespaces
# look for critical workloads
kubectl get deployments,statefulsets --all-namespaces
Upgrade Strategy
Environment Progression
1. Dev/Test clusters → 2. Staging → 3. Production (least critical → most critical)
Timing
- Off-peak hours
- Team availability for monitoring
- Consider maintenance windows for automatic upgrades (GKE example below)
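If the provider's automatic upgrades are enabled, pin them to a window you can staff. A GKE sketch; flag names are taken from current gcloud, so verify against your SDK version:
# recurring weekend maintenance window for automatic upgrades
gcloud container clusters update CLUSTER_NAME \
  --zone=us-central1-a \
  --maintenance-window-start=2025-09-20T02:00:00Z \
  --maintenance-window-end=2025-09-20T06:00:00Z \
  --maintenance-window-recurrence='FREQ=WEEKLY;BYDAY=SA,SU'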
Platform-Specific Procedures
GKE Upgrade
Control Plane:
# via gcloud
gcloud container clusters upgrade CLUSTER_NAME \
  --zone=us-central1-a \
  --master \
  --cluster-version=1.31.0-gke.1234567

# monitor
gcloud container operations list --filter="TYPE:UPGRADE_CLUSTER"

Node Pools (configure surge settings first):

# set surge parameters (gcloud uses --max-surge-upgrade / --max-unavailable-upgrade)
gcloud container node-pools update NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a \
  --max-surge-upgrade=20 \
  --max-unavailable-upgrade=0

# upgrade nodes
gcloud container node-pools upgrade NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a
EKS Upgrade
Control Plane:
# update cluster version
aws eks update-cluster-version \
  --region us-west-2 \
  --name my-cluster \
  --kubernetes-version 1.31

# monitor status
aws eks describe-cluster --region us-west-2 --name my-cluster
Node Groups:
# update managed node group
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --region us-west-2 \
  --kubernetes-version 1.31
# for self-managed: update launch template, then rolling update

Add-ons:

# update critical add-ons
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.1-eksbuild.1 \
  --region us-west-2
AKS Upgrade
Control Plane:
# get available versions
az aks get-versions --location eastus --output table

# upgrade the control plane only (node pools are upgraded separately below)
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.31.0 \
  --control-plane-only
Node Pools:
# upgrade specific node pool
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name mynodepool \
  --kubernetes-version 1.31.0
Post-Upgrade Validation
Health Checks
# verify pods
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
# check nodes
kubectl get nodes
kubectl describe nodes | grep -E "Conditions|Taints"
# component status (the ComponentStatus API is deprecated; treat output as informational)
kubectl get componentstatuses
Application Testing
- Critical endpoint validation (smoke-test sketch below)
- Database connectivity
- Ingress/LoadBalancer functionality
- Monitor metrics and logs
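A minimal smoke-test sketch; the URLs are placeholders for your own critical endpoints:
# hit critical endpoints and report HTTP status codes
for url in https://api.example.com/healthz https://app.example.com/login; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "$url -> $code"
done
# confirm ingress and LoadBalancer addresses survived the upgrade
kubectl get ingress,svc --all-namespaces -o wide | grep -iE 'loadbalancer|ingress'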
Node Issues
# check for common upgrade problems
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl logs --selector=app=node-problem-detector -n kube-system
# look for: "task hung", blocked processes, resource pressure
Rollback Procedures
Control Plane Rollback
GKE:
# only within same minor version
gcloud container clusters upgrade CLUSTER_NAME \
  --master \
  --cluster-version=1.30.5-gke.previous
EKS/AKS: Control plane rollback not supported. Node pool rollback only.
Node Pool Rollback Strategy
Method 1: New Node Pool (Recommended)
# GKE
gcloud container node-pools create rollback-pool \
  --cluster=CLUSTER_NAME \
  --node-version=1.30.5-gke.previous \
  --num-nodes=3

# scale up new pool
gcloud container node-pools resize rollback-pool \
  --cluster=CLUSTER_NAME \
  --num-nodes=10
Method 2: Workload Migration
# cordon old nodes
kubectl cordon NODE_NAME

# force workload restart to migrate
for ns in $(kubectl get ns -o name | cut -d'/' -f2); do
  if [[ "$ns" != "kube-system" ]]; then
    echo "Restarting $ns"
    kubectl -n "$ns" rollout restart deployment
    kubectl -n "$ns" rollout restart statefulset
    kubectl -n "$ns" rollout restart daemonset
  fi
done

# verify distribution
kubectl get pods -o wide --all-namespaces
# remove old pool when stable
Advanced Strategies
Blue-Green Node Pool Upgrade
- Create new node pool with the target version
- Migrate workloads using node selectors/taints (sketch below)
- Validate functionality
- Complete migration
- Remove old pool
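One way to force the cutover is to taint the old pool and drain it node by node. The node-pool label below is GKE's; the taint key is just an example, and the old pool name is a placeholder:
# keep new pods off the retiring pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl taint "$node" upgrade=retiring:NoSchedule
done
# drain one node at a time, respecting PDBs
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data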
Surge Configuration Best Practices
- Small clusters (<10 nodes): maxSurge=1, maxUnavailable=0
- Large clusters (>50 nodes): maxSurge=20, maxUnavailable=0
- Resource-constrained: maxSurge=0, maxUnavailable=1 (AKS/EKS equivalents sketched below)
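The GKE surge flags appear in the node pool commands above. The AKS and EKS equivalents look roughly like this; flag names are taken from the current CLIs, so confirm them against your versions:
# AKS: surge is a node pool property (count or percentage)
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name mynodepool \
  --max-surge 33%

# EKS managed node groups: the knob lives in the update config
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --update-config maxUnavailable=1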
Infrastructure as Code Updates
Terraform:
# GKE
resource "google_container_cluster" "primary" {
  min_master_version = "1.31.0-gke.1234567"
}

# EKS
resource "aws_eks_cluster" "cluster" {
  version = "1.31"
}

# AKS
resource "azurerm_kubernetes_cluster" "cluster" {
  kubernetes_version = "1.31.0"
}
Apply with no-op verification:
# should show no changes post-upgrade
terraform plan
terraform apply
Monitoring During Upgrades
Key Metrics
- Pod scheduling latency
- Node resource utilization
- API server response times
- Application error rates
Critical Events
- Node cordoning/draining
- Pod eviction failures
- PDB violations
- Failed scheduling (watch commands below)
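Most of these signals can be tailed from a terminal while nodes roll; a minimal sketch:
# stream cluster events; drain, eviction, and scheduling failures show up here
kubectl get events --all-namespaces --watch
# PDBs reporting 0 allowed disruptions will stall node drains
kubectl get pdb --all-namespaces
# pods stuck Pending usually mean the surge node has not joined or lacks capacity
kubectl get pods --all-namespaces --field-selector=status.phase=Pending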
Common Issues & Solutions
Stuck Node Upgrades:
- Check resource quotas
- Verify image pull capacity
- Review PodDisruptionBudgets (diagnostic commands below)
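Those three checks map to a handful of commands; the quota example is GKE-specific:
# PDBs blocking eviction: look for ALLOWED DISRUPTIONS = 0
kubectl get pdb --all-namespaces
# pods that cannot land on the surge nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# GKE example: regional quota headroom for the extra surge nodes
gcloud compute regions describe us-central1 --format="yaml(quotas)"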
Application Failures:
- Validate deprecated API usage
- Check resource requests/limits
- Review network policies
Performance Degradation:
- Monitor resource pressure
- Check for node resource fragmentation
- Validate autoscaling configuration
Platform-Specific Gotchas
GKE:
- Node auto-upgrades can conflict with manual upgrades (check sketched below)
- Regional clusters take roughly 2x longer to upgrade
- Surge upgrades require additional quotas
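One way to avoid the auto-upgrade conflict is to check, and temporarily disable, auto-upgrade on the pool before a manual roll (re-enable it afterwards):
# is auto-upgrade enabled on this pool?
gcloud container node-pools describe NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a \
  --format="value(management.autoUpgrade)"
# disable it for the duration of the manual upgrade
gcloud container node-pools update NODE_POOL \
  --cluster=CLUSTER_NAME \
  --zone=us-central1-a \
  --no-enable-autoupgrade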
EKS:
- Add-on compatibility is critical (CNI, CSI drivers); a version query is sketched below
- Self-managed nodes require separate upgrade process
- IAM roles may need updates
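EKS can be asked which add-on builds are published for the target version before you start; a sketch for the VPC CNI:
# add-on versions published for the target Kubernetes version
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.31 \
  --query 'addons[].addonVersions[].addonVersion'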
AKS:
- Azure CNI version compatibility
- System node pools upgrade differently (pool-mode check below)
- Virtual node pools have separate lifecycle
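A quick way to see which pools are system pools, and which versions they run, before planning the upgrade order:
# list pool modes and current versions
az aks nodepool list \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --query "[].{name:name, mode:mode, version:orchestratorVersion}" \
  --output table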
Reality Check: Upgrades rarely go perfectly. Plan for 2x the estimated time, have rollback procedures tested, and monitor everything.
This guide reflects real production experience. Test everything in non-production environments first, document your specific procedures, and build confidence through repetition.