Kubernetes Cluster Upgrades: Production-Ready Guide
2025-09-15
Kubernetes Cluster Upgrades: No BS Guide
Managed Kubernetes clusters need regular upgrades for security patches, bug fixes, and new features. This is a technical guide covering GKE, EKS, and AKS upgrades with real-world procedures and rollback strategies.
Prerequisites: Do This First
- Deprecated API Detection Using kubent (Universal):
# install kubent
sh -c "$(curl -sSL https://git.io/install-kubent)"
# scan cluster
kubent --target-version=1.31.0
kubent --output=json --target-version=1.31.0
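To gate a pipeline on the scan, kubent can return a non-zero exit code when it finds deprecated APIs; a minimal sketch (verify the --exit-error flag against your installed kubent version):
# fail a CI step when deprecated APIs are detected
kubent --exit-error --target-version=1.31.0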
GKE Log Explorer Query:
resource.type="k8s_cluster"
labels."k8s.io/removed-release"="1.31"
protoPayload.authenticationInfo.principalEmail:("system:serviceaccount" OR "@")
protoPayload.authenticationInfo.principalEmail!~("system:serviceaccount:kube-system")
AWS CloudWatch Logs (EKS audit logs):
# search audit logs for deprecated API calls (audit logging must be enabled on the cluster)
aws logs filter-log-events \
--log-group-name /aws/eks/your-cluster/cluster \
--filter-pattern '"k8s.io/deprecated"'
- Compatibility Matrix Check (see the version-listing sketch after this list)
- GKE: Verify Anthos Service Mesh/Istio compatibility
- EKS: Check AWS Load Balancer Controller, EBS CSI driver versions
- AKS: Validate Azure CNI, Application Gateway Ingress Controller
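A quick way to build the matrix is to list the controller and driver versions actually running before checking vendor support pages; a generic sketch (deployment names and namespaces vary by install method):
# list controller/driver image versions currently deployed
kubectl get deploy,ds -n kube-system -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# if add-ons were installed with Helm
helm list --all-namespaces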
- Resource Assessment
# check cluster capacity
kubectl top nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# review PDBs
kubectl get pdb --all-namespaces
# look for critical workloads
kubectl get deployments,statefulsets --all-namespaces
Upgrade Strategy
Environment Progression
1. Dev/Test clusters → 2. Staging → 3. Production (least critical → most critical)
Timing
- Off-peak hours
- Team availability for monitoring
- Consider maintenance windows for automatic upgrades (example configuration below)
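If automatic upgrades are enabled, pin the maintenance window so they only run when the team is around; a GKE-flavored sketch (times and recurrence are illustrative; AKS covers the same need with az aks maintenanceconfiguration):
# recurring weekend maintenance window (GKE)
gcloud container clusters update CLUSTER_NAME \
--zone=us-central1-a \
--maintenance-window-start=2025-09-20T02:00:00Z \
--maintenance-window-end=2025-09-20T06:00:00Z \
--maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA,SU"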
Platform-Specific Procedures
GKE Upgrade
Control Plane:
# via gcloud
gcloud container clusters upgrade CLUSTER_NAME \
--zone=us-central1-a \
--master \
--cluster-version=1.31.0-gke.1234567
# monitor
gcloud container operations list --filter="operationType:UPGRADE_MASTER"
Node Pools (Configure surge settings first):
# set surge parameters
gcloud container node-pools update NODE_POOL \
--cluster=CLUSTER_NAME \
--zone=us-central1-a \
--max-surge-upgrade=20 \
--max-unavailable-upgrade=0
# upgrade nodes
gcloud container clusters upgrade CLUSTER_NAME \
--node-pool=NODE_POOL \
--zone=us-central1-a
EKS Upgrade
Control Plane:
# update cluster version
aws eks update-cluster-version \
--region us-west-2 \
--name my-cluster \
--kubernetes-version 1.31
# monitor status
aws eks describe-cluster --region us-west-2 --name my-cluster
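The update-cluster-version call returns an update ID, which gives more targeted progress than describe-cluster; a sketch assuming you captured that ID:
# list in-flight and past updates, then inspect one
aws eks list-updates --region us-west-2 --name my-cluster
aws eks describe-update --region us-west-2 --name my-cluster --update-id UPDATE_ID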
Node Groups:
# update managed node group
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name my-nodes \
--region us-west-2 \
--kubernetes-version 1.31
# for self-managed: update launch template, then rolling update
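For self-managed nodes that means: bump the AMI in the launch template, then roll the Auto Scaling group; a sketch with a hypothetical ASG name and warmup values:
# roll self-managed nodes onto the new launch template version
aws autoscaling start-instance-refresh \
--auto-scaling-group-name my-node-asg \
--preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'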
Add-ons:
# update critical add-ons
aws eks update-addon \
--cluster-name my-cluster \
--addon-name vpc-cni \
--addon-version v1.18.1-eksbuild.1 \
--region us-west-2
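Compatible add-on versions differ per Kubernetes minor, so list them before picking one (vpc-cni shown; same pattern for coredns, kube-proxy, aws-ebs-csi-driver):
# list add-on versions compatible with the target Kubernetes version
aws eks describe-addon-versions \
--addon-name vpc-cni \
--kubernetes-version 1.31 \
--query 'addons[].addonVersions[].addonVersion'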
AKS Upgrade
Control Plane:
# get available versions
az aks get-versions --location eastus --output table
# upgrade control plane only (drop --control-plane-only to also upgrade all node pools)
az aks upgrade \
--resource-group myResourceGroup \
--name myAKSCluster \
--kubernetes-version 1.31.0 \
--control-plane-only
Node Pools:
# upgrade specific node pool
az aks nodepool upgrade \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name mynodepool \
--kubernetes-version 1.31.0
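Upgrade progress for a pool can be polled via its provisioning state; a minimal check:
# watch the node pool's provisioning state during the upgrade
az aks nodepool show \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name mynodepool \
--query provisioningState -o tsv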
Post-Upgrade Validation
Health Checks
# pods not Running (completed Job pods excluded)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
# check nodes
kubectl get nodes
kubectl describe nodes | grep -A5 -E "Conditions:|Taints:"
# API server health (componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
Application Testing
- Critical endpoint validation (see the smoke-test sketch after this list)
- Database connectivity
- Ingress/LoadBalancer functionality
- Monitor metrics and logs
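Endpoint validation is easiest to repeat if scripted; a minimal smoke-test sketch with hypothetical URLs, substitute your own critical endpoints:
# hypothetical endpoints - replace with your services' health/readiness URLs
ENDPOINTS="https://api.example.com/healthz https://app.example.com/readyz"
for url in $ENDPOINTS; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")
  echo "$url -> $code"
  [ "$code" = "200" ] || echo "WARNING: $url is not healthy"
done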
Node Issues
# check for common upgrade problems
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl logs --selector=app=node-problem-detector -n kube-system
# look for: "task hung", blocked processes, resource pressure
Rollback Procedures
Control Plane Rollback
GKE:
# only within same minor version
gcloud container clusters upgrade CLUSTER_NAME \
--master \
--cluster-version=1.30.5-gke.previous
EKS/AKS: Control plane rollback not supported. Node pool rollback only.
Node Pool Rollback Strategy
Method 1: New Node Pool (Recommended)
# GKE
gcloud container node-pools create rollback-pool \
--cluster=CLUSTER_NAME \
--node-version=1.30.5-gke.previous \
--num-nodes=3
# scale up new pool
gcloud container clusters resize CLUSTER_NAME \
--node-pool=rollback-pool \
--num-nodes=10
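GKE also keeps the previous node pool configuration after a failed or cancelled upgrade, so an in-place rollback of the pool is sometimes possible before resorting to a new pool:
# roll back a node pool whose upgrade failed or was cancelled (GKE)
gcloud container node-pools rollback NODE_POOL \
--cluster=CLUSTER_NAME \
--zone=us-central1-a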
Method 2: Workload Migration
# cordon every node in the old pool (repeat per node or use a label selector)
kubectl cordon NODE_NAME
# force workload restart so pods reschedule onto the new pool
for ns in $(kubectl get ns -o name | cut -d'/' -f2); do
  if [[ "$ns" != "kube-system" ]]; then
    echo "Restarting $ns"
    kubectl -n "$ns" rollout restart deployment
    kubectl -n "$ns" rollout restart statefulset
    kubectl -n "$ns" rollout restart daemonset
  fi
done
# verify distribution
kubectl get pods -o wide --all-namespaces
# remove old pool when stable
Advanced Strategies
Blue-Green Node Pool Upgrade
- Create new node pool with target version
- Migrate workloads using node selectors/taints (see the drain sketch after this list)
- Validate functionality
- Complete migration
- Remove old pool
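A sketch of the migration step, cordoning and draining the old pool by its node pool label (GKE label shown; EKS uses eks.amazonaws.com/nodegroup and AKS uses agentpool; "old-pool" is a placeholder):
# cordon the whole old pool first so evicted pods land on the new pool
OLD_POOL_SELECTOR="cloud.google.com/gke-nodepool=old-pool"
for node in $(kubectl get nodes -l "$OLD_POOL_SELECTOR" -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
done
# then drain node by node
for node in $(kubectl get nodes -l "$OLD_POOL_SELECTOR" -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=5m
done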
Surge Configuration Best Practices
- Small clusters (<10 nodes): maxSurge=1, maxUnavailable=0
- Large clusters (>50 nodes): maxSurge=20, maxUnavailable=0
- Resource-constrained: maxSurge=0, maxUnavailable=1
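The GKE flags were shown earlier; on the other platforms the equivalent knobs look roughly like this (values illustrative):
# EKS managed node group: control roll parallelism via maxUnavailable
aws eks update-nodegroup-config \
--cluster-name my-cluster \
--nodegroup-name my-nodes \
--update-config maxUnavailable=1
# AKS node pool: surge as a count or percentage
az aks nodepool update \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name mynodepool \
--max-surge 33%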
Infrastructure as Code Updates
Terraform:
# GKE
resource "google_container_cluster" "primary" {
min_master_version = "1.31.0-gke.1234567"
}
# EKS
resource "aws_eks_cluster" "cluster" {
version = "1.31"
}
# AKS
resource "azurerm_kubernetes_cluster" "cluster" {
kubernetes_version = "1.31.0"
}
Apply with no-op verification:
# should show no changes post-upgrade
terraform plan
terraform apply
Monitoring During Upgrades
Key Metrics
- Pod scheduling latency
- Node resource utilization
- API server response times
- Application error rates
Critical Events
- Node cordoning/draining
- Pod eviction failures
- PDB violations
- Failed scheduling
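Most of these surface as cluster events, so a couple of watches during the node roll catch problems early (Warning covers evictions and PDB-related failures):
# stream warning events cluster-wide during the upgrade
kubectl get events --all-namespaces --field-selector type=Warning --watch
# scheduling failures specifically
kubectl get events --all-namespaces --field-selector reason=FailedScheduling --watch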
Common Issues & Solutions
Stuck Node Upgrades:
- Check resource quotas
- Verify image pull capacity
- Review PodDisruptionBudgets (quick check below)
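A frequent culprit is a PDB with zero allowed disruptions; a quick check, assuming jq is installed:
# list PDBs that currently block all evictions
kubectl get pdb --all-namespaces -o json | \
jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'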
Application Failures:
- Validate deprecated API usage
- Check resource requests/limits
- Review network policies
Performance Degradation:
- Monitor resource pressure
- Check for node resource fragmentation
- Validate autoscaling configuration
Platform-Specific Gotchas
GKE:
- Automatic node auto-upgrades can conflict with manual upgrades
- Regional clusters take roughly 2x longer to upgrade
- Surge upgrades require additional quota
EKS:
- Add-on compatibility critical (CNI, CSI drivers)
- Self-managed nodes require separate upgrade process
- IAM roles may need updates
AKS:
- Azure CNI version compatibility
- System node pools upgrade differently
- Virtual node pools have separate lifecycle
Reality Check: Upgrades rarely go perfectly. Plan for twice the estimated time, test your rollback procedures in advance, and monitor everything.
This guide reflects real production experience. Test everything in dev and staging first, document your specific procedures, and build confidence through repetition.