< home

EKS Elli 1.31

It's quite critical to track what's changing in the EKS landscape as new versions release. The latest in that batch is Elli, dubbed on AWS as EKS 1.31. Since EKS does not provide too many knobs on the feature gates, it is critical to understand that not everything new in Kubernetes makes it to EKS; only the features that are stable (also known as general availability) do. As of this writing EKS 1.31 is already out, and there is plenty of literature out there to tell what's new. Then "why this one?", and the answer is, "just for my record". So here we go, and the order denotes my understanding so far, or relevance to usage at my workplace.

Allow StatefulSet to control start replica ordinal numbering

Now there is a new section in the StatefulSet spec, called ordinals, which specifies the ordinal number to start numbering the StatefulSet pods from.

type StatefulSetOrdinals struct {
	// start is the number representing the first replica's index. It may be used
	// to number replicas from an alternate index (eg: 1-indexed) over the default
	// 0-indexed names, or to orchestrate progressive movement of replicas from
	// one StatefulSet to another.
	// If set, replica indices will be in the range:
	//   [.spec.ordinals.start, .spec.ordinals.start + .spec.replicas).
	// If unset, defaults to 0. Replica indices will be in the range:
	//   [0, .spec.replicas).
	// +optional
	Start int32 `json:"start" protobuf:"varint,1,opt,name=start"`
}

Well, where does this help? It helps when migrating workloads, while keeping pod names unique across namespaces or clusters.

NOTE: The application operator should manage network connectivity, volumes and slice orchestration (when to migrate and by how many replicas).

Migration of workloads across namespaces

Suppose you have a StatefulSet in a source namespace and you want to progressively move the workloads to a target namespace while keeping the pod names the same. Here is what you would do.

  1. Create a StatefulSet yaml for the target namespace with spec.ordinals.start equal to spec.replicas - 1 of the source, and keep spec.replicas as 0.
  2. Create the StatefulSet in the target namespace.
  3. Scale down spec.replicas on the source StatefulSet.
  4. Scale up spec.replicas on the target StatefulSet.
  5. Repeat 3 and 4, but ensure on 4 you also decrease spec.ordinals.start on every iteration, until you have 0 replicas on the source and the original source replica count on the target (see the sketch after this list).
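
To make it concrete, here is a minimal sketch of the hand-off, assuming a 3-replica StatefulSet named podinfo moving from a namespace called prod-old to one called prod-new (both namespace names are made up for illustration).

---
# Target StatefulSet in prod-new: zero replicas to begin with, and
# ordinals starting at 2 (source replicas - 1) so its pod names never
# collide with the pods still running in prod-old.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: podinfo
  namespace: prod-new
spec:
  serviceName: "podinfo"
  replicas: 0
  ordinals:
    start: 2
  selector:
    matchLabels:
      app: backend
  template:
    ...

# One iteration: drop podinfo-2 in the source, bring up podinfo-2 in the target.
$ kubectl -n prod-old scale statefulset/podinfo --replicas=2
$ kubectl -n prod-new scale statefulset/podinfo --replicas=1
# Next iteration: lower spec.ordinals.start to 1 on the target, then scale
# the source to 1 and the target to 2, and so on until the source is at 0.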

Migration of workloads across clusters

Similar to migrating across namespaces, you could perform the same steps across clusters as well.
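
The only difference is that the scale commands are pointed at different clusters rather than different namespaces. A rough sketch, assuming two kubeconfig contexts named source-cluster and target-cluster (the context names are placeholders):

# Same hand-off loop as above, targeted at two clusters via contexts.
$ kubectl --context source-cluster scale statefulset/podinfo --replicas=2
$ kubectl --context target-cluster scale statefulset/podinfo --replicas=1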

Non-zero based indexing

A user may want to number their StatefulSet starting from ordinal 1, rather than ordinal 0. Using 1-based numbering may be easier to reason about and conceptualize (eg: ordinal k is the k'th replica, not the (k+1)'th replica).
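
As a small illustrative fragment (reusing the podinfo StatefulSet from below), setting spec.ordinals.start to 1 would give you podinfo-1, podinfo-2 and podinfo-3 instead of podinfo-0 through podinfo-2.

...
spec:
  replicas: 3
  ordinals:
    start: 1   # replica indices become 1, 2 and 3
...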

But the interesting thing…

When you roll out changes to a StatefulSet, in earlier versions Kubernetes always brings down one pod and then brings it back up, starting with the largest ordinal number and working backwards to 0. But with spec.ordinals you can play around with start to ensure that at any given time you have at least three pods of your StatefulSet running. Assume you have the following StatefulSet yaml.

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: podinfo
spec:
  serviceName: "podinfo"
  replicas: 3
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
      labels:
        app: backend
    spec:
      serviceAccountName: podinfo-sa
      containers:
      - name: backend
        image: ghcr.io/stefanprodan/podinfo:6.7.0
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: 32Mi
      ...

This will deploy pods as follows

$ kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
podinfo-0   1/1     Running   0          36s
podinfo-1   1/1     Running   0          24s
podinfo-2   0/1     Running   0          12s

Now if you want to roll out a new change, say image version 6.7.1, then apart from the image tag change you would need to add the following.

...
spec:
  replicas: 3
  ordinals:
    start: 3
...
    spec:
...
      containers:
      - name: backend
        image: ghcr.io/stefanprodan/podinfo:6.7.1
...

Now deploy these changes and --watch the output.

$ kubectl get pods -n podinfo-test --watch
NAME        READY   STATUS    RESTARTS   AGE
podinfo-0   1/1     Running   0          3m6s
podinfo-1   1/1     Running   0          2m54s
podinfo-2   1/1     Running   0          2m43s
podinfo-3   0/1     Pending   0          0s
podinfo-3   0/1     Pending   0          0s
podinfo-3   0/1     ContainerCreating   0          0s
podinfo-3   0/1     ContainerCreating   0          1s
podinfo-3   0/1     Running             0          2s
podinfo-3   1/1     Running             0          12s
podinfo-4   0/1     Pending             0          0s
podinfo-4   0/1     Pending             0          0s
podinfo-4   0/1     ContainerCreating   0          0s
podinfo-4   0/1     ContainerCreating   0          0s
podinfo-4   0/1     Running             0          1s
podinfo-4   1/1     Running             0          11s
podinfo-5   0/1     Pending             0          0s
podinfo-5   0/1     Pending             0          0s
podinfo-5   0/1     ContainerCreating   0          0s
podinfo-5   0/1     ContainerCreating   0          1s
podinfo-5   0/1     Running             0          1s
podinfo-5   1/1     Running             0          12s
podinfo-2   1/1     Terminating         0          3m25s
podinfo-2   1/1     Terminating         0          3m28s
podinfo-2   0/1     Completed           0          3m28s
podinfo-2   0/1     Completed           0          3m28s
podinfo-2   0/1     Completed           0          3m28s
podinfo-1   1/1     Terminating         0          3m39s
podinfo-1   1/1     Terminating         0          3m42s
podinfo-1   0/1     Completed           0          3m42s
podinfo-1   0/1     Completed           0          3m43s
podinfo-1   0/1     Completed           0          3m43s
podinfo-0   1/1     Terminating         0          3m55s
podinfo-0   1/1     Terminating         0          3m59s
podinfo-0   0/1     Completed           0          3m59s
podinfo-0   0/1     Completed           0          3m59s
podinfo-0   0/1     Completed           0          3m59s

Notice how the rollout progressed: first the newly requested ordinals were brought up, and only then were the older ordinals terminated. This ensures the rollout happens without any disruption.

Caveat

StatefulSets that use volumeClaimTemplates will create pods that consume per-replica PVCs. PVs are cluster-scoped resources, but they are bound one-to-one with namespace-scoped PVCs. If the underlying storage is to be re-used in the new namespace, the PVs must be unbound and manipulated appropriately.
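
As a rough sketch of that manipulation (the PV name data-podinfo-2 is a placeholder, and the exact sequence should be verified against your storage driver): make sure the volume survives the deletion of the source PVC, then clear the stale claimRef so it can bind to the PVC in the new namespace.

# Keep the volume around when the source PVC is deleted.
$ kubectl patch pv data-podinfo-2 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# After the source PVC is gone, drop the stale claimRef so the PV leaves
# the Released phase and can bind to the PVC in the new namespace.
$ kubectl patch pv data-podinfo-2 --type=json -p '[{"op":"remove","path":"/spec/claimRef"}]'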

Random Pod Selection on ReplicaSet Downscale

This feature addresses the imbalance that could occur in previous versions of Kubernetes in a scenario like the following.

  1. Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains, thus each has N pods.
  2. An upgrade happens adding a new failure domain and the ReplicaSet is upscaled to 3N. The new domain now holds all the youngest pods due to scheduler spreading.
  3. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all the Pods from one domain are removed, leading to imbalance.

The situation doesn’t improve with repeated upscale and downscale steps. Instead, a randomized approach leaves about 2/3*N pods in each failure domain.

The original heuristic, downscaling the youngest Pods first, has its benefits. Newer Pods might not have finished starting up (or warming up) and are likely to have fewer active connections than older Pods. However, this distinction doesn’t generally apply once Pods have been running steadily for some time. So this feature provides a balance.

Unhealthy Pod Eviction Policy for PDBs

With this, the PodDisruptionBudget spec has a new field called unhealthyPodEvictionPolicy, which specifies whether pods that have not yet reached a healthy status should be counted in the disruption calculation when routine evictions (such as node drains) take place.

Just to make things clearer: a pod is counted in the disruption calculation when it is Running, and it becomes Running the moment all the containers in the pod are running, not necessarily when it is healthy (that is, when the readiness probe has passed). So essentially, even before the pod is serving any traffic it is considered Running and starts to participate in the disruption budget. When such pods are encountered during a node drain, they could in fact be safely evicted (as they are not serving any traffic), but they cannot be, because they are already part of the disruption calculations. To override this default behavior we can set unhealthyPodEvictionPolicy to AlwaysAllow, like the following.

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: podinfo-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: backend
  unhealthyPodEvictionPolicy: AlwaysAllow

The default value for this is IfHealthyBudget, which preserves the status quo.

type PodDisruptionBudgetSpec struct {
...
	// UnhealthyPodEvictionPolicy defines the criteria for when unhealthy pods
	// should be considered for eviction. Current implementation considers healthy pods,
	// as pods that have status.conditions item with type="Ready",status="True".
	//
	// Valid policies are IfHealthyBudget and AlwaysAllow.
	// If no policy is specified, the default behavior will be used,
	// which corresponds to the IfHealthyBudget policy.
	//
	// IfHealthyBudget policy means that running pods (status.phase="Running"),
	// but not yet healthy can be evicted only if the guarded application is not
	// disrupted (status.currentHealthy is at least equal to status.desiredHealthy).
	// Healthy pods will be subject to the PDB for eviction.
	//
	// AlwaysAllow policy means that all running pods (status.phase="Running"),
	// but not yet healthy are considered disrupted and can be evicted regardless
	// of whether the criteria in a PDB is met. This means perspective running
	// pods of a disrupted application might not get a chance to become healthy.
	// Healthy pods will be subject to the PDB for eviction.
	//
	// Additional policies may be added in the future.
	// Clients making eviction decisions should disallow eviction of unhealthy pods
	// if they encounter an unrecognized policy in this field.
	//
	// This field is beta-level. The eviction API uses this field when
	// the feature gate PDBUnhealthyPodEvictionPolicy is enabled (enabled by default).
	// +optional
	UnhealthyPodEvictionPolicy *UnhealthyPodEvictionPolicyType
}

// UnhealthyPodEvictionPolicyType defines the criteria for when unhealthy pods
// should be considered for eviction.
// +enum
type UnhealthyPodEvictionPolicyType string

const (
	// IfHealthyBudget policy means that running pods (status.phase="Running"),
	// but not yet healthy can be evicted only if the guarded application is not
	// disrupted (status.currentHealthy is at least equal to status.desiredHealthy).
	// Healthy pods will be subject to the PDB for eviction.
	IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget"

	// AlwaysAllow policy means that all running pods (status.phase="Running"),
	// but not yet healthy are considered disrupted and can be evicted regardless
	// of whether the criteria in a PDB is met. This means perspective running
	// pods of a disrupted application might not get a chance to become healthy.
	// Healthy pods will be subject to the PDB for eviction.
	AlwaysAllow UnhealthyPodEvictionPolicyType = "AlwaysAllow"
)

PersistentVolume last phase transition time

This is a neat addition to the PersistentVolume status: a new lastPhaseTransitionTime field that records when the volume transitioned into its current phase. This helps in two foreseeable ways.

  1. The storage administrator can now look at all the Released volumes and decide to clean up the older ones (see the one-liner after this list).
  2. From a performance standpoint, we could check how long a volume took to get from the Pending phase to the Bound phase.
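
For instance, a quick way to eyeball this (just a sketch using kubectl custom columns, nothing specific to this feature) is:

$ kubectl get pv --sort-by=.status.lastPhaseTransitionTime \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,LAST_TRANSITION:.status.lastPhaseTransitionTime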

And here is the field on PersistentVolumeStatus that records it.

type PersistentVolumeStatus struct {
...
	// lastPhaseTransitionTime is the time the phase transitioned from one to another
	// and automatically resets to current time everytime a volume phase transitions.
	// +optional
	LastPhaseTransitionTime *metav1.Time `json:"lastPhaseTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastPhaseTransitionTime"`
}

Conclusion

There are others too that transitioned to GA in this release. If you are interested in the rest that made it to GA, you could read about them here. The linked read also gives insights into what is getting deprecated or removed as part of this release. I will be spending more time with the others that have GA-ed, and will, maybe, write about them soon.