EKS Elli 1.31
It's quite important to track what's changing in the EKS landscape as new versions are released. The latest in that batch is Elli, shipped on AWS as EKS 1.31. Since EKS does not expose many knobs for feature gates, it is important to understand that not everything new in Kubernetes makes it to EKS; only the features that are stable (also known as generally available) do. As of this writing EKS 1.31 is already out, and there is plenty of literature that covers what's new. Then "why this one?", and the answer is, "just for my record". So here we go, and the order reflects my understanding so far, or the relevance to usage at my workplace.
Allow StatefulSet to control start replica ordinal numbering
The StatefulSet spec now has a new section, called ordinals, which specifies the ordinal number from which to start numbering the StatefulSet's pods.
type StatefulSetOrdinals struct {
// start is the number representing the first replica's index. It may be used
// to number replicas from an alternate index (eg: 1-indexed) over the default
// 0-indexed names, or to orchestrate progressive movement of replicas from
// one StatefulSet to another.
// If set, replica indices will be in the range:
// [.spec.ordinals.start, .spec.ordinals.start + .spec.replicas).
// If unset, defaults to 0. Replica indices will be in the range:
// [0, .spec.replicas).
// +optional
Start int32 `json:"start" protobuf:"varint,1,opt,name=start"`
}
Well, where does this help? It helps when migrating workloads while keeping pod names unique across namespaces or clusters.
NOTE: The application operator should manage network connectivity, volumes and slice orchestration (when to migrate and by how many replicas).
Migrating workloads across namespaces
Suppose you have a StatefulSet in a source namespace and you want to progressively move the workload to a target namespace while keeping the pod names the same. Here is what you would do (a sketch of the target manifest follows this list):
1. Create a StatefulSet manifest for the target namespace with spec.ordinals.start equal to spec.replicas - 1 of the source, and spec.replicas set to 0.
2. Create the StatefulSet in the target namespace.
3. Scale down spec.replicas on the source StatefulSet.
4. Scale up spec.replicas on the target StatefulSet.
5. Repeat 3 and 4, but ensure that on 4 you also decrease spec.ordinals.start by one every iteration, until you have 0 replicas on the source and the original number of replicas on the target.
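As a minimal sketch of step 1, assuming the source StatefulSet is named podinfo with 3 replicas (so the target starts at ordinal 2); the namespace and labels here are illustrative, not prescriptive:
---
# Hypothetical target manifest for step 1: same name as the source StatefulSet,
# created in the target namespace with zero replicas, and ordinals.start set to
# source replicas - 1 (here 3 - 1 = 2).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: podinfo
  namespace: target          # illustrative target namespace
spec:
  serviceName: "podinfo"
  replicas: 0                # scaled up by one replica per migration step
  ordinals:
    start: 2                 # source has 3 replicas, so begin with ordinal 2
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: ghcr.io/stefanprodan/podinfo:6.7.0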
Migration of workloads across clusters
Similar to migrating across namespaces, you can perform the same procedure across clusters as well.
Non-zero based indexing
A user may want to number their StatefulSet starting from ordinal 1, rather than ordinal 0. Using 1-based numbering may be easier to reason about and conceptualize (e.g. ordinal k is the k'th replica, not the (k+1)'th replica).
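A minimal sketch of what that looks like in the spec (the name and image are illustrative):
---
# Hypothetical 1-indexed StatefulSet: pods will be named web-1, web-2, web-3.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "web"
  replicas: 3
  ordinals:
    start: 1                 # start numbering from 1 instead of 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: ghcr.io/stefanprodan/podinfo:6.7.0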
But the interesting thing…
When you roll out a change to a StatefulSet, Kubernetes by default brings down one pod and then brings it back up, starting with the largest ordinal number and working backwards to 0. But with spec.ordinals you can play with the start value to ensure that at any given time you have at least three pods of your StatefulSet running. Assume you have the following StatefulSet YAML.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: podinfo
spec:
serviceName: "podinfo"
replicas: 3
selector:
matchLabels:
app: backend
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9797"
labels:
app: backend
spec:
serviceAccountName: podinfo-sa
containers:
- name: backend
image: ghcr.io/stefanprodan/podinfo:6.7.0
imagePullPolicy: IfNotPresent
          resources:
            requests:
              memory: 32Mi
...
This will deploy pods as follows
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
podinfo-0 1/1 Running 0 36s
podinfo-1 1/1 Running 0 24s
podinfo-2 0/1 Running 0 12s
Now, if you want to roll out a new change, say image version 6.7.1, then apart from the image tag change you would add the following.
...
spec:
replicas: 3
ordinals:
start: 3
...
spec:
...
containers:
- name: backend
image: ghcr.io/stefanprodan/podinfo:6.7.1
...
Now, deploying these changes, you can --watch the output.
$ kubectl get pods -n podinfo-test --watch
NAME READY STATUS RESTARTS AGE
podinfo-0 1/1 Running 0 3m6s
podinfo-1 1/1 Running 0 2m54s
podinfo-2 1/1 Running 0 2m43s
podinfo-3 0/1 Pending 0 0s
podinfo-3 0/1 Pending 0 0s
podinfo-3 0/1 ContainerCreating 0 0s
podinfo-3 0/1 ContainerCreating 0 1s
podinfo-3 0/1 Running 0 2s
podinfo-3 1/1 Running 0 12s
podinfo-4 0/1 Pending 0 0s
podinfo-4 0/1 Pending 0 0s
podinfo-4 0/1 ContainerCreating 0 0s
podinfo-4 0/1 ContainerCreating 0 0s
podinfo-4 0/1 Running 0 1s
podinfo-4 1/1 Running 0 11s
podinfo-5 0/1 Pending 0 0s
podinfo-5 0/1 Pending 0 0s
podinfo-5 0/1 ContainerCreating 0 0s
podinfo-5 0/1 ContainerCreating 0 1s
podinfo-5 0/1 Running 0 1s
podinfo-5 1/1 Running 0 12s
podinfo-2 1/1 Terminating 0 3m25s
podinfo-2 1/1 Terminating 0 3m28s
podinfo-2 0/1 Completed 0 3m28s
podinfo-2 0/1 Completed 0 3m28s
podinfo-2 0/1 Completed 0 3m28s
podinfo-1 1/1 Terminating 0 3m39s
podinfo-1 1/1 Terminating 0 3m42s
podinfo-1 0/1 Completed 0 3m42s
podinfo-1 0/1 Completed 0 3m43s
podinfo-1 0/1 Completed 0 3m43s
podinfo-0 1/1 Terminating 0 3m55s
podinfo-0 1/1 Terminating 0 3m59s
podinfo-0 0/1 Completed 0 3m59s
podinfo-0 0/1 Completed 0 3m59s
podinfo-0 0/1 Completed 0 3m59s
Notice how the rollout progressed: first the newly requested ordinals were brought up, and only then were the older ordinals destroyed. This ensures the rollout happens without any disruption.
Caveat
StatefulSets that use volumeClaimTemplates will create pods that consume per-replica PVCs. PVs are cluster-scoped resources, but they are bound one-to-one with namespace-scoped PVCs. If the underlying storage is to be re-used in the new namespace, PVs must be unbound and manipulated appropriately (see the sketch below).
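As a rough sketch of what "manipulated appropriately" can mean, one common approach is to make sure the volume is retained and then clear its claimRef so that a PVC in the target namespace can bind to it again, for example via kubectl patch. The patch below is illustrative only and assumes you apply it to each affected PV:
# Illustrative patch for a PV being moved along with its workload:
# keep the volume (do not delete it on release) and clear the claimRef
# so a PVC in the target namespace can bind to it again.
spec:
  persistentVolumeReclaimPolicy: Retain
  claimRef: null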
Random Pod Selection on ReplicaSet Downscale
This feature addresses the imbalance that could occur in previous versions of Kubernetes in the following scenario:
- Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains, thus each has N pods.
- An upgrade happens that adds a new failure domain, and the ReplicaSet is upscaled to 3N. The new domain now holds all the youngest pods due to scheduler spreading.
- ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all the Pods from one domain are removed, leading to imbalance.
The situation doesn't improve with repeated upscale and downscale steps. Instead, a randomized approach leaves about 2/3*N Pods in each failure domain: when N Pods are removed at random out of 3N, each Pod has roughly a one-in-three chance of being deleted, regardless of which domain it sits in.
The original heuristic, downscaling the youngest Pods first, has its benefits. Newer Pods might not have finished starting up (or warming up) and are likely to have fewer active connections than older Pods. However, this distinction doesn't generally apply once Pods have been running steadily for some time, so this feature strikes a balance.
Unhealthy Pod Eviction Policy for PDBs
With this, the PodDisruptionBudget gets a new spec field called unhealthyPodEvictionPolicy, which specifies whether pods that have not yet reached a healthy status should be counted in the disruption calculation when routine evictions are requested.
To make things clearer: a pod is counted in the disruption calculation when it is Running, and it becomes Running the moment all the containers in the pod are running, not necessarily when it is healthy (that is, when the readiness probe has passed). So essentially, even before the pod is serving any traffic it is considered Running and starts to participate in the disruption budget. Now, when such pods are encountered during a node drain, they could in fact be safely evicted (as they are not serving any traffic), but they cannot be, because they already count towards the disruption calculation. To override this default behavior we can set unhealthyPodEvictionPolicy to AlwaysAllow, like the following.
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: podinfo-pdb
spec:
maxUnavailable: 1
selector:
matchLabels:
app: backend
unhealthyPodEvictionPolicy: AlwaysAllow
The default value for this is IfHealthyBudget, which preserves the status quo.
type PodDisruptionBudgetSpec struct {
...
// UnhealthyPodEvictionPolicy defines the criteria for when unhealthy pods
// should be considered for eviction. Current implementation considers healthy pods,
// as pods that have status.conditions item with type="Ready",status="True".
//
// Valid policies are IfHealthyBudget and AlwaysAllow.
// If no policy is specified, the default behavior will be used,
// which corresponds to the IfHealthyBudget policy.
//
// IfHealthyBudget policy means that running pods (status.phase="Running"),
// but not yet healthy can be evicted only if the guarded application is not
// disrupted (status.currentHealthy is at least equal to status.desiredHealthy).
// Healthy pods will be subject to the PDB for eviction.
//
// AlwaysAllow policy means that all running pods (status.phase="Running"),
// but not yet healthy are considered disrupted and can be evicted regardless
// of whether the criteria in a PDB is met. This means perspective running
// pods of a disrupted application might not get a chance to become healthy.
// Healthy pods will be subject to the PDB for eviction.
//
// Additional policies may be added in the future.
// Clients making eviction decisions should disallow eviction of unhealthy pods
// if they encounter an unrecognized policy in this field.
//
// This field is beta-level. The eviction API uses this field when
// the feature gate PDBUnhealthyPodEvictionPolicy is enabled (enabled by default).
// +optional
UnhealthyPodEvictionPolicy *UnhealthyPodEvictionPolicyType
}
// UnhealthyPodEvictionPolicyType defines the criteria for when unhealthy pods
// should be considered for eviction.
// +enum
type UnhealthyPodEvictionPolicyType string
const (
// IfHealthyBudget policy means that running pods (status.phase="Running"),
// but not yet healthy can be evicted only if the guarded application is not
// disrupted (status.currentHealthy is at least equal to status.desiredHealthy).
// Healthy pods will be subject to the PDB for eviction.
IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget"
// AlwaysAllow policy means that all running pods (status.phase="Running"),
// but not yet healthy are considered disrupted and can be evicted regardless
// of whether the criteria in a PDB is met. This means perspective running
// pods of a disrupted application might not get a chance to become healthy.
// Healthy pods will be subject to the PDB for eviction.
AlwaysAllow UnhealthyPodEvictionPolicyType = "AlwaysAllow"
)
PersistentVolume last phase transition time
This is a neat addition to the PersistentVolume status: a lastPhaseTransitionTime field that records when the volume last transitioned from one phase to another (for example, from Pending to Bound, or from Bound to Released). This helps in two foreseeable ways.
- The storage administrator can now look at all the Released volumes and decide on a delete policy for the older ones.
- From a performance standpoint, we can check how long a volume took to go from the Pending phase to the Bound phase.
And here is the field on the PersistentVolumeStatus struct that holds the LastPhaseTransitionTime.
type PersistentVolumeStatus struct {
    // phase indicates if a volume is available, bound to a claim, or released by a claim.
    // +optional
    Phase PersistentVolumePhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase,casttype=PersistentVolumePhase"`
    // message is a human-readable message indicating details about why the volume is in this state.
    // +optional
    Message string `json:"message,omitempty" protobuf:"bytes,2,opt,name=message"`
    // reason is a brief CamelCase string that describes any failure and is meant
    // for machine parsing and tidy display in the CLI.
    // +optional
    Reason string `json:"reason,omitempty" protobuf:"bytes,3,opt,name=reason"`
    // lastPhaseTransitionTime is the time the phase transitioned from one to another
    // and automatically resets to the current time every time the volume phase transitions.
    // +optional
    LastPhaseTransitionTime *metav1.Time `json:"lastPhaseTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastPhaseTransitionTime"`
}
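For illustration, this is roughly how it surfaces on a volume's status (the timestamp below is made up):
# Hypothetical excerpt from a PersistentVolume's status.
status:
  phase: Bound
  lastPhaseTransitionTime: "2024-10-07T09:15:42Z"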
Conclusion
There are other features too that transitioned to GA in this release. If you are interested in them, you can read about them here. The linked read also gives insight into what is getting deprecated or removed as part of this release. I will be spending more time with the other features that have gone GA, and will maybe write about them soon.