CSM for Resiliency is designed to make stateful Kubernetes applications more resilient to various failures. The CSM Resiliency module can detect node failure, control plane failure (the kubelet on the worker node), control plane network failure (loss of network connectivity between a worker node and the control plane node), and storage array I/O failure, and take the necessary actions to bring the workload back online. CSM for Resiliency and the non-graceful node shutdown feature provided by Kubernetes are mutually exclusive; use one or the other. The CSM Resiliency module, also called podmon, runs as a sidecar container in the CSI driver controller and node pods. It gives granular control, so the functionality can be enabled per pod: adding the label podmon.dellemc.com/driver to a pod enables the resiliency module for it. A Kubernetes cluster can therefore contain a mix of pods, some monitored by podmon and some not.
You can read more about the CSM Resiliency module here.
User applications can have problems if you want their Pods to be resilient to node failure. This is especially true of those deployed with StatefulSets that use PersistentVolumeClaims. Kubernetes guarantees that there will never be two copies of the same StatefulSet Pod running at the same time and accessing storage. Therefore, it does not clean up StatefulSet Pods if the node executing them fails.
For the complete discussion and rationale, you can read the pod-safety design proposal.
For more background on the forced deletion of Pods in a StatefulSet, please visit Force Delete StatefulSet Pods.
CSM for Resiliency High-Level Description
CSM for Resiliency is designed to make Kubernetes Applications, including those that utilize persistent storage, more resilient to various failures. The first component of the Resiliency module is a pod monitor that is specifically designed to protect stateful applications from various failures. It is not a standalone application, but rather is deployed as a sidecar to CSI (Container Storage Interface) drivers, in both the driver’s controller pods and the driver’s node pods. Deploying CSM for Resiliency as a sidecar allows it to make direct requests to the driver through the Unix domain socket that Kubernetes sidecars use to make CSI requests.
Some of the methods CSM for Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM for Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently, there is only one, ValidateVolumeHostConnectivity, which reports whether a host is connected to the storage system and/or whether any I/O activity has happened in the recent past on a list of specified volumes. This allows CSM for Resiliency to make more accurate determinations about the state of the system and its persistent volumes. CSM for Resiliency is designed to adhere to the pod affinity settings of pods.
Accordingly, CSM for Resiliency is adapted to and qualified with each CSI driver it is to be used with. Different storage systems have different nuances and characteristics that CSM for Resiliency must take into account.
In this demo we are using a vanilla Kubernetes cluster with one control plane node and three worker nodes for deploying the CSM Resiliency module.
We can install it using the Helm chart provided by Dell. First, we clone the Git repository that contains the chart and configure the values.yaml file to enable the CSM Resiliency module.
# git clone -b v2.6.0 https://github.com/dell/csi-powerstore.git
# cd csi-powerstore
# cp helm/csi-powerstore/values.yaml myvalues.yaml
# vi myvalues.yaml
# kubectl create ns csi-powerstore
# kubectl create secret generic powerstore-config -n csi-powerstore --from-file=config=config.yaml
# dell-csi-helm-installer/csi-install.sh --namespace csi-powerstore --values myvalues.yaml
# kubectl apply -f sc-powerstore-rwo.yaml
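Before running the installer, it can be worth confirming that the edited values file really enables the module. The sketch below assumes the chart exposes a top-level podmon block with an enabled key; check the values.yaml of the version you cloned, because the key names here are an assumption and podmon_enabled is a hypothetical helper, not part of the Dell tooling.

```shell
# Sketch: report whether a Helm values file turns podmon on.
# Assumption: the chart uses a top-level "podmon:" block that contains
# an "enabled:" key. podmon_enabled is a hypothetical helper name.
podmon_enabled() {
  awk '
    /^podmon:/   { in_block = 1; next }   # entered the podmon block
    /^[^ \t#]/   { in_block = 0 }         # any other top-level key ends it
    in_block && $1 == "enabled:" { print $2; found = 1; exit }
    END { if (!found) print "unknown" }
  ' "$1"
}
```

Calling podmon_enabled myvalues.yaml prints the value found under that key, or unknown if the file has no such block.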
We will create two StatefulSets, one with the podmon.dellemc.com/driver label and one without it. Pods with the label are monitored by podmon and recovered in case of failure; pods without the label are not recovered.
# kubectl create -f sts-busybox.yaml
# kubectl create -f sts-busybox-resiliency.yaml
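To see which pods podmon will actually watch, one option is to select on the label key alone. This is a minimal sketch: list_monitored_pods is a hypothetical helper name, and it assumes kubectl is already pointed at the demo cluster.

```shell
# Sketch: list pods that carry the podmon label, across all namespaces.
# "-l KEY" selects on the label key alone, whatever its value is.
# list_monitored_pods is a hypothetical helper name.
list_monitored_pods() {
  kubectl get pods --all-namespaces -l podmon.dellemc.com/driver -o wide
}
```

The wide output includes a NODE column, which helps when choosing the worker node to shut down for the failure test.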
Now we can shut down k8s06-node02, the node where the pods are running, to simulate a node failure. The kubelet sends a heartbeat every 10 seconds (nodeStatusUpdateFrequency). If no heartbeat arrives for 40 seconds (node-monitor-grace-period), the node is marked NotReady and the taints node.kubernetes.io/unreachable:NoExecute and node.kubernetes.io/unreachable:NoSchedule are added to it. Kubernetes then waits 5 minutes (default-unreachable-toleration-seconds) before starting pod eviction and marking the pods as Terminating. Because the node is down, the kubelet on the worker cannot delete the pods and complete the cleanup.
For a StatefulSet, Kubernetes does not start a replacement pod until the old pod is deleted, because the pods in a StatefulSet are uniquely identified. Kubernetes therefore waits for the pod to be deleted, either indefinitely or until the failed node comes back online. This is the default Kubernetes behavior with StatefulSets, and it results in a service outage until an administrator intervenes to recover the service. With pods monitored by the CSM Resiliency module, podmon detects the node failure and performs the necessary recovery automatically to bring the service back online.
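The timing above adds up as follows. This is only back-of-the-envelope arithmetic with the default values quoted in the text; the variable names are illustrative and are not Kubernetes settings.

```shell
# Sketch: rough delay before Kubernetes starts evicting pods from a node
# that stopped reporting, using the default values quoted in the text.
# The variable names are illustrative, not Kubernetes settings.
node_monitor_grace_period=40   # seconds without a heartbeat before NotReady
unreachable_toleration=300     # seconds a pod tolerates the unreachable taint

eviction_delay=$(( node_monitor_grace_period + unreachable_toleration ))
echo "Eviction starts roughly ${eviction_delay}s after the last heartbeat"
```

With the defaults this comes to roughly 340 seconds, and for a StatefulSet pod on a dead node even that only gets the pod as far as Terminating.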
Without the CSM Resiliency module, the easiest way to recover from a temporary node failure is to fix the problem with the node and bring it back online. We can also force delete the pod and the VolumeAttachment objects from Kubernetes, but we must be very careful with this approach: the objects are deleted from the etcd database, not from the actual node, so if the node comes back online this may result in data corruption. You can read this to better understand the risks involved in force deleting StatefulSet pods. When the CSM Resiliency module is enabled, pods impacted by node failures are automatically moved to a healthy node. To bring a pod back online, the module force deletes the pod and VolumeAttachment objects from etcd and taints the failed node. After the failed nodes have come back online, CSM for Resiliency cleans them up (especially any potential zombie pods) and then automatically removes the CSM for Resiliency node taint that prevents pods from being scheduled to the failed node(s).
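For reference, the manual path described here comes down to two kubectl commands per stuck pod. The sketch below only prints them rather than running them, given the data corruption risk; print_manual_recovery is a hypothetical helper, and the namespace, pod, and VolumeAttachment names are whatever applies in your cluster.

```shell
# Sketch: print (not run) the manual recovery commands for one stuck pod.
# print_manual_recovery is a hypothetical helper; its arguments are the
# namespace, the pod name, and the VolumeAttachment name.
print_manual_recovery() {
  ns=$1 pod=$2 va=$3
  echo "kubectl delete pod $pod -n $ns --grace-period=0 --force"
  echo "kubectl delete volumeattachment $va"
}
```

The VolumeAttachment names can be looked up with kubectl get volumeattachment. Only run the printed commands once you are sure the failed node cannot write to the volume any more.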
It is recommended that pods that will be monitored by CSM for Resiliency be configured to exit if they receive any I/O errors. That should help achieve the recovery as quickly as possible.
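One way to follow this recommendation in a simple test workload is to have the container's main loop stop as soon as a write to the volume fails. This is a minimal sketch, not the demo's actual workload: both function names are hypothetical, and the mount path passed in is an assumption.

```shell
# Sketch: keep writing a timestamp to the volume and stop on the first
# failed write. Both function names are hypothetical helpers.
write_heartbeat() {
  # Succeeds only if the timestamp can be written into the given directory.
  { date > "$1/heartbeat"; } 2>/dev/null
}

run_until_io_error() {
  # $1 is the mount path, $2 an optional pause in seconds (default 5).
  while write_heartbeat "$1"; do
    sleep "${2:-5}"
  done
  echo "I/O error on $1, exiting" >&2
  return 1
}
```

As a container command this could be invoked as run_until_io_error /data || exit 1, where /data stands in for the volume mount path. Because writes are buffered, an error may not show up on the very first failed write, so a real application should rely on its own I/O error handling.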
CSM for Resiliency does not directly monitor application health. However, if standard Kubernetes health checks are configured, that may help reduce pod recovery time in the event of node failure, as CSM for Resiliency should receive an event that the application is Not Ready. Note that a Not Ready pod is not sufficient to trigger CSM for Resiliency action unless there is also some condition indicating a Node failure or problem, such as the Node is tainted, or the array has lost connectivity to the node.
CSM for Resiliency has not yet been verified to work with ReadWriteMany or ReadOnlyMany volumes. Also, it has not been verified to work with pod controllers other than StatefulSet.
You can view a demo showing how it all looks below.