A few months ago, we introduced ‘Project Karavi’. The first module we introduced was the ‘Observability’ one, followed by the ‘Authorization’ one.
Today, I’m happy to share the third module we are releasing: ‘Resiliency’.
You see, just as it took VMware years to understand all sorts of complex storage resiliency issues and how to treat them (think granular PDL vs. APD, etc.), Kubernetes is in its infancy when it comes to understanding these failure scenarios. That is why we decided to inject some smartness into it, which is the whole purpose of ‘CSM’, and to put extra IP into understanding storage failure scenarios.
User applications can have problems if you want their Pods to be resilient to node failure. This is especially true of those deployed with StatefulSets that use PersistentVolumeClaims. Kubernetes guarantees that there will never be two copies of the same StatefulSet Pod running at the same time and accessing storage. Therefore, it does not clean up StatefulSet Pods if the node executing them fails.
For the complete discussion and rationale, go to https://github.com/kubernetes/community and search for the pod-safety.md file (path: contributors/design-proposals/storage/pod-safety.md). For more background on forced deletion of Pods in a StatefulSet, please visit Force Delete StatefulSet Pods.
CSM Resiliency High Level Description
CSM Resiliency is a project designed to make Kubernetes Applications, including those that utilize persistent storage, more resilient to various failures. The first component of CSM Resiliency is a pod monitor that is specifically designed to protect stateful applications from various failures. It is not a standalone application, but rather is deployed as a sidecar to CSI (Container Storage Interface) drivers, in both the driver’s controller pods and the driver’s node pods. Deploying CSM Resiliency as a sidecar allows it to make direct requests to the driver through the Unix domain socket that Kubernetes sidecars use to make CSI requests.
Some of the methods CSM Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently there is only one, ValidateVolumeHostConnectivity, which reports whether a host is connected to the storage system and/or whether any I/O activity has recently occurred on a list of specified volumes. This allows CSM Resiliency to make more accurate determinations about the state of the system and its persistent volumes.
Accordingly, CSM Resiliency is adapted to, and qualified with, each CSI driver it is to be used with. Different storage systems have different nuances and characteristics that CSM Resiliency must take into account.
CSM Resiliency is currently in a Technical Preview Phase, and should be considered alpha software. We are actively seeking feedback from users about its features, effectiveness, and reliability. Please provide feedback using the firstname.lastname@example.org email alias. We will take that input, along with our own results from doing extensive testing, and incrementally improve the software. We do not recommend or support it for production use at this time.
CSM Resiliency is primarily designed to detect pod failures due to some kind of node failure or node communication failure. The diagram below illustrates the hardware environment that is assumed in the design.
A Kubernetes Control Plane is assumed to exist that provides the K8S API service which is used by CSM Resiliency. An arbitrary number of worker nodes (two are shown in the diagram) are connected to the Control Plane through a K8S Control Plane IP Network.
The worker nodes (e.g. Node1 and Node2) can run a mix of CSM Resiliency monitored Application Pods as well as unmonitored Application Pods. Monitored Pods are designated by a specific label that is applied to each monitored pod. The label key and value are configurable for each driver type when CSM Resiliency is installed, and must be unique for each driver instance.
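To make the label rule concrete, here is a minimal sketch of how a pod would be selected for monitoring. The key and value are the defaults quoted later in this post; the pod names and the dictionary layout are hypothetical and only illustrate the matching rule, not the actual podmon code.

```python
# Default label key and example value as quoted in this post.
LABEL_KEY = "podmon.dellemc.com/driver"
LABEL_VALUE = "csi-vxflexos"

def is_monitored(pod_labels, key=LABEL_KEY, value=LABEL_VALUE):
    """Return True only when the pod carries the exact key AND value."""
    return pod_labels.get(key) == value

# Hypothetical pods, keyed by name, with their labels.
pods = {
    "podmontest-0": {LABEL_KEY: "csi-vxflexos"},
    "other-driver-0": {LABEL_KEY: "csi-other"},
    "unlabeled-0": {"app": "web"},
}

monitored = [name for name, labels in pods.items() if is_monitored(labels)]
print(monitored)
```

Only the first pod matches, because both the key and the value have to agree with the configured ones.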
The Worker Nodes are assumed to also have a connection to a Storage System Array (such as PowerFlex.) It is often preferred that a separate network be used for storage access from the network used by the K8S control plane, and CSM Resiliency takes advantage of the separate networks when available.
CSM Resiliency does not generally try to handle any of the following errors:
Failure of the Kubernetes control plane, the etcd database used by Kubernetes, or the like. Kubernetes is generally designed to provide a highly available container orchestration system and it is assumed clients follow standard and/or best practices in configuring their Kubernetes deployments.
CSM Resiliency is generally not designed to take action upon a failure solely of the Application Pod(s). Applications are still responsible for detecting and providing recovery mechanisms should their application fail. There are some specific recommendations for applications to be monitored by CSM Resiliency that are described later.
CSM Resiliency’s design is focused on detecting the following types of hardware failures, and when they occur, moving protected pods to hardware that is functioning correctly:
Node failure. Node failure is defined to be similar to a Power Failure to the node which causes it to cease operation. This is differentiated from Node Communication Failures which require different treatments. Node failures are generally discovered by receipt of a Node event with a NoSchedule or NoExecute taint, or detection of such a taint when retrieving the Node via the K8S API.
Generally, it is difficult to distinguish from the outside whether a node is truly down (not executing) or has lost connectivity on all its interfaces. (We might add capabilities in the future to query BIOS interfaces such as iDRAC, or perhaps periodically write to file systems mounted in node-podmon to detect I/O failures, in order to get additional insight into node status.) However, if the node has simply lost all outside communication paths, the protected pods are possibly still running. We refer to these pods as “zombie pods”. CSM Resiliency is designed to deal with zombie pods in a way that prevents them from interfering with the replacement pods it may have created: it fences the failed nodes and, when communication to the node is reestablished, goes through a cleaning procedure to remove the zombie pod artifacts before allowing the node to go back into service.
K8S Control Plane Network Failure. Control Plane Network Failure often has the same K8S failure signature (the node is tainted with NoSchedule or NoExecute), however if there is a separate Array I/O interface, CSM Resiliency can often detect that the Array I/O Network may be active even though the Control Plane Network is down.
Array I/O Network failure is detected by polling the array to determine if the array has a healthy connection to the node. The capabilities to do this vary greatly by array and communication protocol type (Fibre Channel, iSCSI, NFS, NVMe, or PowerFlex SDC IP protocol). By monitoring the Array I/O Network separately from the Control Plane Network, CSM Resiliency has two different indicators of whether the node is healthy or not.
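The value of having two indicators can be sketched as a small truth table. This is a simplified illustration under my own assumptions: the two booleans stand in for "the node carries a NoSchedule/NoExecute taint" and "the array reports a healthy connection to the node", and the enum names are hypothetical rather than anything defined by CSM Resiliency.

```python
from enum import Enum

class Diagnosis(Enum):
    HEALTHY = "healthy"
    CONTROL_PLANE_NETWORK_FAILURE = "control plane network failure"
    ARRAY_IO_NETWORK_FAILURE = "array I/O network failure"
    NODE_DOWN_OR_ISOLATED = "node down or fully isolated"

def diagnose(node_tainted: bool, array_connected: bool) -> Diagnosis:
    """Combine the K8S taint signal with the array connectivity signal."""
    if node_tainted and array_connected:
        # K8S lost the node, but the array still sees it: likely a
        # control plane network problem, and pods may still be doing I/O.
        return Diagnosis.CONTROL_PLANE_NETWORK_FAILURE
    if node_tainted:
        # Both signals are bad: the node is down or has lost all paths.
        return Diagnosis.NODE_DOWN_OR_ISOLATED
    if not array_connected:
        return Diagnosis.ARRAY_IO_NETWORK_FAILURE
    return Diagnosis.HEALTHY

for tainted in (False, True):
    for connected in (True, False):
        print(tainted, connected, diagnose(tainted, connected).value)
```

With a single network carrying both kinds of traffic, the two signals tend to fail together, which is why a separate storage network gives more insight.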
PowerFlex is a highly scalable array that is very well suited to Kubernetes deployments. The CSM Resiliency support for PowerFlex leverages the following PowerFlex features:
Very quick detection of Array I/O Network Connectivity status changes (generally takes 1-2 seconds for the array to detect changes)
A robust mechanism for detecting whether Nodes are doing I/O to volumes (sampled over a 5 second period).
Low latency REST API supports fast CSI provisioning and deprovisioning operations.
A proprietary network protocol provided by the SDC component that can run over the same IP interface as the K8S control plane or over a separate IP interface for Array I/O.
CSM Resiliency Design
This section covers CSM Resiliency’s design in sufficient detail that you should be able to understand what CSM Resiliency is designed to do in various situations and how it works. CSM Resiliency is deployed as a sidecar named podmon with a CSI driver in both the controller pods and node pods. These are referred to as controller-podmon and node-podmon respectively.
Generally controller-podmon and the driver controller pods are deployed using a Deployment. The Deployments support one or multiple replicas for High Availability, and use a standard K8S leader election protocol so that only one controller is active at a time (as does the driver and all the controller sidecars.) The controller deployment also supports a Node Selector that allows the controllers to be placed on K8S Manager (non Worker) nodes.
Node-podmon and the driver node pods are deployed in a DaemonSet, with a Pod deployed on every K8S Worker Node.
Controller-podmon is responsible for:
Setting up a Watch for CSM Resiliency labeled pods, and if a Pod is Initialized but Not Ready and resident on a Node with a NoSchedule or NoExecute taint, calling controllerCleanupPod to clean up the pod so that a replacement pod can be scheduled.
Periodically polling the arrays to see if it has connectivity to the nodes that are hosting CSM Resiliency labeled pods (if enabled.) If an array has lost connectivity to a node hosting CSM Resiliency labeled pods using that array, controllerCleanupPod is invoked to clean up the pods that have lost I/O connectivity.
Tainting nodes that have failed so that a) no further pods will get scheduled to them until they are returned to service, and b) node-podmon, upon seeing the taint, will invoke the cleanup operations to make sure any zombie pods (pods that have been replaced) cannot write to the volumes they were using.
If a CSM Resiliency labeled pod enters a CrashLoopBackOff state, deleting that pod so it can be replaced.
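The responsibilities above boil down to a decision per labeled pod. The following is a rough sketch of that decision, not the real controller-podmon logic: the dataclasses, field names, and the returned strings are all hypothetical, and the ordering of the checks is my own assumption.

```python
from dataclasses import dataclass

# Taint effects that the post says indicate a failed node.
FAILURE_TAINT_EFFECTS = {"NoSchedule", "NoExecute"}

@dataclass
class PodState:
    initialized: bool
    ready: bool
    crash_loop_backoff: bool = False

@dataclass
class NodeState:
    taint_effects: frozenset = frozenset()
    array_connected: bool = True

def controller_action(pod: PodState, node: NodeState,
                      array_polling_enabled: bool = True) -> str:
    """Return 'delete', 'cleanup', or 'none' for a labeled pod."""
    if pod.crash_loop_backoff:
        return "delete"  # delete the pod so it can be replaced
    node_tainted = bool(FAILURE_TAINT_EFFECTS & node.taint_effects)
    if pod.initialized and not pod.ready and node_tainted:
        return "cleanup"  # the controllerCleanupPod path
    if array_polling_enabled and not node.array_connected:
        return "cleanup"  # array lost I/O connectivity to the node
    return "none"

not_ready = PodState(initialized=True, ready=False)
print(controller_action(not_ready, NodeState()))
print(controller_action(not_ready, NodeState(frozenset({"NoExecute"}))))
```

Note that a pod that is merely Not Ready on an untainted, connected node yields no action: some node-level symptom has to be present as well.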
ControllerCleanupPod cleans up the pod by taking the following actions:
The VolumeAttachments (VAs) are loaded, and all VAs belonging to the pod being cleaned up are identified. The PVs for each VolumeAttachment are identified and used to get the Volume Handle (array identifier for the volume.)
If enabled, the array is queried to determine whether any of the pod’s volumes are still doing I/O. If so, cleanup is aborted.
The pod’s volumes are “fenced” from the node the pod resides on to prevent any potential I/O from a zombie pod. This is done by calling the CSI ControllerUnpublishVolume call for each of the volumes.
A taint is applied to the node to keep any new pods from being scheduled to the node. If the replacement pod were to get scheduled to the same node as a zombie pod, they might both gain access to the volume concurrently causing corruption.
The VolumeAttachments for the pod are deleted. This is necessary so the replacement pod to be created can attach the volumes.
The pod is forcibly deleted, so that a StatefulSet controller which created the pod is free to create a replacement pod.
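The ordering of these steps matters, so here is a sketch that produces the cleanup plan as an ordered list instead of touching a real cluster. The function name, its parameters, and the action tuples are hypothetical; only the sequence and the abort-on-I/O rule come from the description above.

```python
def controller_cleanup_plan(pod, node, volume_handles, io_in_progress,
                            skip_array_validation=False):
    """Return the ordered cleanup actions, or [] if cleanup is aborted.

    volume_handles: array identifiers found via the pod's VolumeAttachments.
    io_in_progress: callable(handle) -> bool, standing in for the array query.
    """
    # Abort if the array says any volume is still doing I/O (when enabled).
    if not skip_array_validation and any(io_in_progress(h) for h in volume_handles):
        return []

    plan = []
    # 1. Fence every volume from the node before anything else.
    for handle in volume_handles:
        plan.append(("ControllerUnpublishVolume", handle, node))
    # 2. Taint the node so the replacement pod cannot land next to a zombie.
    plan.append(("taint-node", node))
    # 3. Delete the VolumeAttachments so the replacement can attach.
    for handle in volume_handles:
        plan.append(("delete-volume-attachment", handle))
    # 4. Force delete the pod so its controller can create a replacement.
    plan.append(("force-delete-pod", pod))
    return plan

plan = controller_cleanup_plan("podmontest-0", "node1", ["vol-a", "vol-b"],
                               io_in_progress=lambda handle: False)
for step in plan:
    print(step)
```

Fencing comes first on purpose: once the array has revoked access, a zombie pod on the failed node can no longer write, so the later steps are safe even if the node is still running.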
Node-podmon has the following responsibilities:
Establishing a pod watch which is used to maintain a list of pods executing on this node that may need to be cleaned up. The list includes information about each Mount volume or Block volume used by the pod including the volume handle, volume name, private mount path, and mount path in the pod.
Periodically (every 30 seconds) polling to see if controller-podmon has applied a taint to the node. If so, node-podmon calls nodeModeCleanupPod for each pod to clean up any remnants of the pod (which is potentially a zombie pod.)
If all pods have been successfully cleaned up, and there are no labeled pods on this node still existing, only then will node-podmon remove the taint placed on the node by controller-podmon.
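The taint-removal rule is deliberately conservative. A minimal sketch of one polling cycle, with hypothetical function and parameter names, might look like this:

```python
def node_podmon_cycle(node_tainted, tracked_pods, cleanup_pod,
                      labeled_pods_still_present):
    """Run one polling cycle; return True if the taint may be removed.

    cleanup_pod: callable(pod) -> bool, True when the pod's remnants
    were fully cleaned (stands in for nodeModeCleanupPod).
    """
    if not node_tainted:
        return False  # nothing to do until controller-podmon taints the node
    # Attempt cleanup for every tracked pod, even if an earlier one fails.
    results = [cleanup_pod(pod) for pod in tracked_pods]
    # Remove the taint only if everything is clean and no labeled pod remains.
    return all(results) and not labeled_pods_still_present

ok = node_podmon_cycle(True, ["a", "b"], lambda pod: True, False)
stuck = node_podmon_cycle(True, ["a", "b"], lambda pod: pod != "b", False)
print(ok, stuck)  # True False
```

A single pod that cannot be cleaned keeps the taint in place, which is the situation that leads to the manual reboot described in the limitations further down.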
NodeModeCleanupPod cleans up the pod remnants by taking the following actions for each volume used by the pod:
Calling NodeUnpublishVolume to unpublish the volume from the pod.
Unmounting and deleting the target path for the volume.
Calling NodeUnstageVolume to unstage the volume from the node.
Unmounting and deleting the staging path for the volume.
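The per-volume order (pod-level target path first, node-level staging path second) can be written down as a short sketch. The dictionary keys and the example paths are hypothetical; only the CSI method names and the order are taken from the list above.

```python
def node_cleanup_steps(volume):
    """Return the ordered cleanup steps for one volume used by a pod."""
    return [
        ("NodeUnpublishVolume", volume["target_path"]),
        ("unmount-and-delete", volume["target_path"]),
        ("NodeUnstageVolume", volume["staging_path"]),
        ("unmount-and-delete", volume["staging_path"]),
    ]

# Hypothetical paths, for illustration only.
volume = {
    "target_path": "/example/pods/pod-uid/volumes/pvc-1/mount",
    "staging_path": "/example/staging/pvc-1",
}
for step in node_cleanup_steps(volume):
    print(step)
```

The order mirrors the reverse of how a volume is brought up: it is staged on the node first and then published into the pod, so teardown unpublishes first and unstages last.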
There are some limitations with the current design. Some might be addressed in the future; others are inherent in the approach.
The design relies on the array’s ability to revoke access to a volume for a particular node for the fencing operation. The granularity of access control for a volume is per node. Consequently it isn’t possible to revoke access from one pod on a node while retaining access to another pod on the same node if we cannot communicate with the node. The implications of this are that if more than one pod on a node is sharing the same volume(s), they all must be protected by CSM Resiliency, and they all must be cleaned up by controller-podmon if the node fails. If only some of the pods are cleaned up, the other pods will lose access to the volumes shared with pods that have been cleaned, so those pods should also fail.
The node-podmon cleanup algorithm purposefully will not remove the node taint until all the protected volumes have been cleaned up from the node. This works well if the node fault lasts long enough that controller-podmon can evacuate all the protected pods from the node. However if the failure is short lived, and controller-podmon does not clean up all the protected pods on the node, or if for some reason node-podmon cannot clean a pod completely, the taint is left on the node, and manual intervention is required. The required intervention is for the operator to reboot the node, which will ensure that no zombie pods survive. Upon seeing the reboot, node-podmon will then remove the taint.
If the node failure is short lived, and controller-podmon has not evacuated some of the protected pods on the node, they may try to restart on the same node. This has been observed to cause such pods to go into CrashLoopBackOff. We are currently considering solutions to this problem.
Deploying CSM Resiliency
CSM Resiliency is deployed as part of the CSI driver deployment. The drivers can be deployed either by a helm chart or by the Dell CSI Operator. For the alpha (Tech. Preview) phase, only helm chart installation is supported.
For information on the PowerFlex CSI driver, see the PowerFlex CSI Driver repository (https://github.com/dell/csi-powerflex).
Configure all the helm chart parameters described below before deploying the drivers.
Helm Chart Installation
These installation instructions apply to the helm chart in version v1.4.0 of the PowerFlex CSI Driver repository (https://github.com/dell/csi-powerflex). After the PowerFlex driver release, a change was identified that needs to be made to the helm chart, specifically to the file helm/csi-vxflexos/templates/node.yaml. It is a simple two-line addition to the podmon container section of the chart. Please make this change before deploying podmon.
The diff is as follows:
@@ -113,8 +113,10 @@ spec:
 - name: kubelet-pods
+ mountPropagation: "Bidirectional"
 - name: driver-path
+ mountPropagation: "Bidirectional"
 - name: usr-bin
For reference, the entire node.yaml file with the change applied is available here: node.yaml.
The drivers that support Helm chart deployment allow CSM Resiliency to be optionally deployed by variables in the chart. There is a podmon block specified in the values.yaml file of the chart that will look similar to the text below by default:
# Podmon is an optional feature under development and tech preview.
# Enable this feature only after contact support for additional information
# - "-csisock=unix:/var/run/csi/csi.sock"
# - "-labelvalue=csi-vxflexos"
# - "-mode=controller"
# - "-csisock=unix:/var/lib/kubelet/plugins/vxflexos.emc.dell.com/csi_sock"
# - "-labelvalue=csi-vxflexos"
# - "-mode=node"
# - "-leaderelection=false"
To deploy CSM Resiliency with the driver, the following changes are required:
Enable CSM Resiliency by changing the podmon.enabled boolean to true. This will enable both controller-podmon and node-podmon.
Specify the podmon image to be used as podmon.image.
Specify arguments to controller-podmon in the podmon.controller.args block. See “Podmon Arguments” below. Note that some arguments are required, and that the arguments supplied to controller-podmon differ from those supplied to node-podmon.
Specify arguments to node-podmon in the podmon.node.args block. See “Podmon Arguments” below. Again, some arguments are required, and they differ from those supplied to controller-podmon.
|Argument|Required|Description|Applicability|
|---|---|---|---|
|enabled|Yes|Boolean "true" enables Karavi Resiliency deployment with the driver in a helm installation.|top level|
|image|Yes|Must be set to a repository where the podmon image can be pulled.|controller & node|
|mode|Yes|Must be set to "controller" for controller-podmon and "node" for node-podmon.|controller & node|
|csisock|Yes|This should be left as set in the helm template for the driver. For controller: "-csisock=unix:/var/run/csi/csi.sock". For node it will vary depending on the driver’s identity, e.g. "-csisock=unix:/var/lib/kubelet/plugins/vxflexos.emc.dell.com/csi_sock"|controller & node|
|leaderelection|Yes|Boolean value that should be set true for controller and false for node. The default value is true.|controller & node|
|skipArrayConnectionValidation|Optional|Boolean value that, if set to true, will cause controllerCleanupPod to skip the validation that no I/O is ongoing before cleaning up the pod.|controller|
|labelKey|Optional|String value that sets the label key used to denote pods to be monitored by Karavi Resiliency. It will make life easier if this key is the same for all driver types, and drivers are differentiated by different labelValues (see below). If the label keys are the same across all drivers you can do "kubectl get pods -A -l labelKey" to find all the Karavi Resiliency protected pods. labelKey defaults to "podmon.dellemc.com/driver".|controller & node|
|labelValue|Yes|String that sets the value that denotes pods to be monitored by Karavi Resiliency. This must be specific for each driver. Defaults to "csi-vxflexos".|controller & node|
|arrayConnectivityPollRate|Optional|The minimum polling rate in seconds to determine if the array has connectivity to a node. Should not be set to less than 5 seconds. See the specific section for each array type for additional guidance.|controller|
|arrayConnectivityConnectionLossThreshold|Optional|Gives the number of failed connection polls that will be deemed to indicate array connectivity loss. Should not be set to less than 3. See the specific section for each array type for additional guidance.|controller|
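As an illustration of how the controller and node argument lists differ, here is a small helper that builds them. It is only a sketch: the function is hypothetical, and it uses just the four flags that appear in the values.yaml excerpt above (csisock, labelvalue, mode, leaderelection), since the exact flag spelling of the other arguments is not shown in this post.

```python
def podmon_args(mode, csisock, labelvalue="csi-vxflexos"):
    """Build the args list for one podmon sidecar, as shown in values.yaml."""
    if mode not in ("controller", "node"):
        raise ValueError("mode must be 'controller' or 'node'")
    args = [
        f"-csisock={csisock}",
        f"-labelvalue={labelvalue}",
        f"-mode={mode}",
    ]
    if mode == "node":
        # Leader election defaults to true, so only the node needs the flag.
        args.append("-leaderelection=false")
    return args

controller_args = podmon_args("controller", "unix:/var/run/csi/csi.sock")
node_args = podmon_args(
    "node", "unix:/var/lib/kubelet/plugins/vxflexos.emc.dell.com/csi_sock")
print(controller_args)
print(node_args)
```

The socket paths are the ones from the chart excerpt; for another driver the node socket path would differ.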
PowerFlex Specific Recommendations
PowerFlex supports a very robust array connection validation mechanism that can detect changes in connectivity in about two seconds and can detect whether I/O has occurred over a five second sample. For that reason it is recommended to leave “skipArrayConnectionValidation=false” (which is the default), to set “arrayConnectivityPollRate=5” (5 seconds), and to set “arrayConnectivityConnectionLossThreshold” to 3 or more.
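A quick back-of-the-envelope check of what these settings imply: assuming connectivity loss is declared after the threshold number of consecutive failed polls, spaced one poll interval apart (my simplification, which ignores the array's own 1-2 second detection time and any scheduling delay), the time to declare a loss is roughly the product of the two values.

```python
def approx_detection_seconds(poll_rate_s=5, loss_threshold=3):
    """Rough time until array connectivity loss is declared."""
    # Enforce the minimums recommended in the arguments table.
    if poll_rate_s < 5:
        raise ValueError("arrayConnectivityPollRate should not be below 5")
    if loss_threshold < 3:
        raise ValueError("arrayConnectivityConnectionLossThreshold should not be below 3")
    return poll_rate_s * loss_threshold

print(approx_detection_seconds())        # 15
print(approx_detection_seconds(10, 4))   # 40
```

So with the recommended PowerFlex values, a node that loses its array connection would be flagged after about 15 seconds under these assumptions.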
Here is a typical deployment used for testing:
Deploying and Managing Applications Protected by CSM Resiliency
The first thing to remember about CSM Resiliency is that it only takes action on pods configured with the designated label. Both the key and the value have to match what is in the podmon helm configuration. CSM Resiliency emits a log message at startup with the label key and value it is using to monitor pods:
The above message indicates the key is: podmon.dellemc.com/driver and the label value is csi-vxflexos. To search for the pods that would be monitored, try this:
[root@lglbx209 podmontest]# kubectl get pods -A -l podmon.dellemc.com/driver=csi-vxflexos
NAMESPACE NAME READY STATUS RESTARTS AGE
pmtu1 podmontest-0 1/1 Running 0 3m7s
pmtu2 podmontest-0 1/1 Running 0 3m8s
pmtu3 podmontest-0 1/1 Running 0 3m6s
If CSM Resiliency detects a problem with a pod caused by a node or other failure that it can initiate remediation for, it will add an event to that pod’s events:
kubectl get events -n pmtu1
…
61s Warning NodeFailure pod/podmontest-0 podmon cleaning pod [7520ba2a-cec5-4dff-8537-20c9bdafbe26 node.example.com] with force delete
…
CSM Resiliency may also generate events if it is unable to clean up a pod for some reason. For example, it may not clean up a pod because the pod is still doing I/O to the array.
Before putting an application into production that relies on CSM Resiliency monitoring, it is important to do a few test failovers first. To do this, take the node that is running the pod offline for at least 2-3 minutes. Verify that an event message similar to the one above is logged, and that the pod recovers and restarts normally with no loss of data. (Note that if the node is running many CSM Resiliency protected pods, the node may need to be down longer for CSM Resiliency to have time to evacuate all the protected pods.)
It is recommended that pods that will be monitored by CSM Resiliency be configured to exit if they receive any I/O errors. That should help achieve the recovery as quickly as possible.
CSM Resiliency does not directly monitor application health. However if standard Kubernetes health checks are configured, that may help reduce pod recovery time in the event of node failure, as CSM Resiliency should receive an event that the application is Not Ready. Note that a Not Ready pod is not sufficient to trigger CSM Resiliency action unless there is also some condition indicating a Node failure or problem, such as the Node is tainted, or the array has lost connectivity to the node.
As noted previously in the Limitations and Exclusions section, CSM Resiliency has not yet been verified to work with ReadWriteMany or ReadOnlyMany volumes. Also it has not been verified to work with pod controllers other than StatefulSet.
Recovering from Failures
Normally CSM Resiliency should be able to move pods that have been impacted by Node Failures to a healthy node, and after the failed nodes have come back on line, clean them up (especially any potential zombie pods) and then automatically remove the CSM Resiliency node taint that prevents pods from being scheduled to the failed node(s). There are a few cases where this cannot be fully automated and operator intervention is required, including:
CSM Resiliency expects that when a node failure occurs, all CSM Resiliency labeled pods are evacuated and rescheduled on other nodes. This process may not complete, however, if the node comes back online before CSM Resiliency has had time to evacuate all the labeled pods. The remaining pods may not restart correctly, going to “Error” or “CrashLoopBackOff”. We are considering some possible remediations for this condition but have not implemented them yet.
If this happens, try deleting the pod with “kubectl delete pod …”. In our experience this normally will cause the pod to be restarted and transition to the “Running” state.
Node-podmon is responsible for cleaning up failed nodes after the node’s communication has been restored. The algorithm checks that all the monitored pods have terminated and that their volumes and mounts have been cleaned up.
If some of the monitored pods are still executing, node-podmon will emit the following log message at the end of a cleanup cycle (and retry the cleanup after a delay):
pods skipped for cleanup because still present: <pod-list>
If this happens, DO NOT manually remove the CSM Resiliency node taint. Doing so could cause data corruption if volumes were not cleaned up and a pod using those volumes was subsequently scheduled to that node.
The correct course of action in this case is to reboot the failed node(s) that have not removed their taints in a reasonable time (5-10 minutes after the node is online again.) The operator can delay executing this reboot until it is convenient, but new pods will not be scheduled to it in the interim. This reboot will kill any potential zombie pods. After the reboot, node-podmon should automatically remove the node taint after a short time.
You can access the GitHub repo by clicking the screenshot below (GitHub – dell/karavi-resiliency: A Kubernetes pod monitor for safely terminating pods with persistent volumes in case of node failures).
And below, you can see a demo showing how it all works.