# Operations Issues

## Operator can't perform the Action

If you ask to scale up, to add a new DC, or to request more resources, CassKop asks Kubernetes to schedule what you requested. But sometimes the change cannot be achieved, either because of a lack of resources (memory/CPU) or because some constraints can't be satisfied (no Kubernetes nodes with the required labels available, ...).
CassKop makes use of a PodDisruptionBudget (PDB) to prevent any change on the CassandraCluster that could take down more than 1 Cassandra node at a time.
If you have a Pod stuck in the Pending state, then you already have at least 1 Pod in disruption, and the PDB will prevent you from making changes on the statefulsets, because those changes would mean having more than 1 Cassandra node down at a time.
The operator logs a dedicated line when it detects a disruption on the Cassandra cluster.
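To confirm the disruption from the Kubernetes side, you can inspect the PDB status. A minimal sketch, assuming CassKop named the PDB after the cluster (here cassandra-demo):

```bash
# Inspect the PodDisruptionBudget managed by CassKop for the cluster:
kubectl get pdb cassandra-demo -o yaml

# In the returned status, "disruptionsAllowed: 0" means a node is already
# in disruption, so no further disruptive change will be allowed.
```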
### Can't ScaleUp

In this example we ask for a ScaleUp, but it can't be performed: the cassandra-demo-dc1-rack1-1 pod is Pending and can't be scheduled.
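You can spot the stuck pod with a simple listing (a sketch; pod names follow the `<cluster>-<dc>-<rack>-<ordinal>` pattern used in this example):

```bash
kubectl get pods
# NAME                         READY   STATUS    RESTARTS   AGE
# cassandra-demo-dc1-rack1-0   1/1     Running   0          1h
# cassandra-demo-dc1-rack1-1   0/1     Pending   0          5m
```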
If we look at the pod's status, we see that Kubernetes can't find any node with sufficient CPU and matching the node labels we asked for in the topology section.
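To see the scheduler's exact reason, describe the pod and read the events at the bottom (a sketch; the event wording varies with the cluster):

```bash
kubectl describe pod cassandra-demo-dc1-rack1-1
# Events:
#   Warning  FailedScheduling  ...  0/6 nodes are available:
#   3 Insufficient cpu, 3 node(s) didn't match node selector.
```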
To fix this, we can either:

- reduce the memory/CPU requests and limits (see the sketch below)
- add more Kubernetes nodes that satisfy our requirements
- roll back the ScaleUp operation

Until one of these is done, CassKop will wait indefinitely for the situation to be fixed.
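A minimal sketch of the first option, lowering the resources in the CassandraCluster spec (field names follow the CassandraCluster CRD; the values are illustrative):

```yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: cassandra-demo
spec:
  nodesPerRacks: 2
  resources:
    requests:
      cpu: '1'      # lowered so that the Pending pod can fit on a node
      memory: 2Gi
    limits:
      cpu: '1'
      memory: 2Gi
```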
### Rollback ScaleUp operation

In order to roll back the operation, we need to revert the change on the nodesPerRacks parameter.
This is not sufficient: because CassKop is currently performing another action on the cluster (the ScaleUp), we can't schedule a new operation to roll it back until the current one has finished. We introduced a new parameter in the CRD to allow such changes when the pods can't all be scheduled:

Spec.unlockNextOperation: true
🚩 Warning: this is not a regular parameter and it must be used with great care!
By adding this parameter to our cluster definition, CassKop will allow one new operation to be triggered.
Once CassKop has scheduled the new operation, it resets this parameter to its default value (false). If you need yet another operation, you will need to set the parameter again to force it. Keep in mind that CassKop is meant to do only 1 operation at a time.
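As a minimal sketch (values taken from this example), the rollback edit can combine both changes:

```yaml
spec:
  nodesPerRacks: 1           # revert the ScaleUp
  unlockNextOperation: true  # one-shot: CassKop resets it to false
```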
If this was not already done in the same edit, you can now roll back the ScaleUp by updating nodesPerRacks: 1.
### Can't add new rack in new DC

In this example, we ask to add a new DC called dc2, but the last rack can't be scheduled because of insufficient CPU on the Kubernetes nodes.
We can either add the wanted resources to the Kubernetes cluster, or make a rollback.
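For reference, a sketch of what the topology addition can look like (rack names are the ones used in this example; the dc1 part is abbreviated):

```yaml
spec:
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1
        rack:
          - name: rack1
      - name: dc2
        nodesPerRacks: 1
        rack:
          - name: rack1
          - name: rack2
          - name: rack3   # this rack's pod stays Pending (insufficient cpu)
```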
#### Solution 1: rollback adding the DC

To roll back the addition of the new DC, we first need to scale down to 0 the nodes that have already joined the ring, allowing disruption as we did in the previous section.
We ask dc2 to scale down to 0 (it has already added 2 racks), and we set spec.unlockNextOperation to true.
This will allow CassKop to make the ScaleDown. Because it starts with the first rack, this frees some space, so the last pod, which was Pending, will join the ring; it will then be decommissioned by CassKop.
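A sketch of that edit on the CassandraCluster (dc names from this example):

```yaml
spec:
  unlockNextOperation: true  # required: dc2 has unscheduled pods
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1
      - name: dc2
        nodesPerRacks: 0     # scale dc2 down to 0 before removing it
```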
We can see in the CassKop logs when it deals with the rack that has unscheduled pods.
It will also scale down any pods that were part of the new DC.
Once the ScaleDown is done, you can delete the DC from the Spec.
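The final edit can then look like this sketch, with dc2 removed from the topology:

```yaml
spec:
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1
```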
#### Solution 2: change the topology for dc2 (remove the 3rd unschedulable rack)

Here we go back to the state of the previous section, before rolling back the addition of dc2.
If only one of the racks can't schedule any pods, we can change the topology to remove that rack, but ONLY if no pods were already deployed in the rack. If that is not the case, you will need to do a ScaleDown instead of removing the rack.
Let's remove rack3 from dc2:
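A sketch of the topology after the edit (rack names as used in this example):

```yaml
spec:
  topology:
    dc:
      - name: dc2
        nodesPerRacks: 1
        rack:
          - name: rack1
          - name: rack2
          # rack3 removed: allowed only because it never had running pods
```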
The operator will log the removal: rack3 (and its statefulset) is removed, and the associated (empty) PVC is deleted.
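You can verify the cleanup (a sketch; names follow the `<cluster>-<dc>-<rack>` pattern used in this example):

```bash
# The rack's statefulset should be gone:
kubectl get statefulset cassandra-demo-dc2-rack3
# Error from server (NotFound): statefulsets.apps "cassandra-demo-dc2-rack3" not found

# No leftover persistent volume claims for the removed rack:
kubectl get pvc | grep dc2-rack3
```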