# Cluster Operations
Here is the list of operations managed by CassKop at the cluster level, each of which has a dedicated status in each rack. These operations are applied at the Cassandra cluster level, as opposed to Pod operations, which are executed at the pod level and are discussed in the next section.
Cluster operations must only be triggered by a change made on the `CassandraCluster` object.

Some updates to the `CassandraCluster` CRD object are forbidden and will be gently dismissed by CassKop:

- `spec.dataCapacity`
- `spec.dataStorage`
Some updates to the `CassandraCluster` CRD object will trigger a rolling update of the whole cluster, such as:

- `spec.resources`
- `spec.baseImage`
- `spec.version`
- `spec.configMap`
- `spec.runAsUser`
- `spec.fsGroup`
Some updates to the `CassandraCluster` CRD object will not trigger any change on the cluster, but only affect the future behavior of CassKop:

- `spec.autoPilot`
- `spec.autoUpdateSeedList`
- `spec.deletePVC`
- `spec.hardAntiAffinity`
- `spec.rollingPartition`
- `spec.maxPodUnavailable`
- `noCheckStsAreEqual`
CassKop manages rolling updates for each statefulset in the cluster. Each statefulset then performs the rolling update of its pods according to the partition defined for it in `spec.topology.dc[].rack[].rollingPartition`.
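As a minimal sketch, with illustrative DC/rack names, the partition sits at the rack level of the topology:

```yaml
spec:
  topology:
    dc:
      - name: dc1
        rack:
          - name: rack1
            # Only pods with an ordinal >= rollingPartition are updated;
            # 0 (the default) rolls every pod in the rack.
            rollingPartition: 0
```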
## Initializing

The first operation required in a Cassandra cluster is the initialization. In this phase, CassKop creates the `CassandraCluster.Status` section with an entry for each DC/rack declared in the `CassandraCluster.spec.topology` section.
We can also see an Initializing status if we later decide to add a DC to our topology.
### With no `topology` defined

For this demo, we will create a CassandraCluster without a topology section.
If no `topology` has been specified, then CassKop creates the default topology and status.
The default topology added by CassKop is as follows.
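A sketch of what this default might look like, assuming CassKop's default `dc1`/`rack1` naming:

```yaml
spec:
  topology:
    dc:
      - name: dc1
        rack:
          - name: rack1
```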
`CassandraCluster.spec.nodesPerRacks` defines the number of Cassandra nodes CassKop must create in each of its racks. In our example there is only the one default rack, so CassKop will only create 2 nodes.
**Important:** with the default topology there will be no Kubernetes node affinity to spread the Cassandra nodes across the cluster. In this case, CassKop creates only one rack and one DC for Cassandra. This is not recommended, as you may lose data in case of hardware failure.
When initialization has ended, you should have a status similar to the following.
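A hedged sketch of such a status, using the field names referenced throughout this page (the pod DNS name in the seed list is illustrative):

```yaml
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        name: Initializing
        status: Done
  lastClusterAction: Initializing
  lastClusterActionStatus: Done
  phase: Running
  seedlist:
    - cassandra-demo-dc1-rack1-0.cassandra-demo.default.svc.cluster.local
```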
- The status of `dc1-rack1` is `Initializing=Done`
- The status of the cluster is `Initializing=Done`
- The phase is `Running`, which means that each rack has the desired number of nodes.
We asked for 2 `nodesPerRacks` and we have one default rack, so we end up with 2 Cassandra nodes in our cluster.
The Cassandra seed list has been initialized and stored in `CassandraCluster.status.seedlist`. It has also been configured in each of the Cassandra pods.
We can also confirm that Cassandra knows about the DC and rack names we have deployed.
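A hedged example of this check from inside a pod (the pod name, IPs, and host IDs are illustrative; the layout is that of `nodetool status`):

```console
$ kubectl exec -it cassandra-demo-dc1-rack1-0 -- nodetool status
Datacenter: dc1
===============
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns   Host ID    Rack
UN  10.233.93.30   108 KiB   256     100%   9b1f...    rack1
UN  10.233.94.12   112 KiB   256     100%   3c2a...    rack1
```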
### With `topology` defined

In this example, I added a topology defining 2 Cassandra DCs and 3 racks in total.
In this topology section I also reference some Kubernetes node labels, which will be used to spread the Cassandra nodes of each rack across different groups of Kubernetes servers.
**Note:** we can see here that we can give a specific configuration for the number of pods in `dc2` (`nodesPerRacks: 3`).

Cassandra pods can also be configured with a different `num_tokens` configuration for each DC, using the appropriate parameter in the config.
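A sketch of such a topology (the label keys/values and DC/rack names are illustrative; `numTokens` is assumed to be the per-DC parameter mentioned above):

```yaml
spec:
  nodesPerRacks: 2
  topology:
    dc:
      - name: dc1
        labels:
          location.myorg.com/site: site1
        rack:
          - name: rack1
            labels:
              location.myorg.com/street: street1
          - name: rack2
            labels:
              location.myorg.com/street: street2
      - name: dc2
        nodesPerRacks: 3   # overrides the global value for this DC
        numTokens: 32      # per-DC num_tokens configuration
        labels:
          location.myorg.com/site: site2
        rack:
          - name: rack1
            labels:
              location.myorg.com/street: street3
```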
CassKop creates a statefulset for each rack and starts building the Cassandra cluster, beginning with the nodes of the first rack. When CassKop has finished operating on a rack, it processes the next one, and so on.

The status may be similar to:
The creation of the cluster is ongoing. We can see that, based on the cluster topology, CassKop has created the seed list.
**Tip:** CassKop computes a seed list with 3 nodes in each datacenter (if possible). The Cassandra seeds are always the first Cassandra nodes of a statefulset (starting with index 0).
When all racks are in status `Done`, `CassandraCluster.status.lastClusterActionStatus` is changed to `Done`.
We can see that internally Cassandra also knows the desired topology:
## UpdateConfigMap

You can find in the cassandra-configuration section how to use the `spec.configMap` parameter.
**Important:** CassKop currently doesn't monitor changes inside the ConfigMap. If you want to change a parameter in a file of the current ConfigMap, you must create a new ConfigMap with the updated version and then ask CassKop to use the new ConfigMap name.
If we add/change/remove `CassandraCluster.spec.configMapName`, then CassKop starts a rolling update of the Cassandra nodes in each rack, starting from the first rack defined in the `topology`.
First we need to create the example ConfigMap.
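As a sketch, assuming the configuration files live in a local `conf/` directory and the new ConfigMap is named `cassandra-configmap-v2`:

```console
$ kubectl create configmap cassandra-configmap-v2 --from-file=conf/
```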
Then we apply the changes in the `CassandraCluster`.
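A minimal sketch of the change in the spec (the ConfigMap name carries on from the previous step):

```yaml
spec:
  configMapName: cassandra-configmap-v2
```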
We can see the `CassandraCluster.Status` updated by CassKop.
**Note:** CassKop won't start a rolling update on the next rack until the status of the current rack becomes `Done`. The operation proceeds rack by rack.
## UpdateDockerImage

CassKop allows you to change the Cassandra docker image and gracefully redeploy your whole cluster.
If we change `CassandraCluster.spec.baseImage` and/or `CassandraCluster.spec.version`, CassKop performs a rolling update on the whole cluster (each rack sequentially) in order to change the version of the Cassandra docker image on all nodes. A sketch of such a change follows the list below.
You can change the docker image used to:

- change the version of Cassandra
- change the version of Java
- change some configuration parameters for Cassandra or the JVM, if you don't overwrite them with a ConfigMap
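A minimal sketch of such a change (the image name and version values are illustrative):

```yaml
spec:
  baseImage: orangeopensource/cassandra-image
  version: 3.11.6
```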
The status may be similar to:
We can see that CassKop has started to update `dc1-rack1` and has changed the `lastClusterAction` and `lastClusterActionStatus` accordingly.
Once it has finished the first rack, it processes the next one:

And when all racks are `Done`:

This provides a central view for monitoring what is happening in the Cassandra cluster.
## UpdateResources

CassKop allows you to configure your Cassandra pods' resources (memory and CPU).
If we change `CassandraCluster.spec.resources`, then CassKop performs a rolling update on the whole cluster (each rack sequentially) to apply the new resources on all nodes.
For example, we can increase memory/CPU requests and/or limits.
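A minimal sketch of such a change, with illustrative values:

```yaml
spec:
  resources:
    requests:
      cpu: '2'
      memory: 2Gi
    limits:
      cpu: '2'
      memory: 2Gi
```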
Then CassKop should output the status:
We can see that it has staged the `UpdateResources` action in all racks (`status=ToDo`) and has started the action in the first rack (`status=Ongoing`). Once `Done`, it follows with the next rack, and so on.
Upon completion, the status may look like:
## Scaling the cluster

The scaling of the cluster is managed through the `nodesPerRacks` parameters and through the number of DCs and racks defined in the `topology` section.
See the NodesPerRacks section.
**Note:** if the ScaleUp (or ScaleDown) changes the seed list and `spec.autoUpdateSeedList` is set to `true`, then CassKop schedules a new operation, `UpdateSeedList`, which triggers a rolling update to apply the new seed list on all nodes once the scaling is done.
### ScaleUp

CassKop allows you to scale up your Cassandra cluster.
The global parameter `CassandraCluster.spec.nodesPerRacks` specifies the number of Cassandra nodes we want in a rack. It is possible to override this for a particular DC with `CassandraCluster.spec.topology.dc[<idx>].nodesPerRacks`.
Example:
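A sketch of the change, continuing with the illustrative two-DC topology above:

```yaml
spec:
  nodesPerRacks: 2
  topology:
    dc:
      - name: dc1
        rack:
          - name: rack1   # unchanged, uses the global nodesPerRacks
      - name: dc2
        nodesPerRacks: 3  # scale up dc2
        rack:
          - name: rack1
```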
In this case, we ask to scale up the nodes of the second DC, `dc2`.

CassKop takes the new target into account and starts applying modifications to the cluster:
We can see that CassKop:

- has started the `ScaleUp` action in `dc2-rack1`
- has found that the seed list must be updated and, because `autoUpdateSeedList=true`, has staged (`status=Configuring`) the `UpdateSeedList` operation for `dc1-rack1` and `dc1-rack2`
When CassKop ends the ScaleUp action in `dc2-rack1`, it also stages this rack with `UpdateSeedList=Configuring`. Once all racks are in this state, CassKop turns each rack to status `UpdateSeedList=ToDo`, meaning that it can start the operation.
From then on, CassKop iterates on each rack, one after the other, with the status:

- `UpdateSeedList=Ongoing`: it is currently doing a rolling update on the rack to update the seed list; this also sets the `startTime`.
- `UpdateSeedList=Done`: the operation is done; this sets the `endTime`.
See the evolution of the status:

Here is the final topology seen from nodetool. Note that nodetool prints node IPs while Kubernetes works with names:
After the ScaleUp has finished, CassKop must execute a Cassandra `cleanup` on each node of the cluster. This can be manually triggered by setting the appropriate labels on each pod.
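A hedged sketch of such a manual trigger, assuming the pod-operation labels `operation-name` and `operation-status` described in the Pod operations section (the pod name is illustrative):

```console
$ kubectl label pod cassandra-demo-dc1-rack1-0 operation-name=cleanup operation-status=ToDo --overwrite
```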
CassKop can automate this if `spec.autoPilot` is `true`, by setting the labels on each pod of the cluster with a `ToDo` state and then finding those pods to sequentially execute the actions. See the podOperation Cleanup section.
### ScaleDown

For ScaleDown, CassKop must perform a clean Cassandra `decommission` before actually scaling down the cluster at the Kubernetes level.

Currently, CassKop requests the decommission through a Jolokia call and waits for it to be performed (Cassandra node status = decommissioned) before updating the Kubernetes statefulset (removing the pod).
**Important:** if we ask to scale down more than 1 node at a time, CassKop iterates one scale-down at a time until it reaches the requested number of nodes. Also, CassKop will refuse to scale a DC down to 0 if some data is still replicated to it.
To launch a ScaleDown, we simply need to decrease the value of `nodesPerRacks`.
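A sketch, continuing the previous example by shrinking `dc2` back down:

```yaml
spec:
  topology:
    dc:
      - name: dc2
        nodesPerRacks: 2  # scaled down from 3
```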
We can see in the example below that:

- it has started the `ScaleDown` action in `dc2-rack1`
- CassKop has found that the seed list must be updated, and has staged it (`status=ToDo`) for `dc1-rack1` and `dc1-rack2`
When CassKop completes the ScaleDown in `dc2-rack1`, it stages that rack with `UpdateSeedList=ToDo` as well. Once all racks are in this state, CassKop turns each rack to status `UpdateSeedList=Ongoing`, meaning that it can start the operation; it also sets the `startTime`.
Then, CassKop iterates on each rack, one after the other, with the status:

- `UpdateSeedList=Finalizing`: it is currently doing a rolling update on the rack to update the seed list.
- `UpdateSeedList=Done`: the operation is done; this sets the `endTime`.
When `ScaleDown=Done`, CassKop starts the UpdateSeedList operation. The status also shows that the `decommission` `podLastOperation` is `Done`. CassKop then performs a rolling update on all racks, one by one, in order to update the Cassandra seed list.
## UpdateSeedList

The UpdateSeedList is done automatically by CassKop when the parameter `CassandraCluster.spec.autoUpdateSeedList` is `true` (the default).
## CorrectCRDConfig

The `CassandraCluster` CRD is used to define your cluster configuration. Some fields can't be updated in a Kubernetes cluster. Some fields are taken from the CRD to configure Kubernetes objects, and to make sure we don't update them (which would put Kubernetes objects in error), CassKop is configured to simply ignore/revert unauthorized changes to the CRD.
Example: consider this CRD deployed.
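A sketch of the relevant part of such a CRD (the values are illustrative):

```yaml
spec:
  dataCapacity: 200Mi
  dataStorageClass: local-storage
```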
If we try to update `dataCapacity` or `dataStorageClass`, nothing will happen, and we can see messages to that effect in the CassKop logs:
If you made the modification by updating your local CRD file and applying it with kubectl, you must revert it to the old value.
## Delete a DC

- Before deleting a DC, you must first scale all its racks down to 0; if not, CassKop will refuse and correct the CRD.
- Before scaling down to 0, CassKop ensures that no more data is replicated to the DC; if not, CassKop will refuse and correct the CRD. Because CassKop wants the same number of pods in all racks, we decided not to allow removing only a rack; such a change will be reverted too.
**Important:** you must scale down to 0 before you remove a DC, and you must change the replication factor before scaling a DC down to 0.
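As a hedged example of that replication-factor change, removing the DC from a keyspace's replication map in cqlsh might look like this (keyspace and DC names are illustrative):

```sql
-- dc2 is dropped from the replication map, so no data is replicated to it anymore
ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
```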
## Kubernetes node maintenance operation

In a normal production environment, CassKop will have spread its Cassandra pods across different Kubernetes nodes. If the team in charge of the machines needs to perform some operations on a host, they can drain it.

The Kubernetes drain command asks the scheduler to evict all pods on the target node, and for many workloads Kubernetes will reschedule them on other machines. CassKop's Cassandra pods, however, won't be scheduled on another host, because they use local storage and are pinned to a specific host by the PersistentVolumeClaim Kubernetes object.
Example: we drain node008 for a maintenance operation.
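A sketch of the command (the node name is illustrative; the data-deletion flag name varies with your kubectl version):

```console
$ kubectl drain node008 --ignore-daemonsets --delete-emptydir-data
```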
All pods are evicted; those that can be are rescheduled on other hosts. Our Cassandra pod can't be scheduled elsewhere because of its PVC, and we can see messages like this in the Kubernetes events:

They explain that 1 node is unschedulable (the one we just drained), and that our pod can't be scheduled on the 5 other nodes because of a volume node affinity conflict (our pod has an affinity to node008).
Once the team has finished the maintenance operation, they can bring the host back into the Kubernetes cluster. From then on, Kubernetes can reschedule the Cassandra pod so that it can rejoin the ring.
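Bringing the host back is typically done with an uncordon (the node name is illustrative):

```console
$ kubectl uncordon node008
```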
The pending pod is immediately rescheduled and started on the host. If the interruption was not too long, there is nothing more to do: the node joins the ring and re-synchronizes with the cluster. If it was too long, you may need to schedule some of the pod operations described in the next sections of this document.
## The PodDisruptionBudget (PDB) protection

If a Kubernetes admin asks to drain a node, this may not be allowed by the CassandraCluster, depending on its current state and the configuration of its PDB (usually only 1 pod is allowed to be in disruption).
Example:
node008 will be flagged as SchedulingDisabled, so that it won't take new workloads. The drain evicts all the pods it can, but if there is already an ongoing disruption on the Cassandra cluster, it won't be allowed to evict the Cassandra pod.
Example of a PDB.
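A hedged sketch of what that might look like (names and figures are illustrative, but consistent with the 14-pod example below):

```console
$ kubectl get pdb
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
cassandra-demo   N/A             1                 0                     1d
```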
In this example, we allow only 1 pod to be unavailable; our cluster wants 14 pods but only 13 are healthy, which is why the PDB won't allow the eviction of an additional pod.

To be able to continue, we need to wait, or take appropriate action, so that the Cassandra cluster no longer has any unavailable nodes.
## K8S host major failure: replacing a cassandra node

In the case of a major host failure, it may not be possible to bring the node back to life. In this case we can consider our Cassandra node lost and will want to replace it on another host.

We then have 2 solutions, both of which require some manual actions:
### Remove old node and create new one

- First, use the CassKop client to schedule a Cassandra `removenode` for the failing node. This triggers the `removenode` pod operation by setting the appropriate labels on a Cassandra pod.
- Once the node is properly removed, free the link between the pod and the failing host by removing the associated PersistentVolumeClaim. This allows Kubernetes to reschedule the pod on another free host.
- Once the node is back in the cluster, apply a `cleanup` on all nodes.

You can pause the cleanup and check its status with:
### Replace node with a new one

In some cases it may be preferable to replace the node with a new one. Because we use a statefulset to deploy the Cassandra pods, by definition all pods are identical, and we can't execute specific actions on a specific node at startup.

For that, CassKop provides the ability to execute a `pre_run.sh` script, which can be changed through the CRD's ConfigMap. To see how to use the ConfigMap, see Overriding Configuration using configMap.
For example, if we want to replace the node cassandra-test-dc1-rack2-1, we first need to retrieve its IP address, for example from nodetool status:
Then we can edit the ConfigMap to change the pre_run.sh script.
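A hedged sketch of what the edited pre_run.sh might contain (the IP address is illustrative; the hostname test ensures only the replacing pod exports CASSANDRA_REPLACE_NODE):

```bash
#!/bin/bash
# Only the pod replacing the dead node must set CASSANDRA_REPLACE_NODE,
# so we match on the pod's hostname before exporting the variable.
if [ "$(hostname)" = "cassandra-test-dc1-rack2-1" ]; then
  export CASSANDRA_REPLACE_NODE=10.233.93.174   # illustrative IP of the dead node
fi
```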
So the operation will be:

- Edit the ConfigMap with the appropriate CASSANDRA_REPLACE_NODE IP for the targeted pod name.
- Delete the PVC data-cassandra-test-dc1-rack2-1.
- The pod will boot and execute the pre_run.sh script prior to /run.sh.
- The new pod replaces the dead one by re-syncing the content, which can take some time depending on the data size.
- Don't forget to edit the ConfigMap again and remove the specific line with the replace-node instructions.