High availability (HA) is an important aspect of any production deployment. In Kubernetes, HA is achieved by running multiple nodes for both the workers and the masters, so that when a node fails its workload can be rescheduled onto the remaining nodes.
For an AMCOP deployment on Kubernetes, HA means that all services remain reachable when a node fails. To validate this, we deployed AMCOP on a multi-node cluster and simulated graceful node shutdowns. While nodes were going down, we ran continuous tests against various services to confirm they stayed available, including:
- Cluster automation: A script continuously adds clusters to AMCOP and removes them again. This verifies the availability of all cluster management services in the EMCO module of AMCOP.
- CDS Workflow: This continuously tests a CBA (Controller Blueprint Archive) in a loop, which confirms that the CDS, Camunda, and MariaDB pods are live and responding.
- SMO: Similar tests continuously run basic configuration operations on CU/DU simulators.
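The continuous tests above share one pattern: call a service repeatedly while nodes are shut down and count the failures. A minimal Python sketch of that pattern follows. It is not the test suite used for this validation, and the function names, endpoint names, and URLs are hypothetical placeholders.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code < 500
    except OSError:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return False


def run_probes(
    endpoints: Dict[str, str],
    rounds: int,
    interval: float = 1.0,
    check: Callable[[str], bool] = http_ok,
) -> Dict[str, int]:
    """Probe every endpoint `rounds` times and count failures per endpoint."""
    failures = {name: 0 for name in endpoints}
    for i in range(rounds):
        for name, url in endpoints.items():
            if not check(url):
                failures[name] += 1
        if i < rounds - 1:
            time.sleep(interval)
    return failures
```

For example, `run_probes({"emco": "http://amcop.example/emco"}, rounds=600)` would probe a hypothetical EMCO URL once per second for about ten minutes while a node is drained. A non-zero count for any endpoint indicates that the service was unreachable at some point during the shutdown.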
To achieve HA, we recommend the following configuration:
- Deploy multiple worker nodes: This ensures that workload can be distributed across multiple nodes and avoids a single point of failure.
- Deploy multiple master nodes: This ensures that if a master node fails, the remaining master nodes keep the control plane running.
- Use a load balancer: This ensures that requests are distributed evenly across all nodes, preventing any one node from becoming overwhelmed.
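How many master nodes count as "multiple" depends on quorum. Assuming the control plane keeps its state in etcd, which needs a majority of its members to be up, the number of master failures a cluster can survive can be estimated as in the sketch below. The function name is illustrative, and the calculation assumes one etcd member per master node.

```python
def tolerated_master_failures(masters: int) -> int:
    """Return how many masters can fail while an etcd majority survives.

    Assumes one etcd member per master node and majority quorum.
    """
    if masters < 1:
        raise ValueError("a cluster needs at least one master node")
    quorum = masters // 2 + 1  # smallest majority of the members
    return masters - quorum


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, "masters tolerate", tolerated_master_failures(n), "failure(s)")
```

Under these assumptions, two masters tolerate no more failures than one, and four tolerate no more than three, which is why odd master counts such as three or five are the usual choice.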
While Kubernetes has built-in resilience to handle node failures, there are certain cases where administrator intervention is needed, particularly for stateful applications and persistent volumes. In these cases, it's important to have a disaster recovery plan in place to minimize downtime and ensure data integrity.
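To find workloads that may need manual attention after a node failure, one option is to cross-reference nodes that are not Ready with pods that mount a PersistentVolumeClaim. The sketch below assumes its inputs are the parsed JSON output of `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json`. The function names are illustrative, and this is a triage aid rather than a recovery procedure.

```python
from typing import Any, Dict, List, Tuple


def not_ready_nodes(nodes: Dict[str, Any]) -> List[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    names = []
    for node in nodes.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names


def stateful_pods_on(
    down_nodes: List[str], pods: Dict[str, Any]
) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod, claim) for PVC-backed pods on the given nodes."""
    down = set(down_nodes)
    hits = []
    for pod in pods.get("items", []):
        spec = pod.get("spec", {})
        if spec.get("nodeName") not in down:
            continue
        meta = pod["metadata"]
        for volume in spec.get("volumes", []):
            claim = volume.get("persistentVolumeClaim")
            if claim:
                hits.append(
                    (meta.get("namespace", "default"), meta["name"], claim["claimName"])
                )
    return hits
```

Each tuple returned by `stateful_pods_on` names a pod and volume claim that an administrator may want to inspect before the pod is rescheduled elsewhere.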
In conclusion, an HA deployment on Kubernetes is crucial to keep services available and to minimize downtime when nodes fail. Continuous testing and monitoring help confirm that all services remain reachable, and a disaster recovery plan limits the impact of hardware failures. By following these practices, AMCOP deployments can achieve a high level of reliability and availability.