Kafka Broker Pod Failure
Introduction¶
- It causes (forced/graceful) pod failure of specific/random Kafka broker pods
- It tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster
- It tests that the message stream remains unbroken when the KAFKA_LIVENESS_STREAM experiment environment variable is set to enabled
Scenario: Deletes kafka broker pod
Uses¶
View the uses of the experiment
coming soon
Prerequisites¶
Verify the prerequisites
- Ensure that the Kubernetes version is > 1.16
- Ensure that the Litmus Chaos Operator is running by executing kubectl get pods in the operator namespace (typically, litmus). If not, install from here
- Ensure that the kafka-broker-pod-failure experiment resource is available in the cluster by executing kubectl get chaosexperiments in the desired namespace. If not, install from here
- Ensure that Kafka & Zookeeper are deployed as Statefulsets
- If Confluent/Kudo Operators have been used to deploy Kafka, note the instance name, which will be used as the value of the KAFKA_INSTANCE_NAME experiment environment variable. Zookeeper uses this to construct a path in which kafka cluster data is stored.
  - In case of Confluent, specified by the --name flag
  - In case of Kudo, specified by the --instance flag
Default Validations¶
View the default validations
- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
- Kafka Message stream (if enabled) is unbroken
Minimal RBAC configuration example (optional)¶
NOTE
If you are using this experiment as part of a litmus workflow constructed, scheduled & executed from chaos-center, then you may be making use of the litmus-admin RBAC, which is pre-installed in the cluster as part of the agent setup.
View the Minimal RBAC permissions
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-pod-failure-sa
  namespace: default
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  # Create and monitor the experiment & helper pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create","delete","get","list","patch","update","deletecollection"]
  # Performs CRUD operations on the events inside chaosengine and chaosresult
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create","get","list","patch","update"]
  # Fetch configmaps & secrets details and mount them to the experiment pod (if specified)
  - apiGroups: [""]
    resources: ["secrets","configmaps"]
    verbs: ["get","list"]
  # Track and get the runner, experiment, and helper pods log
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get","list","watch"]
  # for creating and managing the execution of commands inside the target container
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get","list","create"]
  # for deriving the parent/owner details of the pod
  - apiGroups: ["apps"]
    resources: ["deployments","statefulsets"]
    verbs: ["list","get"]
  # for configuring and monitoring the experiment job by the chaos-runner pod
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create","list","get","delete","deletecollection"]
  # for creation, status polling and deletion of litmus chaos resources used within a chaos workflow
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines","chaosexperiments","chaosresults"]
    verbs: ["create","list","get","patch","update","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-pod-failure-sa
subjects:
- kind: ServiceAccount
  name: kafka-broker-pod-failure-sa
  namespace: default
Experiment tunables¶
check the experiment tunables
Mandatory Fields
Variables | Description | Notes |
---|---|---|
KAFKA_NAMESPACE | Namespace of Kafka Brokers | May be same as value for spec.appinfo.appns |
KAFKA_LABEL | Unique label of Kafka Brokers | May be same as value for spec.appinfo.applabel |
KAFKA_SERVICE | Headless service of the Kafka Statefulset | |
KAFKA_PORT | Port of the Kafka ClusterIP service | |
ZOOKEEPER_NAMESPACE | Namespace of Zookeeper Cluster | May be same as value for KAFKA_NAMESPACE or other |
ZOOKEEPER_LABEL | Unique label of Zookeeper statefulset | |
ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | |
ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | |
Optional Fields
Variables | Description | Notes |
---|---|---|
KAFKA_BROKER | Kafka broker pod (name) to be deleted | A target selection mode (random/liveness-based/specific) |
KAFKA_KIND | Kafka deployment type | Same as spec.appinfo.appkind . Supported: statefulset |
KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Supported: enabled , disabled |
KAFKA_LIVENESS_IMAGE | Image used for liveness message stream | Set the liveness image as <registry_url>/<repository>:<image-tag> |
KAFKA_REPLICATION_FACTOR | Number of partition replicas for liveness topic partition | Necessary if KAFKA_LIVENESS_STREAM is enabled . The replication factor should be less than or equal to number of Kafka brokers |
KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on zookeeper | Necessary if installation involves use of such path |
KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Defaults to 30000ms, Recommended timeout for EKS platform: 60000 ms |
TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Defaults to 15s |
CHAOS_INTERVAL | Time interval b/w two successive broker failures (sec) | Defaults to 5s |
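As an illustration of how the optional fields sit next to the mandatory ones, the fragment below is a minimal sketch of the env list for a Kafka cluster deployed through an operator. The instance name my-kafka is a hypothetical placeholder for whatever was passed via --name (Confluent) or --instance (Kudo), and the replication factor of 3 assumes a cluster with at least three brokers. The fragment belongs under spec.experiments[].spec.components of the ChaosEngine.

```yaml
env:
  # hypothetical instance name; use the value given at install time
  - name: KAFKA_INSTANCE_NAME
    value: 'my-kafka'
  # only statefulset is listed as supported
  - name: KAFKA_KIND
    value: 'statefulset'
  # assumed cluster size of >= 3 brokers; must not exceed the broker count
  - name: KAFKA_REPLICATION_FACTOR
    value: '3'
```

The mandatory KAFKA_* and ZOOKEEPER_* entries still have to be supplied alongside these.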
Experiment Examples¶
Common Experiment Tunables¶
Refer to the common attributes to tune the tunables shared by all the experiments.
Kafka And Zookeeper App Details¶
It contains kafka and zookeeper application details:
- KAFKA_NAMESPACE: Namespace where kafka is installed
- KAFKA_LABEL: Labels of the kafka application
- KAFKA_SERVICE: Name of the kafka service
- KAFKA_PORT: Port of the kafka service
- ZOOKEEPER_NAMESPACE: Namespace where zookeeper is installed
- ZOOKEEPER_LABEL: Labels of the zookeeper application
- ZOOKEEPER_SERVICE: Name of the zookeeper service
- ZOOKEEPER_PORT: Port of the zookeeper service
- KAFKA_BROKER: Name of the kafka broker pod
- KAFKA_REPLICATION_FACTOR: Replication factor of the kafka application
Use the following example to tune this:
## details of the kafka and zookeeper
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "kafka"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
        # namespace where kafka is installed
        - name: KAFKA_NAMESPACE
          value: 'kafka'
        # labels of the kafka
        - name: KAFKA_LABEL
          value: 'app=cp-kafka'
        # name of the kafka service
        - name: KAFKA_SERVICE
          value: 'kafka-cp-kafka-headless'
        # kafka port number
        - name: KAFKA_PORT
          value: '9092'
        # namespace of the zookeeper
        - name: ZOOKEEPER_NAMESPACE
          value: 'default'
        # labels of the zookeeper
        - name: ZOOKEEPER_LABEL
          value: 'app=cp-zookeeper'
        # name of the zookeeper service
        - name: ZOOKEEPER_SERVICE
          value: 'kafka-cp-zookeeper-headless'
        # port of the zookeeper service
        - name: ZOOKEEPER_PORT
          value: '2181'
        # name of the kafka broker
        - name: KAFKA_BROKER
          value: 'kafka-0'
        # kafka replication factor
        - name: KAFKA_REPLICATION_FACTOR
          value: '3'
        # duration of the chaos
        - name: TOTAL_CHAOS_DURATION
          value: '60'
Liveness check of kafka¶
- The kafka liveness can be tuned with the KAFKA_LIVENESS_STREAM env. Provide KAFKA_LIVENESS_STREAM as enable to enable the liveness check and provide KAFKA_LIVENESS_STREAM as disable to skip the liveness check. The default value is disable.
- The Kafka liveness image can be provided at KAFKA_LIVENESS_IMAGE.
- The kafka liveness pod contains a producer and a consumer to validate the message stream during the chaos. The timeout for the consumer can be tuned with KAFKA_CONSUMER_TIMEOUT.
Use the following example to tune this:
## checks the kafka message liveness while injecting chaos
## sets the consumer timeout
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "kafka"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
        # check for the kafka liveness message stream during chaos
        # supports: enable, disable. default value: disable
        - name: KAFKA_LIVENESS_STREAM
          value: 'enable'
        # timeout of the kafka consumer
        - name: KAFKA_CONSUMER_TIMEOUT
          value: '30000' # in ms
        # image of the kafka liveness pod
        - name: KAFKA_LIVENESS_IMAGE
          value: ''
        - name: KAFKA_NAMESPACE
          value: 'kafka'
        - name: KAFKA_LABEL
          value: 'app=cp-kafka'
        - name: KAFKA_SERVICE
          value: 'kafka-cp-kafka-headless'
        - name: KAFKA_PORT
          value: '9092'
        - name: ZOOKEEPER_NAMESPACE
          value: 'default'
        - name: ZOOKEEPER_LABEL
          value: 'app=cp-zookeeper'
        - name: ZOOKEEPER_SERVICE
          value: 'kafka-cp-zookeeper-headless'
        - name: ZOOKEEPER_PORT
          value: '2181'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
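The tunables table recommends a longer consumer timeout on the EKS platform. As a minimal sketch, these are the only env entries that would differ from the example above on EKS, assuming the remaining Kafka and Zookeeper entries stay unchanged:

```yaml
env:
  # liveness stream switched on, using the same value as the example above
  - name: KAFKA_LIVENESS_STREAM
    value: 'enable'
  # 60000 ms is the timeout the tunables table recommends for EKS
  - name: KAFKA_CONSUMER_TIMEOUT
    value: '60000' # in ms
```

A longer timeout gives the consumer more time before it terminates, which may help where brokers are slower to recover.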
Multiple Iterations Of Chaos¶
Multiple iterations of chaos can be tuned by setting the CHAOS_INTERVAL ENV, which defines the delay between each iteration of chaos.
Use the following example to tune this:
# defines delay between each successive iteration of the chaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "kafka"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
        # delay between each iteration of chaos
        - name: CHAOS_INTERVAL
          value: '15'
        # time duration for the chaos execution
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: KAFKA_NAMESPACE
          value: 'kafka'
        - name: KAFKA_LABEL
          value: 'app=cp-kafka'
        - name: KAFKA_SERVICE
          value: 'kafka-cp-kafka-headless'
        - name: KAFKA_PORT
          value: '9092'
        - name: ZOOKEEPER_NAMESPACE
          value: 'default'
        - name: ZOOKEEPER_LABEL
          value: 'app=cp-zookeeper'
        - name: ZOOKEEPER_SERVICE
          value: 'kafka-cp-zookeeper-headless'
        - name: ZOOKEEPER_PORT
          value: '2181'