Kafka Broker Pod Failure

Introduction

  • It causes (forced/graceful) pod failure of specific/random Kafka broker pods
  • It tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster
  • It tests unbroken message stream when KAFKA_LIVENESS_STREAM experiment environment variable is set to enabled

Scenario: Deletes kafka broker pod

Kafka Broker Pod Delete

Uses

View the uses of the experiment

coming soon

Prerequisites

Verify the prerequisites
  • Ensure that Kubernetes Version > 1.16
  • Ensure that the Litmus Chaos Operator is running by executing kubectl get pods in the operator namespace (typically, litmus). If not, install from here
  • Ensure that the kafka-broker-pod-failure experiment resource is available in the cluster by executing kubectl get chaosexperiments in the desired namespace. If not, install from here
  • Ensure that Kafka & Zookeeper are deployed as Statefulsets
  • If Confluent/Kudo Operators have been used to deploy Kafka, note the instance name, which will be used as the value of KAFKA_INSTANCE_NAME experiment environment variable
    • In case of Confluent, specified by the --name flag
    • In case of Kudo, specified by the --instance flag
    Zookeeper uses this instance name to construct the path in which Kafka cluster data is stored.
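
The kubectl checks from the list above can be bundled into one helper. This is a minimal sketch, not part of Litmus: the function name check_litmus_prereqs is hypothetical, and it assumes that the experiment resource and the Kafka StatefulSets live in the same namespace (run it again with another namespace if Zookeeper is deployed elsewhere).

```shell
# Sketch only: wraps the kubectl checks from the prerequisites list.
# check_litmus_prereqs is a hypothetical helper, not part of Litmus.
check_litmus_prereqs() {
  operator_ns="${1:-litmus}"   # assumption: namespace of the chaos operator
  app_ns="${2:-kafka}"         # assumption: namespace of experiment + Kafka

  # The chaos operator pod should be listed here.
  kubectl get pods -n "$operator_ns" || return 1

  # The experiment resource must exist in the namespace where chaos runs.
  kubectl get chaosexperiments kafka-broker-pod-failure -n "$app_ns" || return 1

  # Kafka and Zookeeper must be deployed as StatefulSets.
  kubectl get statefulsets -n "$app_ns" || return 1
}
```

Calling check_litmus_prereqs litmus kafka returns non-zero as soon as one of the lookups fails; the pod and StatefulSet listings still need a visual check that everything is Running and Ready.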

Default Validations

View the default validations
  • Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
  • Kafka Message stream (if enabled) is unbroken

Minimal RBAC configuration example (optional)

NOTE

If you run this experiment as part of a Litmus workflow constructed and executed from chaos-center, you may already be using the litmus-admin RBAC, which is pre-installed in the cluster as part of the agent setup.

View the Minimal RBAC permissions

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-pod-failure-sa
  namespace: default
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  # Create and monitor the experiment & helper pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create","delete","get","list","patch","update", "deletecollection"]
  # Performs CRUD operations on the events inside chaosengine and chaosresult
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create","get","list","patch","update"]
  # Fetch configmaps & secrets details and mount it to the experiment pod (if specified)
  - apiGroups: [""]
    resources: ["secrets","configmaps"]
    verbs: ["get","list"]
  # Track and fetch the logs of the runner, experiment, and helper pods
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get","list","watch"]
  # for executing commands inside the target container
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get","list","create"]
  # for deriving the parent/owner details of the pod
  - apiGroups: ["apps"]
    resources: ["deployments","statefulsets"]
    verbs: ["list","get"]
  # for configuring and monitoring the experiment job by the chaos-runner pod
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create","list","get","delete","deletecollection"]
  # for creation, status polling and deletion of litmus chaos resources used within a chaos workflow
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines","chaosexperiments","chaosresults"]
    verbs: ["create","list","get","patch","update","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-pod-failure-sa
subjects:
- kind: ServiceAccount
  name: kafka-broker-pod-failure-sa
  namespace: default

Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
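
After saving the manifest to a file, it can be applied and spot-checked from the command line. The sketch below is illustrative only: apply_and_verify_rbac and the manifest path are hypothetical, and the --as impersonation used by kubectl auth can-i requires that your own user is allowed to impersonate service accounts (cluster admins usually are).

```shell
# Sketch only: apply the RBAC manifest and spot-check a few permissions.
# apply_and_verify_rbac and the manifest path are hypothetical.
apply_and_verify_rbac() {
  manifest="$1"        # local file holding the RBAC manifest
  ns="${2:-default}"   # namespace of the ServiceAccount in the manifest
  sa="system:serviceaccount:${ns}:kafka-broker-pod-failure-sa"

  kubectl apply -f "$manifest" || return 1

  # "kubectl auth can-i" exits non-zero when the answer is "no".
  kubectl auth can-i delete pods --as="$sa" -n "$ns" &&
    kubectl auth can-i create jobs.batch --as="$sa" -n "$ns" &&
    kubectl auth can-i create chaosresults.litmuschaos.io --as="$sa" -n "$ns"
}
```

For example, apply_and_verify_rbac rbac.yaml default succeeds only if the manifest applies cleanly and all three sampled permissions are granted.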

Experiment tunables

Check the experiment tunables

Mandatory Fields

| Variables | Description | Notes |
|---|---|---|
| KAFKA_NAMESPACE | Namespace of Kafka Brokers | May be same as value for spec.appinfo.appns |
| KAFKA_LABEL | Unique label of Kafka Brokers | May be same as value for spec.appinfo.applabel |
| KAFKA_SERVICE | Headless service of the Kafka Statefulset | |
| KAFKA_PORT | Port of the Kafka ClusterIP service | |
| ZOOKEEPER_NAMESPACE | Namespace of Zookeeper Cluster | May be same as value for KAFKA_NAMESPACE or other |
| ZOOKEEPER_LABEL | Unique label of Zookeeper statefulset | |
| ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | |
| ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | |

Optional Fields

| Variables | Description | Notes |
|---|---|---|
| KAFKA_BROKER | Kafka broker pod (name) to be deleted | A target selection mode (random/liveness-based/specific) |
| KAFKA_KIND | Kafka deployment type | Same as spec.appinfo.appkind. Supported: statefulset |
| KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Supported: enabled, disabled |
| KAFKA_LIVENESS_IMAGE | Image used for liveness message stream | Set the liveness image as <registry_url>/<repository>:<image-tag> |
| KAFKA_REPLICATION_FACTOR | Number of partition replicas for liveness topic partition | Necessary if KAFKA_LIVENESS_STREAM is enabled. The replication factor should be less than or equal to the number of Kafka brokers |
| KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on zookeeper | Necessary if installation involves use of such path |
| KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Defaults to 30000 ms. Recommended timeout for EKS platform: 60000 ms |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Defaults to 15s |
| CHAOS_INTERVAL | Time interval between two successive broker failures (sec) | Defaults to 5s |
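
The constraint on KAFKA_REPLICATION_FACTOR (it should not exceed the number of brokers) can be checked before the experiment starts. The following is a rough sketch under stated assumptions: check_replication_factor is a hypothetical helper, and it takes the broker count from .spec.replicas of the Kafka StatefulSet, whose name you must supply.

```shell
# Sketch only: compare KAFKA_REPLICATION_FACTOR with the broker count.
# check_replication_factor is hypothetical; the broker count is read from
# the StatefulSet's .spec.replicas.
check_replication_factor() {
  factor="$1"   # intended KAFKA_REPLICATION_FACTOR
  ns="$2"       # KAFKA_NAMESPACE
  sts="$3"      # name of the Kafka broker StatefulSet (assumption)

  brokers="$(kubectl get statefulset "$sts" -n "$ns" -o jsonpath='{.spec.replicas}')" || return 2

  if [ "$factor" -le "$brokers" ]; then
    echo "ok: replication factor $factor <= $brokers brokers"
  else
    echo "invalid: replication factor $factor > $brokers brokers" >&2
    return 1
  fi
}
```

The helper returns 0 when the factor fits, 1 when it is too large, and 2 when the StatefulSet lookup itself fails.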

Experiment Examples

Common Experiment Tunables

Refer to the common attributes to tune the tunables that are common to all experiments.

Kafka And Zookeeper App Details

It contains kafka and zookeeper application details:

  • KAFKA_NAMESPACE: Namespace where kafka is installed
  • KAFKA_LABEL: Labels of the kafka application
  • KAFKA_SERVICE: Name of the kafka service
  • KAFKA_PORT: Port of the kafka service
  • ZOOKEEPER_NAMESPACE: Namespace where zookeeper is installed
  • ZOOKEEPER_LABEL: Labels of the zookeeper application
  • ZOOKEEPER_SERVICE: Name of the zookeeper service
  • ZOOKEEPER_PORT: Port of the zookeeper service
  • KAFKA_BROKER: Name of the kafka broker pod
  • KAFKA_REPLICATION_FACTOR: Replication factor of the kafka application

Use the following example to tune this:

## details of the kafka and zookeeper
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "kafka"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
        # namespace where kafka installed
        - name: KAFKA_NAMESPACE
          value: 'kafka'
        # labels of the kafka
        - name: KAFKA_LABEL
          value: 'app=cp-kafka'
        # name of the kafka service
        - name: KAFKA_SERVICE
          value: 'kafka-cp-kafka-headless'
        # kafka port number
        - name: KAFKA_PORT
          value: '9092'
        # namespace of the zookeeper
        - name: ZOOKEEPER_NAMESPACE
          value: 'default'
        # labels of the zookeeper
        - name: ZOOKEEPER_LABEL
          value: 'app=cp-zookeeper'
        # name of the zookeeper service
        - name: ZOOKEEPER_SERVICE
          value: 'kafka-cp-zookeeper-headless'
        # port of the zookeeper service
        - name: ZOOKEEPER_PORT
          value: '2181'
        # name of the kafka broker
        - name: KAFKA_BROKER
          value: 'kafka-0'
        # kafka replication factor
        - name: KAFKA_REPLICATION_FACTOR
          value: '3'
        # duration of the chaos
        - name: TOTAL_CHAOS_DURATION
          value: '60'
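
Once an engine like the one above has been applied and the experiment has finished, its outcome can be read from the ChaosResult resource. This sketch rests on two assumptions that you should confirm against your Litmus version: that the ChaosResult is named <engine-name>-<experiment-name>, and that the verdict is exposed at .status.experimentStatus.verdict. The helper name chaos_verdict is hypothetical.

```shell
# Sketch only: read the verdict of a finished experiment.
# chaos_verdict is hypothetical. It assumes the ChaosResult is named
# <engine-name>-<experiment-name> and exposes .status.experimentStatus.verdict.
chaos_verdict() {
  engine="$1"       # metadata.name of the ChaosEngine
  experiment="$2"   # experiment name from the engine spec
  ns="$3"           # namespace where the engine was applied

  kubectl get chaosresult "${engine}-${experiment}" -n "$ns" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}
```

With the example engine, the call would be chaos_verdict engine-nginx kafka-broker-pod-failure kafka.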

Liveness check of kafka

  • Kafka liveness can be tuned with the KAFKA_LIVENESS_STREAM env. Set it to enable to run the liveness check, or to disable to skip it. The default value is disable.
  • The Kafka liveness image can be provided with KAFKA_LIVENESS_IMAGE.
  • The Kafka liveness pod contains a producer and a consumer that validate the message stream during the chaos. The timeout for the consumer can be tuned with KAFKA_CONSUMER_TIMEOUT.

Use the following example to tune this:

## checks the kafka message liveness while injecting chaos
## sets the consumer timeout
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "kafka"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
        # check for the kafka liveness message stream during chaos
        # supports: enable, disable. default value: disable
        - name: KAFKA_LIVENESS_STREAM
          value: 'enable'
        # timeout of the kafka consumer
        - name: KAFKA_CONSUMER_TIMEOUT
          value: '30000' # in ms
        # image of the kafka liveness pod
        - name: KAFKA_LIVENESS_IMAGE
          value: ''
        - name: KAFKA_NAMESPACE
          value: 'kafka'
        - name: KAFKA_LABEL
          value: 'app=cp-kafka'
        - name: KAFKA_SERVICE
          value: 'kafka-cp-kafka-headless'
        - name: KAFKA_PORT
          value: '9092'
        - name: ZOOKEEPER_NAMESPACE
          value: 'default'
        - name: ZOOKEEPER_LABEL
          value: 'app=cp-zookeeper'
        - name: ZOOKEEPER_SERVICE
          value: 'kafka-cp-zookeeper-headless'
        - name: ZOOKEEPER_PORT
          value: '2181'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Multiple Iterations Of Chaos

Multiple iterations of chaos can be tuned by setting the CHAOS_INTERVAL env, which defines the delay between successive iterations of chaos.

Use the following example to tune this:

# defines delay between each successive iteration of the chaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "kafka"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
        # delay between each iteration of chaos
        - name: CHAOS_INTERVAL
          value: '15'
        # time duration for the chaos execution
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: KAFKA_NAMESPACE
          value: 'kafka'
        - name: KAFKA_LABEL
          value: 'app=cp-kafka'
        - name: KAFKA_SERVICE
          value: 'kafka-cp-kafka-headless'
        - name: KAFKA_PORT
          value: '9092'
        - name: ZOOKEEPER_NAMESPACE
          value: 'default'
        - name: ZOOKEEPER_LABEL
          value: 'app=cp-zookeeper'
        - name: ZOOKEEPER_SERVICE
          value: 'kafka-cp-zookeeper-headless'
        - name: ZOOKEEPER_PORT
          value: '2181'
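
To get a feel for how CHAOS_INTERVAL and TOTAL_CHAOS_DURATION interact, the number of broker failures in one run can be estimated with simple arithmetic. This is a sketch under assumptions, not the experiment's actual logic: it assumes one broker deletion per iteration and that a new iteration starts only while the elapsed time is below TOTAL_CHAOS_DURATION. The function name estimate_chaos_iterations is hypothetical.

```shell
# Sketch only: upper bound on the number of broker deletions in one run.
# Assumes one deletion per iteration and that an iteration lasts at least
# CHAOS_INTERVAL seconds; recovery checks make real runs do fewer.
estimate_chaos_iterations() {
  duration="$1"   # TOTAL_CHAOS_DURATION in seconds
  interval="$2"   # CHAOS_INTERVAL in seconds

  if [ "$interval" -le 0 ]; then
    echo "CHAOS_INTERVAL must be positive" >&2
    return 1
  fi

  # Ceiling division: a partial trailing interval still counts as an iteration.
  echo $(( (duration + interval - 1) / interval ))
}
```

With the values from the example above, estimate_chaos_iterations 60 15 prints 4. Treat the result as an upper bound, since the time spent waiting for the broker to recover also counts toward the total duration.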