[bitnami/kafka] Kafka controllers freezing periodically #31100

Open
igloo12 opened this issue Dec 19, 2024 · 4 comments
Labels: kafka, stale (15 days without activity), tech-issues (The user has a technical issue about an application), triage (Triage is needed)

Comments


igloo12 commented Dec 19, 2024

Name and Version

bitnami/kafka:30.1.0

What architecture are you using?

None

What steps will reproduce the bug?

Running the helm charts with given values.

Are you using any custom parameters or values?

listeners:
  client:
    protocol: SASL_PLAINTEXT
  controller:
    protocol: SASL_PLAINTEXT
  interbroker:
    protocol: SASL_PLAINTEXT
  external:
    containerPort: 9095
    protocol: SASL_SSL
    name: EXTERNAL
    sslClientAuth: required
externalAccess:
  enabled: true
  controller:
    service:
      type: ClusterIP
      domain: "somedomain.com"
      ports:
        external: 9095
      annotations:
        service.beta.kubernetes.io/oci-load-balancer-internal: "true"
        oci.oraclecloud.com/load-balancer-type: "lb"
        service.beta.kubernetes.io/oci-load-balancer-subnet1: "xxx"
controller:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 10
    timeoutSeconds: 10
    failureThreshold: 3
    periodSeconds: 10
    successThreshold: 1
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    failureThreshold: 6
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
tls:
  existingSecret: kafka-jks-0
  password: xxx
  keystorePassword: xxx
  truststorePassword: xxx
  endpointIdentificationAlgorithm:
extraConfig: |
  allow.everyone.if.no.acl.found=false
  authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
  super.users=User:admin;User:controller_user;User:inter_broker_user
image:
  debug: true
provisioning:
  enabled: true
  resources:
    requests:
      cpu: 2
      memory: 512Mi
    limits:
      cpu: 3
      memory: 1024Mi
  extraProvisioningCommands:
    - echo "Allow user to consume from any topic"
    - "/opt/bitnami/kafka/bin/kafka-acls.sh --bootstrap-server $KAFKA_SERVICE --command-config $CLIENT_CONF --add --allow-principal User:auser --consumer --topic fusion_ --resource-pattern-type prefixed"
    - "/opt/bitnami/kafka/bin/kafka-acls.sh
            --bootstrap-server $KAFKA_SERVICE
            --command-config $CLIENT_CONF
            --list"

What is the expected behavior?

For the controllers to run without restarting

What do you see instead?

The controller will crash after a while and freeze Kafka until it is forced to restart. I can't exec into it, and it uses a lot of CPU power.

$ kubectl top pods -n kafka
NAME                 CPU(cores)   MEMORY(bytes)   
kafka-controller-0   30m          632Mi           
kafka-controller-1   24m          555Mi           
kafka-controller-2   750m         765Mi  
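As a rough illustration of how uneven the load is, the `kubectl top` output above can be parsed to flag the pod that stands out. This is only a sketch: the helper names and the 5x-median threshold are assumptions made for illustration, not part of the chart or of kubectl.

```python
from statistics import median

# Output pasted from the report above (kubectl top pods -n kafka).
TOP_OUTPUT = """\
NAME                 CPU(cores)   MEMORY(bytes)
kafka-controller-0   30m          632Mi
kafka-controller-1   24m          555Mi
kafka-controller-2   750m         765Mi
"""


def parse_cpu_millicores(value: str) -> int:
    """Convert a kubectl CPU value such as '750m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def cpu_by_pod(top_output: str) -> dict[str, int]:
    """Map each pod name to its CPU usage in millicores."""
    usage = {}
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, _memory = line.split()
        usage[name] = parse_cpu_millicores(cpu)
    return usage


def cpu_outliers(usage: dict[str, int], factor: float = 5.0) -> list[str]:
    """Pods using more than `factor` times the median CPU (hypothetical threshold)."""
    baseline = median(usage.values())
    return [pod for pod, cpu in usage.items() if cpu > factor * baseline]


if __name__ == "__main__":
    print(cpu_outliers(cpu_by_pod(TOP_OUTPUT)))
```

With the numbers above, only kafka-controller-2 exceeds the threshold (750m against a median of 30m).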

These are the last log messages before the freeze.

[2024-12-19 01:56:55,916] INFO [GroupCoordinator 2]: Dynamic member with unknown member id joins group myapp in PreparingRebalance state. Created a new member id consumer-myapp-3-5046ef33-84e8-43af-b159-3c03bb143ecb and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-12-19 01:56:57,694] INFO [GroupCoordinator 2]: Stabilized group myapp generation 106 (__consumer_offsets-24) with 21 members (kafka.coordinator.group.GroupCoordinator)
[2024-12-19 01:56:57,697] INFO [GroupCoordinator 2]: Assignment received from leader consumer-myapp-2-8abd47e6-efc2-4c61-883a-e7f3f6681e40 for group myapp for generation 106. The group has 21 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-12-19 02:36:14,987] INFO [BrokerLifecycleManager id=2] Unable to send a heartbeat because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
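The timestamps above show a long silent stretch before the heartbeat timeout. A minimal sketch for measuring such gaps in a controller log follows; it assumes the `[YYYY-MM-DD HH:MM:SS,mmm]` prefix seen in these lines, the function names are hypothetical, and the messages are shortened to their timestamps plus a few words.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp prefix used in the log lines above.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")

# Timestamps copied from the report; message text abbreviated.
LOG_LINES = [
    "[2024-12-19 01:56:55,916] INFO [GroupCoordinator 2]: Dynamic member ...",
    "[2024-12-19 01:56:57,694] INFO [GroupCoordinator 2]: Stabilized group ...",
    "[2024-12-19 01:56:57,697] INFO [GroupCoordinator 2]: Assignment received ...",
    "[2024-12-19 02:36:14,987] INFO [BrokerLifecycleManager id=2] Unable to send ...",
]


def log_times(lines):
    """Extract a datetime from every line that starts with a timestamp."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def largest_gap(times):
    """Return (gap_seconds, start, end) for the widest gap between consecutive entries."""
    start, end = max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])
    return (end - start).total_seconds(), start, end


if __name__ == "__main__":
    gap, start, end = largest_gap(log_times(LOG_LINES))
    print(f"{gap:.2f} s of silence between {start} and {end}")
```

For the four entries above, the widest gap is about 2357 seconds (roughly 39 minutes) between 01:56:57 and 02:36:14, during which the controller logged nothing.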
igloo12 added the tech-issues label Dec 19, 2024
igloo12 changed the title from "Kafka controller freezing" to "Kafka controller freezing periodically" Dec 19, 2024
igloo12 changed the title from "Kafka controller freezing periodically" to "Kafka controllers freezing periodically" Dec 19, 2024
github-actions bot added the triage label Dec 19, 2024
igloo12 changed the title from "Kafka controllers freezing periodically" to "[bitnami/kafka] Kafka controllers freezing periodically" Dec 19, 2024

igloo12 commented Dec 19, 2024

It happened again, and I noticed that the frozen pod had high disk reads while the healthy pods didn't.
[Screenshot: 2024-12-19 at 8:04:29 AM]
[Screenshot: 2024-12-19 at 8:03:43 AM]


igloo12 commented Dec 20, 2024

I changed the liveness probe to match the readiness probe. The containers are still dying, but they are recovering faster.

  customLivenessProbe:
    failureThreshold: 6
    initialDelaySeconds: 5
    periodSeconds: 10
    successThreshold: 1
    tcpSocket:
      port: controller
    timeoutSeconds: 10
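To compare the two liveness settings, here is a small sketch of the usual approximation that a container is restarted after about `failureThreshold` x `periodSeconds` of consecutive probe failures. It ignores `timeoutSeconds` and kubelet scheduling jitter, and the `Probe` class is a hypothetical helper, not a Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Subset of Kubernetes probe fields used in this thread."""
    initial_delay_seconds: int
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def failure_window_seconds(self) -> int:
        """Approximate seconds of consecutive failures before the kubelet acts.

        Simplification: failureThreshold * periodSeconds, ignoring timeouts.
        """
        return self.failure_threshold * self.period_seconds


# Values from controller.livenessProbe in the original report.
original_liveness = Probe(initial_delay_seconds=10, period_seconds=10,
                          timeout_seconds=10, failure_threshold=3)

# Values from the customLivenessProbe above.
custom_liveness = Probe(initial_delay_seconds=5, period_seconds=10,
                        timeout_seconds=10, failure_threshold=6)

if __name__ == "__main__":
    print("original:", original_liveness.failure_window_seconds(), "s")
    print("custom:  ", custom_liveness.failure_window_seconds(), "s")
```

Under this approximation the original settings give a window of about 30 seconds and the custom probe about 60 seconds, so the thresholds alone would not explain faster recovery. The switch to a `tcpSocket` check may be what matters, but the thread does not confirm that.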

carrodher (Member) commented

Hi, the issue may not be directly related to the Bitnami container image/Helm chart, but rather to how the application is being utilized, configured in your specific environment, or tied to a particular scenario that is not easy to reproduce on our side.

If you think that's not the case and want to contribute a solution, we'd like to invite you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Please feel free to contact us if you have any questions or need assistance.

If you have any questions about the application, customizing its content, or technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.


github-actions bot commented Jan 8, 2025

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions bot added the stale label Jan 8, 2025