kafka unstable #1458
Comments
Hi @lbcd , Could you please format the code using Markdown syntax? This will improve readability and ensure proper rendering. Thanks! |
@lbcd We have the same error, the timeframe depends on the data amount I guess as we already see the error after some hours. |
Hello, I have the same error. Restarting the kafka-controller clears the error, but it comes back every day. |
At least for us, disabling the Kafka persistence somehow changed the error to this one (in various consumers):
|
I have the same issue, tried a fresh install but no success :( |
Regarding our error above I just found this one: getsentry/self-hosted#1894 |
I am trying to reproduce the error, but the error is NOT reproducible. That's what I'm doing:
send traces check logs:
my values.yaml
|
I could reproduce every time.
values = [
  yamlencode({
    system = {
      url = "https://sentry.xxx.com"
      adminEmail = "[email protected]"
      public = true
    }
    asHook : false
    relay = {
      processing = {
        kafkaConfig = {
          messageMaxBytes = 50000000
        }
      }
    }
    sentry = {
      web = {
        replicas = 1
      }
      cron = {
        replicas = 1
      }
      slack = {
        client-id = "xxx"
        existingSecret = "sentry"
      }
    }
    nginx = {
      enabled = false
    }
    ingress = {
      enabled = true
      ingressClassName = "nginx"
      annotations = {
        "cert-manager.io/issuer" = "letsencrypt"
        "cert-manager.io/private-key-size" = "4096"
        "cert-manager.io/private-key-rotation-policy" = "Always"
        # "kubernetes.io/ingress.class" = "nginx"
        "nginx.ingress.kubernetes.io/use-regex" = "true"
        "nginx.ingress.kubernetes.io/proxy-buffers-number" = "16"
        "nginx.ingress.kubernetes.io/proxy-buffer-size" = "32k"
      }
      hostname = "sentry.xxx.com"
      # additionalHostNames = ["sentry.xxx.com"]
      tls = [
        {
          hosts = ["sentry.xxxcom"]
          secretName = "ingress-tls"
        }
      ]
    }
    mail = {
      backend = "smtp"
      useTls = true
      existingSecret = "sentry"
      username = "xxx"
      port = 587
      host = "xxx"
      from = "xxx"
    }
    kafka = {
      controller = {
        replicaCount = 1
        podSecurityContext = {
          seccompProfile = {
            type = "Unconfined"
          }
        }
      }
    }
    metrics = {
      enabled = true
      resources = {
        requests = {
          cpu = "100m"
          memory = "128Mi"
        }
        limits = {
          cpu = "100m"
          memory = "128Mi"
        }
      }
    }
    filestore = {
      filesystem = {
        persistence = {
          accessMode = "ReadWriteMany"
          storageClass = "csi-nas"
          persistentWorkers = false
        }
      }
    }
    postgresql = {
      enabled = false
    }
    slack = {
      client-id = "xxx"
      existingSecret = "sentry"
    }
    github = {
      appId = "xxx"
      appName = "xxx"
      existingSecret = "sentry"
      existingSecretPrivateKeyKey = "gh-private-key"
      existingSecretWebhookSecretKey = "gh-webhook-secret"
      existingSecretClientIdKey = "gh-client-id"
      existingSecretClientSecretKey = "gh-client-secret"
    }
    externalPostgresql = {
      host = "xxx"
      port = 5432
      existingSecret = "sentry"
      existingSecretKeys = {
        username = "database-username"
        password = "database-password"
      }
      database = "sentry"
      sslMode = "require"
    }
    serviceAccount = {
      "linkerd.io/inject" = "enabled"
    }
  }),
]
Some more logs before the stacktrace:
|
@patsevanton for us this only happens after a few hours, and we have request volumes in the millions, so I'm not sure how many traces you sent in your test? |
I left exception generation running overnight,
but the problem was that Kafka didn't have enough disk space.
I think Kafka filled the entire disk because few consumers were reading data from Kafka. There are no "Offset out of range" errors.
How many errors do you have in total per hour? |
@patsevanton thank you for all the testing! At least we have disabled the persistence
Before that we had exactly the same Kafka error as you stated above (KafkaError{code=_UNKNOWN_PARTITION,val=-190,str="Failed to get watermark offsets: Local: Unknown partition"}). After setting the above one we get the error:
But as I said, I think this can be fixed by setting all noStrictOffsetReset to true. I could not try it yet, as our operations team has to roll out the change. After that I will let you know how many errors there are per hour, as well as whether the change helped to fix the problem. |
Disabling the persistence solved the issue for me. I am not sure if the problem was caused by |
I have reproduced the application error on an HDD StorageClass. The app has been sending errors to Sentry all day. Now it remains to understand what causes the error and how to fix it. So far I have 75K total errors in 24 hours. The logs often contain messages like:
and
This may indicate problems with network interaction or that the Kafka broker does not have time to process requests due to high load. I will try to test various sentry configuration options. |
@patsevanton thanks for your help! We tried to deploy the newest chart version with new config to test if But I remember the first time it happened we had an error in one of the applications that caused a lot of error messages so the number of errors was at least 400k / hour + transaction events |
@raphaelluchini @sdernbach-ionos how did you disable persistence? In which file? |
@lbcd within the values.yaml you have to configure this:
So setting persistence.enabled: false within controller disables the persistence. |
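The configuration block from this comment was not preserved. As a minimal sketch of the setting it describes, assuming the Kafka subchart is configured under the top-level `kafka` key (as in the values shown earlier in this thread) and that the key path is `kafka.controller.persistence.enabled`, it might look like this:

```yaml
# Sketch only: key path assumed from the comment above
# ("persistence.enabled: false within controller"); verify it against
# the values.yaml of the chart version you are running.
kafka:
  controller:
    persistence:
      enabled: false  # no PVC is requested for the Kafka controller pods
```

Without a persistent volume, Kafka data will most likely not survive a pod being rescheduled, so this trades durability for not running out of disk space.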
Take the following configuration as a basis for testing
write about your results and your values.yaml |
Hey, for me 2 things combined worked. I have some other issues with health checks, but it's not related to this issue. |
@patsevanton sadly we still could not test it. The Redis problem occurred because our account reached its RAM limits, so we first have to get more resources, and I was on vacation. I will come back as soon as we are able to test it. |
@patsevanton sorry it took quite some time to fix the other problems, but our Redis is stable now and has enough space. We tried your configuration with resourcePreset and replicationFactor, but left storage disabled for broker and controller so as not to run into space problems. I also enabled noStrictOffsetReset: true everywhere and tried autoOffsetReset: "earliest" and now autoOffsetReset: "latest" (as I think it is fine if we lose some events). Events and transactions have still been showing up for ~24h, but a lot of pods are throwing these errors: KafkaError{code=UNKNOWN_TOPIC_OR_PART,val=3,str="Subscribed topic not available: transactions-subscription-results: Broker: Unknown topic or partition"}
I just saw that we forgot the heapOpts that you mentioned, so I will also add this, try a fresh install, and see if it fixes it. But the error sounds somehow different. I just checked this: https://quix.io/blog/how-to-fix-unknown-partition-error-kafka
Each of the 3 Kafka controller pods reports the same when running:
And within this list there is no |
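For readers who want to try the offset-reset settings mentioned in the comment above, here is a hedged sketch. The option names `noStrictOffsetReset` and `autoOffsetReset` come from the comment itself; the consumer sections they are placed under (`snuba.consumer`, `sentry.ingestConsumerEvents`) are assumptions and should be checked against the chart's own values.yaml, since the commenter applied the setting "everywhere".

```yaml
# Sketch only: the section names below are assumed, not taken from this thread.
snuba:
  consumer:
    autoOffsetReset: "latest"   # the commenter accepts losing some events
    noStrictOffsetReset: true
sentry:
  ingestConsumerEvents:
    noStrictOffsetReset: true
```

With `"latest"`, a consumer whose stored offset is no longer valid resumes from the newest messages, so events in between are skipped; `"earliest"` replays whatever the broker still retains.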
@sdernbach-ionos can you share a minimal reproducible values.yaml? Can you share a minimal example or instructions for sending exceptions, using examples from the Internet, as a benchmark? |
Hello @sdernbach-ionos I put that in values.yaml:
and I run
Installation works. But I get this message:
So I think persistence is still enabled. And I still have a Sentry that works for 2 weeks and then starts to fail. |
@lbcd this is an error that I have always gotten in the past, but you can use e.g. Lens to check the PVCs and you will see that no PVC is created, so the config actually works and no storage is created. @patsevanton to be honest it is hard to see which part is influencing this. But I really do not get why topics are getting deleted. We did a fresh install today with heapOpts activated as well. After the install the topic list shows this:
After a few hours (3-6h) the pods start to complain about missing topics, and if I execute the list topics command they are no longer shown on any of the Kafka controllers either (and the Kafka controller has 0 restarts) ... so I really don't know why the topics are getting deleted |
@sdernbach-ionos can you share minimal reproducible values.yaml ? |
@patsevanton here is the full values.yaml (we are using external redis and external postgres, I just re-enabled the default ones again here). Everything that is different from default values.yaml is marked with # CHANGED
|
@sdernbach-ionos pay attention to these parameters
You also have an old version of the helm chart and 3 ClickHouse replicas |
@patsevanton right, we have not updated to the newest one, as we wanted to stick with Sentry 24.7.0 and check the issue before upgrading. Do you think it would help? Then we can also do the update. As for ClickHouse, we took that from an older installation by a colleague; as far as we know, he increased it to 3 because of our event volume, as it was not working with 1. Do you think we should also directly enable workerTransactions? There are not only a lot of issues incoming but also a lot of traffic on the websites. Besides workerEvents and the partitions in the Kafka configuration, it should match. But do you really think we have to enable persistence? We turned it off so as not to run into space problems if the volume is too small ... so I think it would make things worse if we turn it on again. |
@sdernbach-ionos I think we should test it. We won't know without testing. |
I still have the same issue since 15/09 (first message of this thread). I only use values.yaml:
The kafka part is ignored => coalesce.go:237: warning: skipped value for kafka.config: Not a table. |
Yup, same issue here, none of the fixes above worked for me. I tried downgrading, but I'm having a hard time running reverse migrations... |
I experienced the same issue and resolved it by increasing the replication factor to 3. The detailed explanation and fix are in this PR: sentry-kubernetes/charts#1606. I hope this helps others facing a similar problem! Additionally, you can keep persistence enabled (persistence: true) and optimize your Kafka configuration by following these recommendations. Make sure to adjust the values to suit your specific requirements.
|
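The recommended configuration from the comment above was not preserved, so the following is only a sketch of the one change it names, a replication factor of 3. The key paths (`kafka.provisioning.replicationFactor`, `kafka.controller.replicaCount`) are assumptions about the chart layout; the referenced PR and the chart's values.yaml are the authoritative source.

```yaml
# Sketch only: key names assumed, values taken from the comment above.
kafka:
  controller:
    replicaCount: 3        # a replication factor of 3 needs at least 3 brokers
  provisioning:
    replicationFactor: 3   # each provisioned topic partition is kept on 3 brokers
```

With one replica per partition, a single broker losing its data is enough for a topic partition to become unknown to consumers, which would be consistent with the `_UNKNOWN_PARTITION` errors reported in this thread.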
Thanks. |
Dears, a small comment: the above did not work for me. The error appeared only after about a day. I finally got it fixed by trial and error, so here is something to give back.
I updated the Kafka StatefulSet to give it some more breathing room
After restarting, it finally got back to normal. |
@hamstar3 could you edit your answer and format the code to be code, with indentation? It will save some time for us when testing. Thank you. |
This issue is stale because it has been open for 30 days with no activity. |
Issue submitter TODO list
Describe the bug (actual behavior)
After installation, it works for 7 or 8 days, and then no new events appear in Sentry. The Sentry website is accessible, but there are no new events.
When I check the container status, I see that some of them restart in a loop:
Each of them gets this error:
cimpl.KafkaException: KafkaError{code=_UNKNOWN_PARTITION,val=-190,str="Failed to get watermark offsets: Local: Unknown partition"}
and here is the last log of sentry-kafka-controller-2:
[2024-09-15 06:06:51,096] INFO [LogLoader partition=ingest-events-0, dir=/bitnami/kafka/data] Recovering unflushed segment 0. 0/2 recovered for ingest-events-0. (kafka.log.LogLoader)
[2024-09-15 06:06:51,096] INFO [LogLoader partition=ingest-events-0, dir=/bitnami/kafka/data] Loading producer state till offset 0 with message format version 2 (kafka.log.UnifiedLog$)
[2024-09-15 06:06:51,096] INFO [LogLoader partition=ingest-events-0, dir=/bitnami/kafka/data] Reloading from producer snapshot and rebuilding producer state from offset 0 (kafka.log.UnifiedLog$)
[2024-09-15 06:06:51,096] INFO [LogLoader partition=ingest-events-0, dir=/bitnami/kafka/data] Producer state recovery took 0ms for snapshot load and 0ms for segment recovery from offset 0 (kafka.log.UnifiedLog$)
[2024-09-15 06:06:53,012] INFO [BrokerLifecycleManager id=2] The broker is in RECOVERY. (kafka.server.BrokerLifecycleManager)
[2024-09-15 06:06:55,015] INFO [BrokerLifecycleManager id=2] The broker is in RECOVERY. (kafka.server.BrokerLifecycleManager)
[2024-09-15 06:06:57,019] INFO [BrokerLifecycleManager id=2] The broker is in RECOVERY. (kafka.server.BrokerLifecycleManager)
Expected behavior
I expected Sentry to continue receiving new events.
values.yaml
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/use-regex: "true"
  hostname:
  additionalHostNames:
    - sentry.xxx.com
  tls:
    - hosts:
        - sentry.xxx.com
      secretName: xxx-com-wild-card-2023-2024
mail:
  # For example: smtp
  backend: smtp
  useTls: true
  useSsl: false
  username: "[email protected]"
  password: "XXX"
  port: 587
  host: "outlook.office365.com"
  from: "[email protected]"
nginx:
  enabled: false
user:
  create: true
  email: [email protected]
  password: xxx
system:
  url: "https://sentry.xxx.com"
hooks:
  activeDeadlineSeconds: 2000
Helm chart version
24.5.1
Steps to reproduce
Install
Use it for 8 days
Screenshots
No response
Logs
No response
Additional context
No response