Broker disconnects cause database to crash. #18
Is this a 0.9 Kafka issue? What version of brokers do you support? |
Unsure what's causing this. I'll look into it and report back. |
Two (different) instances of Segmentation fault. Maybe related to the reported issue.
|
Is there some way I can repro this? |
I have restarted the consumer with format = 'json' and have not seen any errors so far. Our payload is quite big; messages of ~20K-30K characters are not unusual, sometimes even bigger. It could be some buffer overflow when `format = 'text'`, but I am speculating. I'll keep watching the stream overnight and update the thread. |
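For reference, switching formats amounts to stopping the consumer and starting it again with a different format argument. A minimal sketch, with placeholder topic and stream names, and assuming pipeline_kafka.consume_end takes the same topic/stream pair as consume_begin:

-- stop the existing text-format consumer ('my_topic' and 'my_stream' are placeholders)
SELECT pipeline_kafka.consume_end('my_topic', 'my_stream');
-- start it again, parsing each message as JSON instead of delimited text
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream', format := 'json');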
I have been watching it for some time now, but can't pinpoint the source of the error.
Here is an example. Notice the timestamp sequence.
Most of the time the CONTEXT: COPY line looks legitimate, but I can only see the beginning of the message.
Cheers, |
Thanks so much for the postmortem report, @vryzhov! I'll look into this tomorrow and report my findings + fix the issue. |
Hi @vryzhov, this could possibly be a memory leak issue, which I am currently looking into as well. To determine this, you can use htop and watch the memory being consumed by the consumer processes. If it continually increases, then there is a leak. You may also get messages like this in your kernel ring buffer:
|
Yes, you are right, I can see out-of-memory errors.
They are killed by signal 9:
Most of the errors are segmentation faults, signal 11:
|
This explains the magic 10 minute interval you're seeing: confluentinc/librdkafka#437 |
Thanks for digging. Can we confirm the failure is a memory leak?
|
@usmanm: Thank you for the link and the reference to the librdkafka page. Here are more logs illustrating the issue.
@loadzero, @jofusa: Yes, the memory leak is real. |
I'm trying to repro the memory leak, but that doesn't seem to be the issue caused by the timeouts. The segfaults, I think, are being caused by the Postgres logging system not being thread safe (even though there's a mutex around the elog function call). |
@vryzhov: Could you give me details of your setup? # of brokers, # partitions and parallelism in pipeline_kafka.consume_begin? |
3 brokers, 3 partitions, parallel = 1 |
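For concreteness, that corresponds to a consume_begin call along these lines; a minimal sketch with placeholder topic and stream names, using the same named parameters that appear in the full call later in this thread:

-- one consumer process reading a 3-partition topic ('my_topic'/'my_stream' are placeholders)
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream',
  format := 'text', parallelism := 1);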
@usmanm: I suspected that there was something weird with the logging. Log messages are not always the same; some of them may go missing. Last night I added some tracing in the code by calling |
I'm still trying to repro the segfault and memory leak. Once @loadzero gets in, I'll ask him to help me repro the leak. Do these broker disconnects happen during idle periods? |
In my example it was during high load, replaying a consumer from the start.
|
I was also replaying from the beginning. Still catching up.
|
I'm going to push a fix out in a bit! |
Can you guys try building the |
Will do later today.
|
I have had it running for a little over 2 hours now. It all looks pretty awesome so far. No issues at all. Memory is stable. I see "Receive failed: Disconnected" messages in the log, but no segfaults. I'll keep it running overnight and report tomorrow. |
Great, let me know if you see any issues! |
You're welcome! Happy to hear things are working smoothly now. Thanks for all your help in figuring this out! |
Hi, I am seeing the same crash. My setup:

CREATE STREAM kafka_userlog_stream (time_tamp bigint, uuid text, age bigint, a1 text, a2 bigint, a3 numeric, a4 bool);

CREATE CONTINUOUS VIEW LOG_COUNT_VIEW WITH (sw = '30 minute') AS
  SELECT count(*) FROM kafka_userlog_stream;

CREATE TABLE DIM_AGE (begin_age int, end_age int, catalog varchar(300), PRIMARY KEY (begin_age, end_age));

CREATE CONTINUOUS TRANSFORM CT_USER_AGE_CATALOG_TOKAFKA AS
  SELECT s.time_tamp::bigint, s.uuid::text, a.begin_age::int, s.a2::text, a.catalog::text
  FROM kafka_userlog_stream s
  JOIN DIM_AGE a ON s.age >= a.begin_age AND s.age <= a.end_age
  THEN EXECUTE PROCEDURE pipeline_kafka.emit_tuple('pipelinedbTriggerTest');

SELECT pipeline_kafka.consume_begin('pipelineUserlog', 'kafka_userlog_stream',
  format := 'text', delimiter := E'|', batchsize := 1000,
  maxbytes := 32000000, parallelism := 5, start_offset := '-1');

Error log:

LOG: worker process: worker0 [pipeline] (PID 16887) was terminated by signal 11: Segmentation fault |
This is using the latest Dockerfile in pipelinedb.
Pipeline: 9.2
librdkafka: 0.8
Kafka brokers: 0.9.0.1