This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

[0.24] KafkaChannel to init offsets before dispatcher #886

Conversation

aliok
Member

@aliok aliok commented Sep 22, 2021

Fixes #549

Proposed Changes

  • Init offsets, similar to KafkaSource, to make sure the dispatcher doesn't fetch events from the latest offset when it starts (more details in the issue; a rough sketch of the idea follows this list)
  • Consumer groups are now started only after making sure the initial offsets are committed by the controller(s)
  • NOTE: The check for the initial offsets is done in the ConsumerFactory. This means KafkaSource adapters and the distributed channel can get this functionality for free; however, I didn't enable it for them.
  • Get rid of the status probing mechanism for subscription readiness in the Kafka channel
  • Needed to touch some distributed channel code as I changed some interfaces, but there are no business logic changes
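
To make the first bullet concrete, here is a minimal sketch of the offset-initialization idea using Sarama's Client, ClusterAdmin, and OffsetManager APIs. It is illustrative only: the function name, package, and error messages are made up here, and the real helper in pkg/common/kafka/offset/offsets.go handles more cases (logging, retries, topic/partition edge cases).

```go
package offsets

import (
	"fmt"

	"github.com/Shopify/sarama"
)

// initOffsets commits the current newest offset for every partition of the
// given topics that the consumer group has not committed an offset for yet.
// A dispatcher that joins the group afterwards then resumes from these
// committed offsets instead of "latest", so events produced before it starts
// are not skipped.
func initOffsets(client sarama.Client, admin sarama.ClusterAdmin, topics []string, group string) error {
	offsetManager, err := sarama.NewOffsetManagerFromClient(group, client)
	if err != nil {
		return err
	}
	defer offsetManager.Close()

	for _, topic := range topics {
		partitions, err := client.Partitions(topic)
		if err != nil {
			return fmt.Errorf("failed to get partitions for topic %q: %w", topic, err)
		}

		// Offsets the group has already committed for this topic.
		committed, err := admin.ListConsumerGroupOffsets(group, map[string][]int32{topic: partitions})
		if err != nil {
			return fmt.Errorf("failed to list offsets of group %q: %w", group, err)
		}

		for _, partition := range partitions {
			block := committed.GetBlock(topic, partition)
			if block != nil && block.Offset != -1 {
				continue // this partition already has a committed offset
			}

			// No committed offset yet: pin the group to the current end of the log.
			newest, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
			if err != nil {
				return fmt.Errorf("failed to get newest offset for %q/%d: %w", topic, partition, err)
			}
			pom, err := offsetManager.ManagePartition(topic, partition)
			if err != nil {
				return fmt.Errorf("failed to manage partition %q/%d: %w", topic, partition, err)
			}
			defer pom.AsyncClose()
			pom.MarkOffset(newest, "initialized by the channel controller")
		}
	}

	// Flush the marked offsets to Kafka before any consumer group is started.
	offsetManager.Commit()
	return nil
}
```

Once this has succeeded for all partitions, the controller can safely let the dispatcher's consumer groups start.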

Release Note

- 🐛 Fixed a bug that was causing the KafkaChannel dispatcher to miss some events when the offset for the consumer group was not initialized

Docs

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aliok

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 22, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Sep 22, 2021
@aliok aliok force-pushed the kafka-ch-event-loss-024 branch from 05189a6 to 1058793 on September 22, 2021 12:30
@aliok
Member Author

aliok commented Sep 22, 2021

/retest

@codecov

codecov bot commented Sep 22, 2021

Codecov Report

❗ No coverage uploaded for pull request base (release-0.24@bd92902). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 21ee494 differs from pull request most recent head 62cf6d0. Consider uploading reports for the commit 62cf6d0 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             release-0.24     #886   +/-   ##
===============================================
  Coverage                ?   74.06%           
===============================================
  Files                   ?      134           
  Lines                   ?     5954           
  Branches                ?        0           
===============================================
  Hits                    ?     4410           
  Misses                  ?     1319           
  Partials                ?      225           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd92902...62cf6d0. Read the comment docs.

@aliok
Member Author

aliok commented Sep 23, 2021

/test pull-knative-sandbox-eventing-kafka-unit-tests

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2021
@aliok aliok force-pushed the kafka-ch-event-loss-024 branch from ab79292 to 0903a03 on September 23, 2021 14:07
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2021
@aliok
Member Author

aliok commented Sep 27, 2021

/retest

Let's see what happens

@aliok
Member Author

aliok commented Sep 28, 2021

/test pull-knative-sandbox-eventing-kafka-integration-test-channel-consolidated

@devguyio
Contributor

/assign

Contributor

@devguyio devguyio left a comment


I did a short review, will continue later today.

@@ -477,7 +492,20 @@ func (r *Reconciler) reconcileChannelService(ctx context.Context, dispatcherName
 	return svc, nil
 }
 
-func (r *Reconciler) createClient(ctx context.Context) (sarama.ClusterAdmin, error) {
+func (r *Reconciler) createClients(ctx context.Context) (sarama.Client, sarama.ClusterAdmin, error) {
+	kafkaClient := r.kafkaClient
Contributor


You're not setting r.kafkaClient anywhere, is that intentional? Wouldn't that mean you're creating a new client with every reconcile loop?

Member Author

@aliok aliok Sep 28, 2021


I refactored the code that creates the admin client and created the normal client there in the same way.

For clusterAdmin, there's this ticket: IBM/sarama#1162. I see that ticket didn't really get any resolution.

I haven't tried reusing the regular kafkaClient. I can try that once I get the failing tests working.

Contributor


pretty bad sarama bug 😞

Contributor


Can we add a comment that this thing returns both clients? Normal and ClusterAdmin?

func (r *Reconciler) createSaramaClients instead? not just client?

Contributor


Do we really always need both? See Ahmed's comment about stashing and deferring the "regular" client.

Just wondering.

Contributor


One alternative would be to have two functions, one for each.

Member Author


We now have two functions. The code looks much better, thanks for the suggestions.

About stashing into r.kafkaClient: I will try this once I see tests are passing again.
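
For reference, a rough sketch of what the two-function split might look like, assuming Sarama; the receiver fields and function names below are illustrative placeholders, not the exact signatures in this PR:

```go
package controller

import "github.com/Shopify/sarama"

// Reconciler is trimmed down here; bootstrapServers and saramaConfig stand in
// for whatever the real reconciler actually carries.
type Reconciler struct {
	bootstrapServers []string
	saramaConfig     *sarama.Config
}

// createSaramaClient returns a plain Sarama client, e.g. for reading partition
// metadata and the newest offsets.
func (r *Reconciler) createSaramaClient() (sarama.Client, error) {
	return sarama.NewClient(r.bootstrapServers, r.saramaConfig)
}

// createSaramaAdminClient returns a cluster admin client, e.g. for topic
// management and listing consumer group offsets.
func (r *Reconciler) createSaramaAdminClient() (sarama.ClusterAdmin, error) {
	return sarama.NewClusterAdmin(r.bootstrapServers, r.saramaConfig)
}
```

Each caller can then create only the client it needs and close it when done.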

Member Author


OK, when I stash the client into r.kafkaClient and reuse it, things break. I have no desire to debug why.

See my comment here: #886 (comment)

You can have a look at the commit list here, https://github.com/knative-sandbox/eventing-kafka/pull/886/commits, and see the job history for each commit there (not always consistent, especially when there's new code pushed while the jobs were running).

Contributor


I have no desire to debug why.

That is OK with me. That said, would you mind creating an issue to track this?


@aliok
Member Author

aliok commented Sep 28, 2021

/test pull-knative-sandbox-eventing-kafka-integration-test-channel-consolidated

Retesting to see if there's a pattern

@aliok
Member Author

aliok commented Oct 1, 2021

/test pull-knative-sandbox-eventing-kafka-integration-test-channel-consolidated
/test pull-knative-sandbox-eventing-kafka-integration-test-channel-consolidated-sasl
/test pull-knative-sandbox-eventing-kafka-integration-test-channel-consolidated-tls

Stashing the kafka client broke things. I am gonna try here again, before reverting commit 4d57547.

@aliok
Member Author

aliok commented Oct 1, 2021

Stashing the kafka client broke things. I am gonna try here again, before reverting commit 4d57547.

I reverted it now. See my comment here: #886 (comment)

@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-sandbox-eventing-kafka-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/channel/consolidated/dispatcher/dispatcher.go 64.5% 66.1% 1.6
pkg/channel/consolidated/reconciler/controller/kafkachannel.go 54.8% 54.2% -0.6
pkg/channel/distributed/dispatcher/controller/kafkachannel.go 87.5% 71.8% -15.7
pkg/channel/distributed/dispatcher/dispatcher/dispatcher.go 97.7% 97.5% -0.1
pkg/common/consumer/consumer_factory.go 87.2% 84.8% -2.4
pkg/common/consumer/consumer_manager.go 96.9% 98.7% 1.8
pkg/common/consumer/consumergroup_offsets_checker.go Do not exist 4.0%
pkg/common/kafka/offset/offsets.go 81.1% 81.5% 0.4
pkg/source/adapter/adapter.go 52.1% 55.4% 3.3

@knative-prow-robot
Contributor

knative-prow-robot commented Oct 1, 2021

@aliok: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-knative-sandbox-eventing-kafka-go-coverage 62cf6d0 link false /test pull-knative-sandbox-eventing-kafka-go-coverage

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aliok
Member Author

aliok commented Oct 1, 2021

/test pull-knative-sandbox-eventing-kafka-unit-tests

TestEnableSaramaLogging

@devguyio
Contributor

devguyio commented Oct 1, 2021

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2021
@knative-prow-robot knative-prow-robot merged commit 3f2a9d7 into knative-extensions:release-0.24 Oct 1, 2021
@aliok aliok deleted the kafka-ch-event-loss-024 branch October 1, 2021 12:23
aliok added a commit to aliok/eventing-kafka that referenced this pull request Oct 1, 2021
[0.24] KafkaChannel to init offsets before dispatcher (knative-extensions#886)
knative-prow-robot pushed a commit that referenced this pull request Oct 1, 2021
* Cherry pick 3f2a9d7

[0.24] KafkaChannel to init offsets before dispatcher (#886)

* ./hack/update-codegen.sh
aliok added a commit to aliok/eventing-kafka that referenced this pull request Oct 4, 2021
…ions#886)

* Init offsets in Kafka channel controller - first iteration

* Do not check prober for subscription readiness

* Better code style

* Get rid of probing for checking subscription readiness

* Get rid of unused deps

* Ooops, fixed the tests

* Pass context along the functions, which will be necessary later

* Fix unit test

* Move partition retrieval into a separate function, which is going to be used later

* Check if offsets are initialized in the consumerFactory

* IDE "extracts" method LOL

* Make unit tests working

* Move MockClusterAdmin into the common package and reuse it

* Copy paste tests for CheckIfAllOffsetsInitialized

* Unify tests for CheckIfAllOffsetsInitialized and InitOffsets

* Separate tests for CheckIfAllOffsetsInitialized and InitOffsets

* Do not block main reconciliation thread for offset checking

* Remove last crumbs of probing

* Change log level for offset init message

* Move some consts to right place

* Rename checkOffsets.. func to WaitForOffsets...

* Rename consumerOffsetInitializer to ConsumerGroupOffsetsChecker

* Do not handle deleted topics or partitions when checking the offsets

* Copy the partitions array when retrieving partitions

* Address comments

* Separate client and clusteradminclient creation

* Stash kafka client

* Revert "Stash kafka client"

This reverts commit 4d57547

* Do not do any offset initialization when subscription is already marked ready
openshift-merge-robot pushed a commit to openshift-knative/eventing-kafka that referenced this pull request Oct 7, 2021
…ions#886) (#383)

* [0.24] KafkaChannel to init offsets before dispatcher  (knative-extensions#886)

* Init offsets in Kafka channel controller - first iteration

* Do not check prober for subscription readiness

* Better code style

* Get rid of probing for checking subscription readiness

* Get rid of unused deps

* Ooops, fixed the tests

* Pass context along the functions, which will be necessary later

* Fix unit test

* Move partition retrieval into a separate function, which is going to be used later

* Check if offsets are initialized in the consumerFactory

* IDE "extracts" method LOL

* Make unit tests working

* Move MockClusterAdmin into the common package and reuse it

* Copy paste tests for CheckIfAllOffsetsInitialized

* Unify tests for CheckIfAllOffsetsInitialized and InitOffsets

* Separate tests for CheckIfAllOffsetsInitialized and InitOffsets

* Do not block main reconciliation thread for offset checking

* Remove last crumbs of probing

* Change log level for offset init message

* Move some consts to right place

* Rename checkOffsets.. func to WaitForOffsets...

* Rename consumerOffsetInitializer to ConsumerGroupOffsetsChecker

* Do not handle deleted topics or partitions when checking the offsets

* Copy the partitions array when retrieving partitions

* Address comments

* Separate client and clusteradminclient creation

* Stash kafka client

* Revert "Stash kafka client"

This reverts commit 4d57547

* Do not do any offset initialization when subscription is already marked ready

* [0.24] KafkaChannel dispatcher offset checking improvements (knative-extensions#924)

* Change poll loop

Signed-off-by: Pierangelo Di Pilato <[email protected]>

* Change poll loop and requeue channel on failure

* Get rid of unused func

* Fix unit tests

Co-authored-by: Pierangelo Di Pilato <[email protected]>

Co-authored-by: Pierangelo Di Pilato <[email protected]>
aliok added a commit to aliok/eventing-kafka that referenced this pull request Oct 8, 2021
…ions#913)

* Cherry pick 3f2a9d7

[0.24] KafkaChannel to init offsets before dispatcher (knative-extensions#886)

* ./hack/update-codegen.sh
openshift-merge-robot pushed a commit to openshift-knative/eventing-kafka that referenced this pull request Oct 11, 2021
* [0.25] KafkaChannel to init offsets before dispatcher  (knative-extensions#913)

* Cherry pick 3f2a9d7

[0.24] KafkaChannel to init offsets before dispatcher (knative-extensions#886)

* ./hack/update-codegen.sh

* [0.25] KafkaChannel dispatcher offset checking improvements (knative-extensions#929)

* Fix Kafka channel event loss during subscription becoming ready

* Make it look like knative-extensions#926