KAFKA-18058: Share group state record pruning impl. #18014
Conversation
// reject delete records operation on internal topics except for allow listed ones
if (Topic.isInternal(topicPartition.topic) && !config.internalTopicsRecordDeleteAllowList.contains(topicPartition.topic)) {
In my opinion, this is the wrong approach because it will allow users of the cluster to delete too. When I suggested it, I thought that we would add a boolean to the method, e.g. allowInternalTopics, and use it from the component doing the deletion.
Understood, will rectify
Thanks for the PR. I have done a partial review and left some comments.
@@ -71,6 +72,10 @@ public class ShareCoordinatorConfig {
public static final int APPEND_LINGER_MS_DEFAULT = 10;
public static final String APPEND_LINGER_MS_DOC = "The duration in milliseconds that the share coordinator will wait for writes to accumulate before flushing them to disk.";

public static final String STATE_TOPIC_PRUNE_INTERVAL = "share.coordinator.state.topic.prune.interval.ms";
This should be STATE_TOPIC_PRUNE_INTERVAL_MS_CONFIG, I think. All three of these should have _MS_ as part of the name.
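For instance, the renamed trio could look roughly like this (the doc string below is illustrative; the 5-minute default is the one mentioned in the PR description):

public static final String STATE_TOPIC_PRUNE_INTERVAL_MS_CONFIG = "share.coordinator.state.topic.prune.interval.ms";
public static final int STATE_TOPIC_PRUNE_INTERVAL_MS_DEFAULT = 5 * 60 * 1000; // 5 minutes
public static final String STATE_TOPIC_PRUNE_INTERVAL_MS_DOC = "The duration in milliseconds between share-group state topic prune job runs.";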
trace("Delete records on local logs to offsets [%s]".format(offsetPerPartition)) | ||
offsetPerPartition.map { case (topicPartition, requestedOffset) => | ||
// reject delete records operation on internal topics | ||
if (Topic.isInternal(topicPartition.topic)) { | ||
// reject delete records operation for internal topics if allowInternalTopicDeletion is false |
probably "unless allowInternalTopicDeletion is true" is clearer.
/**
 * Delete records from a topic partition until specified offset
 * @param tp The partition to delete records from
 * @param deleteUntilOffset Offset to delete until, starting from the beginning
I suggest deleteBeforeOffset rather than deleteUntilOffset. The former is clearly non-inclusive, while the latter is a bit more ambiguous. I think the effect we want here is that if I provide offset 10, then offsets up to and including 9 may be deleted, but not 10.
void deleteRecords(
    TopicPartition tp,
    long deleteUntilOffset,
    boolean allowInternalTopicDeletion
This parameter is missing from the javadoc.
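Folding in the earlier naming suggestion as well, the javadoc and signature might read roughly as follows (wording is a sketch, not the final PR text):

/**
 * Delete records from a topic partition up to, but not including, the given offset.
 *
 * @param tp                         The partition to delete records from.
 * @param deleteBeforeOffset         Offsets strictly smaller than this value may be deleted.
 * @param allowInternalTopicDeletion Whether deletion is permitted when tp is an internal topic.
 */
void deleteRecords(
    TopicPartition tp,
    long deleteBeforeOffset,
    boolean allowInternalTopicDeletion
);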
timer.add(new TimerTask(config.shareCoordinatorTopicPruneIntervalMs()) {
    @Override
    public void run() {
        for (int i = 0; i < numPartitions; i++) {
Would it not be the case that each shard should set up the pruning for its owned partitions? Shouldn't the pruning stop when a shard loses leadership of a partition?
No - the shard (state machine) does not maintain that mapping. The information is maintained by the runtime which calls the appropriate shards, based on the context. This information is not exposed outside. We need access to https://github.com/apache/kafka/blob/trunk/coordinator-common/src/main/java/org/apache/kafka/coordinator/common/runtime/CoordinatorRuntime.java#L1877 to expose this information. Even then the shard cannot do this.
The runtime is encapsulated in the ShareCoordinatorService and only it can issue calls to the runtime. The Shard only serves to provide data related to partitions.
Using the loop approach - for a specific internal topic-partition only the correct Shard will honour the request and the others will fail silently due to NOT_COORDINATOR.
Flow: ShareCoordinatorService --(add task with correct shard)--> Runtime --> QUEUE --> EventProcessor --(obtain shard from TP in task)--> ShareCoordinatorShard.callback
Thanks for the explanation.
I wonder if we should expose a method to get the list of active state machines/shards from the runtime. This would allow you to just iterate on it instead of having to list all the possibilities.
Yes, I've been thinking about this. I'm not entirely comfortable with every broker starting a timer for every partition. I know it's harmless, but it's not exactly elegant.
@dajac we don't need the coordinators, but the list of topic partitions whose coordinators are ACTIVE at the point of execution of the timer job.
In CoordinatorRuntime:
public List<TopicPartition> activeTopicPartitions() {
    return coordinators.entrySet().stream()
        .filter(entry -> entry.getValue().state.equals(CoordinatorState.ACTIVE))
        .map(Map.Entry::getKey)
        .toList();
}
Caller:
activeTopicPartitions.forEach(tp -> runtime.schedule(tp, ...))
@AndrewJSchofield @dajac removed the loop in favor of active tps
if (exception != null) {
    log.error("Last redundant offset lookup threw an error.", exception);
    return;
}
Here, I suggest handling the known errors such as NOT_COORDINATOR, COORDINATOR_LOADING, etc. As you optimistically send the write operations to all the possible shards, you will get many "unknown" ones.
I will remove the logging for the above errors - they will happen in every case. Since the timer job is periodic, we do not need any special handling anyway.
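For what it's worth, a minimal sketch of filtering out the expected errors rather than logging them (the exact set of "expected" errors here is an assumption):

if (exception != null) {
    Errors error = Errors.forException(exception);
    // NOT_COORDINATOR and COORDINATOR_LOAD_IN_PROGRESS are expected whenever this
    // broker is not (or not yet) the coordinator for the partition; skip logging those.
    if (error != Errors.NOT_COORDINATOR && error != Errors.COORDINATOR_LOAD_IN_PROGRESS) {
        log.error("Last redundant offset lookup threw an error.", exception);
    }
    return;
}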
timeout = 0L,
offsetPerPartition = Map(tp -> deleteBeforeOffset),
responseCallback = results => deleteResults = results,
allowInternalTopicDeletion
I wonder if we could just set it to true here as this class is only used by the coordinator.
assertEquals(1, manager.curState().size());
verify(manager, times(5)).purge();
}
}
nit: We usually have an empty line at the end.
This reverts commit 78d8c49.
@smjn : Thanks for the updated PR. Made a pass of all files. A few more comments.
new TopicPartition("random-topic", 0), | ||
10L | ||
).whenComplete { (_, exp) => | ||
assertTrue(exp.isInstanceOf[IllegalStateException]) |
If a topic doesn't exist, we should get an unknown topic exception.
partition.createLogIfNotExists(isNew = false, isFutureReplica = false,
  new LazyOffsetCheckpoints(rm.highWatermarkCheckpoints.asJava), None)

rm.becomeLeaderOrFollower(0, new LeaderAndIsrRequest.Builder(ApiKeys.LEADER_AND_ISR.latestVersion, 0, 0, brokerEpoch,
becomeLeaderOrFollower is only used by the ZK-based controller, which won't be supported in 4.0. We need to use the code path for the KRaft-based controller.
List.of(
    ShareOffsetTestHolder.TestTuple.instance(KEY1, 10L, Optional.empty()),
    ShareOffsetTestHolder.TestTuple.instance(KEY4, 11L, Optional.empty()),
    ShareOffsetTestHolder.TestTuple.instance(KEY2, 13L, Optional.empty())
This test result is counterintuitive. I'd expect lastRedundantOffset for all three to be 10L since we should be able to truncate the log at offset 10.
Small efficiency gain here as the offsets will be applied in increasing order (partition records auto-increment the offset) to the updateState method. If 10L is the smallest one, it means there are no offsets smaller than that present in the topic partition, hence returning 10L will be of no consequence and will save an extra deleteRecords call.
In the algorithm we have chosen to get at least 2 offsets for any key before exposing the redundant offset.
Also, since we are exposing the offset only once and then setting the boolean flag, the 2nd and 3rd line should be empty.
),

new ShareOffsetTestHolder(
    "redundant state cold partition",
What does cold partition mean?
infrequently written
@@ -274,6 +290,7 @@ public void testReadStateSuccess() throws ExecutionException, InterruptedExcepti
)))
);

extra new line
@Test
public void testRecordPruningTaskPeriodicityWithAllSuccess() throws Exception {
    CoordinatorRuntime<ShareCoordinatorShard, CoordinatorRecord> runtime = mockRuntime();
    org.apache.kafka.server.util.MockTime time = new org.apache.kafka.server.util.MockTime();
Could we import MockTime?
 * After returning the value once, the redundant offset is reset.
 * @return Optional of type Long representing the offset or empty for invalid offset values
 */
public Optional<Long> lastRedundantOffset() {
This method is ok, but it is very customized for its current only caller. If there is another caller, it can unexpectedly break the existing one. A better API is probably to always expose the lastRedundantOffset and let the caller handle the case when the same value is returned.
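A sketch of that shape, assuming the backing field is the TimelineLong discussed below (names are illustrative):

public Optional<Long> lastRedundantOffset() {
    // Always expose the current value; callers de-duplicate repeated values themselves.
    long value = lastRedundantOffset.get();
    if (value <= 0 || value == Long.MAX_VALUE) {
        return Optional.empty();
    }
    return Optional.of(value);
}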
@smjn : Thanks for the updated PR. A few more comments.
public ShareCoordinatorOffsetsManager(SnapshotRegistry snapshotRegistry) {
    Objects.requireNonNull(snapshotRegistry);
    offsets = new TimelineHashMap<>(snapshotRegistry, 0);
    minOffset = new TimelineLong(snapshotRegistry);
minOffset => lastRedundantOffset ?
minOffset.set(Math.min(minOffset.get(), offset));
offsets.put(key, offset);

Optional<Long> deleteTillOffset = findRedundantOffset();
deleteTillOffset => redundantOffset ?
if (result.isPresent()) {
    Long off = result.get();
    // Guard and optimization.
    if (off == Long.MAX_VALUE || off <= 0) {
This test seems redundant since ShareCoordinatorOffsetsManager.lastRedundantOffset does that already.
@@ -240,9 +251,96 @@ public void startup(

log.info("Starting up.");
numPartitions = shareGroupTopicPartitionCount.getAsInt();
Map<TopicPartition, Long> offsets = new ConcurrentHashMap<>();
- offsets => lastPrunedOffsets?
- Would it be better to make that an instance val so that we don't have to pass it around?
- Should we remove entries when onResignation is called?
@@ -6660,6 +6661,61 @@ class ReplicaManagerTest {
}
}

@Test
def testDeleteRecordsInternalTopicDeleteDisallowed(): Unit = {
There are lots of existing usages of rm.becomeLeaderOrFollower. It would be useful to clean them up in a follow-up jira.
perhaps
private CompletableFuture<Void> performRecordPruning(TopicPartition tp, Map<TopicPartition, Long> offsets) {
    // This future will always be completed normally, exception or not.
    CompletableFuture<Void> fut = new CompletableFuture<>();
    runtime.scheduleWriteOperation(
This call doesn't do any writes in runtime. Should we use scheduleReadOperation? Similarly, I also don't understand why readState calls runtime.scheduleWriteOperation.
No, the write operation used here is for the write consistency offered by the method.
The ShareCoordinatorShard.replay method calls offsetsManager.updateState with various last written offset values. The replay method itself is called when other write RPCs produce records. However, that does not mean the offset set in replay has been committed.
Now, the coordinator enqueues the write operations in a queue and guarantees that when scheduleWriteOperation completes, the records it generated have been replicated, even those which were written before it.
The framework, however, gives no consistency guarantees between write and read operations. Consider a write op writing an offset into the offsets manager. We only know that this offset is written, not that it is replicated. A subsequent read could give us back the same offset, but there is still no guarantee that this offset has been replicated. It is only when the next write operation completes that we have a guarantee that the previous offset has been committed.
This was extensively discussed with the coordinator framework owner @dajac and we arrived at this solution.
fut.complete(null);
// Update offsets map as we do not want to
// issue repeated deletes.
offsets.put(tp, off);
Since this is called from the purgatory thread, it's possible when this call occurs, the partition leader has already resigned and we need to handle that accordingly.
I don't foresee any consistency issues with that - at most a repeated delete call might be made, which is acceptable since the frequency of these calls is very low (in minutes).
Whoever the leader is, the partition offsets will not change.
@smjn : Thanks for the updated PR. Just one more comment.
@@ -543,6 +629,7 @@ public void onElection(int partitionIndex, int partitionLeaderEpoch) {
@Override
public void onResignation(int partitionIndex, OptionalInt partitionLeaderEpoch) {
    throwIfNotActive();
    lastPrunedOffsets.clear();
We should only clear the entry for partitionIndex, right?
Yes, makes sense
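Something along these lines (assuming lastPrunedOffsets is keyed by TopicPartition and the internal topic name constant is Topic.SHARE_GROUP_STATE_TOPIC_NAME):

@Override
public void onResignation(int partitionIndex, OptionalInt partitionLeaderEpoch) {
    throwIfNotActive();
    // Clear only the entry for the resigned partition, not the whole map.
    lastPrunedOffsets.remove(new TopicPartition(Topic.SHARE_GROUP_STATE_TOPIC_NAME, partitionIndex));
    // ... rest of the existing resignation handling unchanged ...
}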
@smjn : Thanks for the updated PR. LGTM. @AndrewJSchofield and @dajac, any other comments from you?
@junrao No. All good to me. Thanks!
@junrao Thanks for the review. I'll take another look this evening and merge once I'm happy.
A handful of final review comments. I've been running with the pruning enabled and it's happily deleting records as intended.
if (result.isPresent()) {
    Long off = result.get();

    if (lastPrunedOffsets.containsKey(tp) && Objects.equals(lastPrunedOffsets.get(tp), off)) {
I suggest lastPrunedOffsets.get(tp).longValue() == off. Using the generic object equality method seems odd for just a pair of longs.
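i.e. roughly (assuming off is the Long unboxed above, and with an explicit null check in place of containsKey):

Long lastPruned = lastPrunedOffsets.get(tp);
if (lastPruned != null && lastPruned.longValue() == off) {
    log.debug("{} already pruned at offset {}", tp, off);
}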
Time time) {
Time time,
Timer timer,
PartitionWriter writer) {
nit: New line please so the arguments and the method body do not run into each other.
log.info("Startup complete."); | ||
} | ||
|
||
private void setupRecordPruning() { | ||
log.info("Scheduling share state topic prune job."); |
share-group state topic is what we use in most places.
CompletableFuture.allOf(futures.toArray(new CompletableFuture[]{}))
    .whenComplete((res, exp) -> {
        if (exp != null) {
            log.error("Received error in share state topic prune.", exp);
share-group state topic
Long off = result.get();

if (lastPrunedOffsets.containsKey(tp) && Objects.equals(lastPrunedOffsets.get(tp), off)) {
    log.debug("{} already pruned at offset {}", tp, off);
You've used till in most places, so I'd replace the at here too.
@@ -50,6 +50,7 @@ private static Map<String, String> testConfigMapRaw() {
configs.put(ShareCoordinatorConfig.LOAD_BUFFER_SIZE_CONFIG, "555");
configs.put(ShareCoordinatorConfig.APPEND_LINGER_MS_CONFIG, "10");
configs.put(ShareCoordinatorConfig.STATE_TOPIC_COMPRESSION_CODEC_CONFIG, String.valueOf(CompressionType.NONE.id));
configs.put(ShareCoordinatorConfig.STATE_TOPIC_PRUNE_INTERVAL_MS_CONFIG, "30000"); // 30 seconds
This class doesn't contain any tests, so calling it XYZTest is peculiar. Please rename to ShareCoordinatorTestUtils or similar.
@AndrewJSchofield Thanks for the review, incorporated comments.
}
fut.complete(null);
// Best effort prevention of issuing duplicate delete calls.
lastPrunedOffsets.put(tp, off);
This approach is ok, but it breaks the pattern in CoordinatorRuntime that all internal states are updated by CoordinatorRuntime threads, since lastPrunedOffsets is now updated by the request I/O threads. We probably could follow the approach of how CoordinatorRuntime waits for a record to be replicated. It appends the record to the local log and registers a PartitionListener for onHighWatermarkUpdated(). Once a new HWM is received, CoordinatorRuntime.onHighWatermarkUpdated enqueues a CoordinatorInternalEvent, the processing of which will trigger the update of the internal state. We could follow a similar approach for LowWaterMark. This can be done in a follow-up jira.
While my initial approach was to add functionality in CoordinatorRuntime, the requirement wasn't general enough to modify the runtime.
I'm sure this is not the final state of this code, but it's definitely a fine starting point.
In this PR, we've added a class ShareCoordinatorOffsetsManager, which tracks the last redundant offset for each share group state topic partition. We have also added a periodic timer job in ShareCoordinatorService which queries for the redundant offset at regular intervals and, if a valid value is found, issues the deleteRecords call to the ReplicaManager via the PartitionWriter. In this way the size of the partitions is kept manageable. Reviewers: Jun Rao <[email protected]>, David Jacot <[email protected]>, Andrew Schofield <[email protected]>
What
- In this PR, we've added a class ShareCoordinatorOffsetsManager, which tracks the last redundant offset for each share group state topic partition.
- We have also added a periodic timer job in ShareCoordinatorService which queries for the redundant offset at regular intervals and, if a valid value is found, issues the deleteRecords call to the ReplicaManager via the PartitionWriter. In this way the size of the partitions is kept manageable.
- The periodicity of the job is configurable via share.coordinator.state.topic.prune.interval.ms (default 5 mins).
Why
- SharePartition invokes the DefaultStatePersister to write data into the internal share group topic __share_group_state. This topic is not eligible for compaction.
- __share_group_state will be populated with a gigantic amount of records.
- Loading these records into the ShareCoordinatorShard results in extensive latency during startup.
Testing
- Unit tests for CoordinatorPartitionWriter, ShareCoordinatorOffsetsManager, ShareCoordinatorService and ShareCoordinatorShard.
Sample o/p
- Broker logs
- Records before prune
- Records after prune