
KAFKA-18058: Share group state record pruning impl. #18014

Merged
merged 38 commits into from
Dec 12, 2024

Conversation

smjn
Contributor

@smjn smjn commented Dec 3, 2024

What

  • In this PR, we've added a class ShareCoordinatorOffsetsManager, which tracks the last redundant offset for each share-group state topic partition. We have also added a periodic timer job in ShareCoordinatorService which queries for the redundant offset at regular intervals and, if a valid value is found, issues a deleteRecords call to the ReplicaManager via the PartitionWriter. In this way the size of the partitions is kept manageable (see the sketch after this list).
  • Introduced a new config share.coordinator.state.topic.prune.interval.ms (default 5 minutes).
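
A minimal, self-contained sketch of that periodic prune loop is shown below (illustration only; the scheduler, the OffsetsView/Writer interfaces and the method names are simplified assumptions for this sketch, not the PR's actual CoordinatorRuntime/PartitionWriter APIs):

    import java.time.Duration;
    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PruneJobSketch {
        // Illustrative stand-ins for the offsets manager and the partition writer.
        interface OffsetsView {
            Optional<Long> lastRedundantOffset(String topicPartition);
        }

        interface Writer {
            void deleteRecords(String topicPartition, long deleteBeforeOffset, boolean allowInternalTopicDeletion);
        }

        private final Map<String, Long> lastPrunedOffsets = new ConcurrentHashMap<>();

        void schedule(ScheduledExecutorService timer, Duration interval,
                      Iterable<String> activePartitions, OffsetsView offsets, Writer writer) {
            timer.scheduleAtFixedRate(() -> {
                for (String tp : activePartitions) {
                    offsets.lastRedundantOffset(tp).ifPresent(offset -> {
                        // Skip if nothing new became redundant since the last prune.
                        if (!offset.equals(lastPrunedOffsets.get(tp))) {
                            // Records before 'offset' have been superseded by newer snapshots.
                            writer.deleteRecords(tp, offset, true);
                            lastPrunedOffsets.put(tp, offset);
                        }
                    });
                }
            }, interval.toMillis(), interval.toMillis(), TimeUnit.MILLISECONDS);
        }
    }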

Why

  • Currently, the SharePartition invokes the DefaultStatePersister to write data into the internal share-group topic __share_group_state. This topic is not eligible for compaction.
  • If the Kafka cluster runs for a long period of time, the topic partitions in __share_group_state will accumulate a very large number of records.
  • In that scenario, if a broker or the cluster restarts, all of the records must be replayed on the leader ShareCoordinatorShard, resulting in significant startup latency.
  • One observation is that the records follow a periodic snapshot pattern: for each share-partition key, a snapshot record is written, followed by a few incremental updates, and after a threshold number of updates a new snapshot is written again.
  • This implies that older records become redundant after a while, since they've been superseded by newer snapshots. This gives us an opportunity to relieve some pressure on the partitions (see the sketch after this list).
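
To make the redundancy idea concrete, here is a tiny illustrative sketch (not the PR's ShareCoordinatorOffsetsManager, which uses timeline data structures): track the offset of the latest snapshot written per share-partition key; everything before the minimum of those offsets has been superseded and can safely be deleted.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Illustrative only: offsets of the latest snapshot record per share-partition key.
    // The smallest of those offsets is the first record that must be retained,
    // so everything before it is redundant.
    class RedundantOffsetSketch {
        private final Map<String, Long> latestSnapshotOffsetPerKey = new HashMap<>();

        void updateState(String sharePartitionKey, long offset) {
            latestSnapshotOffsetPerKey.put(sharePartitionKey, offset);
        }

        Optional<Long> lastRedundantOffset() {
            return latestSnapshotOffsetPerKey.values().stream().min(Long::compare);
        }
    }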

Testing

  • Added appropriate tests for CoordinatorPartitionWriter, ShareCoordinatorOffsetsManager, ShareCoordinatorService and ShareCoordinatorShard.
  • Extensive manual testing.

Sample output

Broker logs

[2024-11-29 15:21:38,289] INFO [UnifiedLog partition=__share_group_state-3, dir=/tmp/kraft-combined-logs] Incremented log start offset to 10 due to client delete records request (kafka.log.UnifiedLog)

Records before prune

{"key":{"version":0,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":0,"data":{"snapshotEpoch":0,"stateEpoch":0,"leaderEpoch":0,"startOffset":5,"stateBatches":[{"firstOffset":5,"lastOffset":6,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":0,"leaderEpoch":0,"startOffset":7,"stateBatches":[{"firstOffset":7,"lastOffset":8,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":0,"leaderEpoch":0,"startOffset":9,"stateBatches":[{"firstOffset":9,"lastOffset":9,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":0,"leaderEpoch":0,"startOffset":10,"stateBatches":[{"firstOffset":10,"lastOffset":11,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":0,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":0,"data":{"snapshotEpoch":1,"stateEpoch":0,"leaderEpoch":0,"startOffset":12,"stateBatches":[{"firstOffset":12,"lastOffset":14,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":1,"leaderEpoch":0,"startOffset":15,"stateBatches":[{"firstOffset":15,"lastOffset":16,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":1,"leaderEpoch":0,"startOffset":17,"stateBatches":[{"firstOffset":17,"lastOffset":19,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":0,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":0,"data":{"snapshotEpoch":2,"stateEpoch":0,"leaderEpoch":0,"startOffset":20,"stateBatches":[{"firstOffset":20,"lastOffset":21,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":2,"leaderEpoch":0,"startOffset":22,"stateBatches":[{"firstOffset":22,"lastOffset":27,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":1,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":1,"data":{"snapshotEpoch":2,"leaderEpoch":0,"startOffset":28,"stateBatches":[{"firstOffset":28,"lastOffset":28,"deliveryState":2,"deliveryCount":1}]}}}{"key":{"version":0,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":0,"data":{"snapshotEpoch":3,"stateEpoch":0,"leaderEpoch":0,"startOffset":29,"stateBatches":[{"firstOffset":29,"lastOffset":29,"deliveryState":2,"deliveryCount":1}]}}}

Records after prune

{"key":{"version":0,"data":{"groupId":"gs1","topicId":"o_eD4bB7RNauIV7oPkNRCA","partition":0}},"value":{"version":0,"data":{"snapshotEpoch":3,"stateEpoch":0,"leaderEpoch":0,"startOffset":29,"stateBatches":[{"firstOffset":29,"lastOffset":29,"deliveryState":2,"deliveryCount":1}]}}}

@github-actions github-actions bot added core Kafka Broker small Small PRs labels Dec 3, 2024
@github-actions github-actions bot added KIP-932 Queues for Kafka and removed small Small PRs labels Dec 3, 2024
Comment on lines 1178 to 1179
// reject delete records operation on internal topics except for allow listed ones
if (Topic.isInternal(topicPartition.topic) && !config.internalTopicsRecordDeleteAllowList.contains(topicPartition.topic)) {
Member

In my opinion, this is the wrong approach because it will allow users of the cluster to delete too. When I suggested it, I thought that we would add a boolean to the method, e.g. allowInternalTopics, and use it from the component doing the deletion.

Contributor Author

Understood, will rectify

@smjn smjn requested a review from dajac December 3, 2024 10:50
Member

@AndrewJSchofield AndrewJSchofield left a comment

Thanks for the PR. I have done a partial review and left some comments.

@@ -71,6 +72,10 @@ public class ShareCoordinatorConfig {
public static final int APPEND_LINGER_MS_DEFAULT = 10;
public static final String APPEND_LINGER_MS_DOC = "The duration in milliseconds that the share coordinator will wait for writes to accumulate before flushing them to disk.";

public static final String STATE_TOPIC_PRUNE_INTERVAL = "share.coordinator.state.topic.prune.interval.ms";
Member

This should be STATE_TOPIC_PRUNE_INTERVAL_MS_CONFIG I think. All three of these should have _MS_ as part of the name.
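
Something like the following is presumably what the rename implies (a sketch; the default is taken from the PR description and the doc string is illustrative):

    public static final String STATE_TOPIC_PRUNE_INTERVAL_MS_CONFIG = "share.coordinator.state.topic.prune.interval.ms";
    public static final int STATE_TOPIC_PRUNE_INTERVAL_MS_DEFAULT = 5 * 60 * 1000; // 5 minutes
    public static final String STATE_TOPIC_PRUNE_INTERVAL_MS_DOC = "The duration in milliseconds between attempts to prune eligible records from the share-group state topic.";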

trace("Delete records on local logs to offsets [%s]".format(offsetPerPartition))
offsetPerPartition.map { case (topicPartition, requestedOffset) =>
// reject delete records operation on internal topics
if (Topic.isInternal(topicPartition.topic)) {
// reject delete records operation for internal topics if allowInternalTopicDeletion is false
Member

probably "unless allowInternalTopicDeletion is true" is clearer.

/**
* Delete records from a topic partition until specified offset
* @param tp The partition to delete records from
* @param deleteUntilOffset Offset to delete until, starting from the beginning
Member

I suggest deleteBeforeOffset rather than deleteUntilOffset. The former is clearly non-inclusive, while the latter is a bit more ambiguous. I think the effect we want here is that if I provide offset 10, then offsets up to and including 9 may be deleted, but not 10.

void deleteRecords(
TopicPartition tp,
long deleteUntilOffset,
boolean allowInternalTopicDeletion
Member

This parameter is missing from the javadoc.
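
Putting the two javadoc comments together, the method documentation might read roughly like this (a sketch; the wording is illustrative):

    /**
     * Delete records from a topic partition up to, but not including, the given offset.
     *
     * @param tp                         The partition to delete records from
     * @param deleteBeforeOffset         Offset before which records are deleted, starting from the beginning;
     *                                   the record at this offset itself is retained
     * @param allowInternalTopicDeletion Whether deletion is permitted for internal topics
     */
    void deleteRecords(
        TopicPartition tp,
        long deleteBeforeOffset,
        boolean allowInternalTopicDeletion
    );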

timer.add(new TimerTask(config.shareCoordinatorTopicPruneIntervalMs()) {
@Override
public void run() {
for (int i = 0; i < numPartitions; i++) {
Member

Would It not be the case that each shard should set up the pruning for its owned partitions? Shouldn't the pruning stop when a shard loses leadership of a partition?

Contributor Author

@smjn smjn Dec 3, 2024

No - the shard (state machine) does not maintain that mapping. The information is maintained by the runtime which calls the appropriate shards, based on the context. This information is not exposed outside. We need access to https://github.com/apache/kafka/blob/trunk/coordinator-common/src/main/java/org/apache/kafka/coordinator/common/runtime/CoordinatorRuntime.java#L1877 to expose this information. Even then the shard cannot do this.

The runtime is encapsulated in the ShareCoordinatorService and only it can issue calls to the runtime. The Shard only serves to provide data related to partitions.

Using the loop approach, for a specific internal topic-partition only the correct Shard will honour the request and the others will fail silently with NOT_COORDINATOR.

Flow is

                               ShareCoordinatorShard.callback
                                                     |
                                                     |  
                      add task      task with correct shard         
ShareCoordinatorService ---->   Runtime -----> ==================== ----> EventProcessor
                                   |                   QUEUE
                                   |
                             obtain shard from TP in task

Member

Thanks for the explanation.

Member

I wonder if we should expose a method to get the list of active state machines/shards from the runtime. This would allow you to just iterate on it instead of having to list all the possibilities.

Member

Yes, I've been thinking about this. I'm not entirely comfortable with every broker starting a timer for every partition. I know it's harmless, but it's not exactly elegant.

Contributor Author

@dajac we don't need the coordinators, but the list of topic partitions whose coordinators are ACTIVE at the point of execution of the timer job.

CoordinatorRuntime

    public List<TopicPartition> activeTopicPartitions() {
        return coordinators.entrySet().stream()
            .filter(entry -> entry.getValue().state.equals(CoordinatorState.ACTIVE))
            .map(Map.Entry::getKey)
            .toList();
    }

Caller:

activeTopicPartitions.forEach(tp -> runtime.schedule(tp, ...));

Contributor Author

@AndrewJSchofield @dajac removed the loop in favor of active tps

@smjn smjn requested a review from AndrewJSchofield December 4, 2024 08:29
Comment on lines 276 to 279
if (exception != null) {
log.error("Last redundant offset lookup threw an error.", exception);
return;
}
Member

Here, I suggest handling the known errors such as NOT_COORDINATOR, COORDINATOR_LOADING, etc. As you optimistically send the write operations to all the possible shards, you will get many "unknown" ones.
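
For illustration, the completion handler could treat those expected errors as benign, roughly like this (a sketch; the error names come from org.apache.kafka.common.protocol.Errors and the exact handling in the PR may differ):

    if (exception != null) {
        Errors error = Errors.forException(exception);
        if (error == Errors.NOT_COORDINATOR || error == Errors.COORDINATOR_LOAD_IN_PROGRESS) {
            // Expected when this broker does not lead the partition; nothing to do.
            return;
        }
        log.error("Last redundant offset lookup threw an error.", exception);
        return;
    }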

Contributor Author

I will remove the logging for the above errors - they will happen in every case. Since the timer job is periodic, we do not need any special handling anyway.

timeout = 0L,
offsetPerPartition = Map(tp -> deleteBeforeOffset),
responseCallback = results => deleteResults = results,
allowInternalTopicDeletion
Member

I wonder if we could just set it to true here as this class is only used by the coordinator.

assertEquals(1, manager.curState().size());
verify(manager, times(5)).purge();
}
}
Member

nit: We usually have an empty line at the end.

@smjn smjn changed the title Prune records initial commit. KAFKA-18058: Share group state record pruning impl. Dec 4, 2024
@smjn smjn requested a review from dajac December 4, 2024 11:17
@smjn smjn marked this pull request as ready for review December 4, 2024 11:18
@github-actions github-actions bot added the tools label Dec 10, 2024
Contributor

@junrao junrao left a comment

@smjn : Thanks for the updated PR. Made a pass of all files. A few more comments.

new TopicPartition("random-topic", 0),
10L
).whenComplete { (_, exp) =>
assertTrue(exp.isInstanceOf[IllegalStateException])
Contributor

If a topic doesn't exist, we should get an unknown topic exception.

partition.createLogIfNotExists(isNew = false, isFutureReplica = false,
new LazyOffsetCheckpoints(rm.highWatermarkCheckpoints.asJava), None)

rm.becomeLeaderOrFollower(0, new LeaderAndIsrRequest.Builder(ApiKeys.LEADER_AND_ISR.latestVersion, 0, 0, brokerEpoch,
Contributor

becomeLeaderOrFollower is only used by the ZK-based controller, which won't be supported in 4.0. We need to use the code path for the KRaft-based controller.

List.of(
ShareOffsetTestHolder.TestTuple.instance(KEY1, 10L, Optional.empty()),
ShareOffsetTestHolder.TestTuple.instance(KEY4, 11L, Optional.empty()),
ShareOffsetTestHolder.TestTuple.instance(KEY2, 13L, Optional.empty())
Contributor

This test result is counter intuitive. I'd expect lastRedundantOffset for all three to be 10L since we should be able to truncate the log at offset 10.

Contributor Author

There's a small efficiency gain here, as the offsets are applied to the updateState method in increasing order (partition record offsets auto-increment). If 10L happens to be the smallest one, it means there are no offsets smaller than that present in the topic partition, so returning 10L would be of no consequence, and skipping it saves an extra deleteRecords call.

In the algorithm we have chosen to require at least 2 offsets for any key before exposing the redundant offset.

Also, since we expose the offset only once and then set the boolean flag, the 2nd and 3rd lines should be empty.

),

new ShareOffsetTestHolder(
"redundant state cold partition",
Contributor

What does cold partition mean?

Contributor Author

infrequently written

@@ -274,6 +290,7 @@ public void testReadStateSuccess() throws ExecutionException, InterruptedExcepti
)))
);


Contributor

extra new line

@Test
public void testRecordPruningTaskPeriodicityWithAllSuccess() throws Exception {
CoordinatorRuntime<ShareCoordinatorShard, CoordinatorRecord> runtime = mockRuntime();
org.apache.kafka.server.util.MockTime time = new org.apache.kafka.server.util.MockTime();
Contributor

Could we import MockTime?

* After returning the value once, the redundant offset is reset.
* @return Optional of type Long representing the offset or empty for invalid offset values
*/
public Optional<Long> lastRedundantOffset() {
Contributor

This method is OK, but it is very customized for the usage in the current only caller. If there is another caller, it could unexpectedly break the existing caller. A better API is probably to always expose the lastRedundantOffset and let the caller handle the case where the same value is returned.
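
A sketch of the suggested shape (a fragment; it assumes the timeline field is renamed to lastRedundantOffset as suggested later in this review, and the guard mirrors the checks discussed below):

    // Idempotent read: always return the current last redundant offset; the caller is
    // responsible for ignoring a value it has already pruned to.
    public Optional<Long> lastRedundantOffset() {
        long offset = lastRedundantOffset.get();
        if (offset <= 0 || offset == Long.MAX_VALUE) {
            return Optional.empty();
        }
        return Optional.of(offset);
    }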

@smjn
Contributor Author

smjn commented Dec 10, 2024

@smjn : Thanks for the updated PR. Made a pass of all files. A few more comments.

@junrao Thanks again for the review, incorporated all comments.

@smjn smjn requested a review from junrao December 10, 2024 21:37
Contributor

@junrao junrao left a comment

@smjn : Thanks for the updated PR. A few more comments.

public ShareCoordinatorOffsetsManager(SnapshotRegistry snapshotRegistry) {
Objects.requireNonNull(snapshotRegistry);
offsets = new TimelineHashMap<>(snapshotRegistry, 0);
minOffset = new TimelineLong(snapshotRegistry);
Contributor

minOffset => lastRedundantOffset ?

minOffset.set(Math.min(minOffset.get(), offset));
offsets.put(key, offset);

Optional<Long> deleteTillOffset = findRedundantOffset();
Contributor

deleteTillOffset => redundantOffset ?

if (result.isPresent()) {
Long off = result.get();
// Guard and optimization.
if (off == Long.MAX_VALUE || off <= 0) {
Contributor

This test seems redundant since ShareCoordinatorOffsetsManager.lastRedundantOffset does that already.

@@ -240,9 +251,96 @@ public void startup(

log.info("Starting up.");
numPartitions = shareGroupTopicPartitionCount.getAsInt();
Map<TopicPartition, Long> offsets = new ConcurrentHashMap<>();
Contributor

  1. offsets => lastPrunedOffsets?
  2. Would it be better to make that an instance val so that we don't have to pass it around?
  3. Should we remove entries when onResignation is called?

@@ -6660,6 +6661,61 @@ class ReplicaManagerTest {
}
}

@Test
def testDeleteRecordsInternalTopicDeleteDisallowed(): Unit = {
Contributor

There are lots of existing usages of rm.becomeLeaderOrFollower. It would be useful to clean them up in a follow-up jira.

Contributor Author

@smjn smjn Dec 11, 2024

perhaps

private CompletableFuture<Void> performRecordPruning(TopicPartition tp, Map<TopicPartition, Long> offsets) {
// This future will always be completed normally, exception or not.
CompletableFuture<Void> fut = new CompletableFuture<>();
runtime.scheduleWriteOperation(
Contributor

This call doesn't do any writes in runtime. Should we use scheduleReadOperation? Similarly, I also don't understand why readState calls runtime.scheduleWriteOperation.

Contributor Author

@smjn smjn Dec 11, 2024

No, the write operation used here is for the write consistency offered by the method.

The ShareCoordinatorShard.replay calls offsetsManager.updateState with various last written offset values. The replay method itself is called when other write RPCs produce records. However, it does not mean the offset set in replay has been committed.

Now, the coordinator enqueues the write operations in a queue and guarantees that when the scheduleWriteOperation completes, the records it generated have been replicated, even those which were written before it.

The framework, however, gives no consistency guarantees between write and read operations. Consider a write op writing an offset into the offsets manager: we only know that this offset has been written, not that it has been replicated. A subsequent read could give us back the same offset, but there would still be no guarantee that it has been replicated. It is only when the next write operation completes that we have a guarantee that the previous offset has been committed.

This was extensively discussed with the coordinator framework owner @dajac and we arrived at this solution.

fut.complete(null);
// Update offsets map as we do not want to
// issue repeated deleted
offsets.put(tp, off);
Contributor

Since this is called from the purgatory thread, it's possible that when this call occurs the partition leader has already resigned, and we need to handle that accordingly.

Contributor Author

@smjn smjn Dec 11, 2024

I don't foresee any consistency issues with that - at most a repeated delete call might be made, which is acceptable since the frequency of these calls is very low (minutes).
Whoever the leader is, the partition offsets will not change.

@smjn
Contributor Author

smjn commented Dec 11, 2024

@smjn : Thanks for the updated PR. A few more comments.

@junrao Thanks for the review, incorporated few changes. Replied to some suggestions.

@smjn smjn requested a review from junrao December 11, 2024 05:13
Contributor

@junrao junrao left a comment

@smjn : Thanks for the updated PR. Just one more comment.

@@ -543,6 +629,7 @@ public void onElection(int partitionIndex, int partitionLeaderEpoch) {
@Override
public void onResignation(int partitionIndex, OptionalInt partitionLeaderEpoch) {
throwIfNotActive();
lastPrunedOffsets.clear();
Contributor

We should only clear the entry for partitionIndex, right?

Contributor Author

Yes, makes sense
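
Something along these lines, presumably (a sketch; building the map key from Topic.SHARE_GROUP_STATE_TOPIC_NAME and partitionIndex is an assumption about how lastPrunedOffsets is keyed):

    @Override
    public void onResignation(int partitionIndex, OptionalInt partitionLeaderEpoch) {
        throwIfNotActive();
        // Forget the pruned offset only for the partition that lost leadership,
        // rather than clearing the whole map.
        lastPrunedOffsets.remove(new TopicPartition(Topic.SHARE_GROUP_STATE_TOPIC_NAME, partitionIndex));
    }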

@smjn
Contributor Author

smjn commented Dec 11, 2024

@smjn : Thanks for the updated PR. Just one more comment.

@junrao Thanks for the review, incorporated change

@smjn smjn requested a review from junrao December 11, 2024 07:20
Contributor

@junrao junrao left a comment

@smjn : Thanks for the updated PR. LGTM. @AndrewJSchofield and @dajac, any other comments from you?

@dajac
Member

dajac commented Dec 11, 2024

@junrao No. All good to me. Thanks!

@AndrewJSchofield
Member

@junrao Thanks for the review. I'll take another look this evening and merge once I'm happy.

Member

@AndrewJSchofield AndrewJSchofield left a comment

A handful of final review comments. I've been running with the pruning enabled and it's happily deleting records as intended.

if (result.isPresent()) {
Long off = result.get();

if (lastPrunedOffsets.containsKey(tp) && Objects.equals(lastPrunedOffsets.get(tp), off)) {
Member

I suggest lastPrunedOffsets.get(tp).longValue() == off. Using the generic object equality method seems odd for just a pair of longs.

Time time) {
Time time,
Timer timer,
PartitionWriter writer) {
Member

nit: New line please so the arguments and the method body do not run into each other.

log.info("Startup complete.");
}

private void setupRecordPruning() {
log.info("Scheduling share state topic prune job.");
Member

"share-group state topic" is what we use in most places.

CompletableFuture.allOf(futures.toArray(new CompletableFuture[]{}))
.whenComplete((res, exp) -> {
if (exp != null) {
log.error("Received error in share state topic prune.", exp);
Member

share-group state topic

Long off = result.get();

if (lastPrunedOffsets.containsKey(tp) && Objects.equals(lastPrunedOffsets.get(tp), off)) {
log.debug("{} already pruned at offset {}", tp, off);
Member

You've used "till" in most places, so I'd replace the "at" here too.

@@ -50,6 +50,7 @@ private static Map<String, String> testConfigMapRaw() {
configs.put(ShareCoordinatorConfig.LOAD_BUFFER_SIZE_CONFIG, "555");
configs.put(ShareCoordinatorConfig.APPEND_LINGER_MS_CONFIG, "10");
configs.put(ShareCoordinatorConfig.STATE_TOPIC_COMPRESSION_CODEC_CONFIG, String.valueOf(CompressionType.NONE.id));
configs.put(ShareCoordinatorConfig.STATE_TOPIC_PRUNE_INTERVAL_MS_CONFIG, "30000"); // 30 seconds
Member

This class doesn't contain any tests, so calling it XYZTest is peculiar. Please rename to ShareCoordinatorTestUtils or similar.

@smjn
Contributor Author

smjn commented Dec 11, 2024

A handful of final review comments. I've been running with the pruning enabled and it's happily deleting records as intended.

@AndrewJSchofield Thanks for the review, incorporated comments.

}
fut.complete(null);
// Best effort prevention of issuing duplicate delete calls.
lastPrunedOffsets.put(tp, off);
Contributor

This approach is OK, but it breaks the pattern in CoordinatorRuntime that all internal state is updated by CoordinatorRuntime threads, since lastPrunedOffsets is now updated by the request I/O threads. We probably could follow the approach of how CoordinatorRuntime waits for a record to be replicated: it appends the record to the local log and registers a PartitionListener for onHighWatermarkUpdated(). Once a new HWM is received, CoordinatorRuntime.onHighWatermarkUpdated enqueues a CoordinatorInternalEvent, the processing of which triggers the update of the internal state. We could follow a similar approach for the low watermark. This can be done in a follow-up jira.

Contributor Author

While my initial approach was to add this functionality in CoordinatorRuntime, the requirement wasn't general enough to justify modifying the runtime.

Member

I'm sure this is not the final state of this code, but it's definitely a fine starting point.

@AndrewJSchofield AndrewJSchofield merged commit 4c5ea05 into apache:trunk Dec 12, 2024
13 checks passed
peterxcli pushed a commit to peterxcli/kafka that referenced this pull request Dec 18, 2024
In this PR, we've added a class ShareCoordinatorOffsetsManager, which tracks the last redundant offset for each share group state topic partition. We have also added a periodic timer job in ShareCoordinatorService which queries for the redundant offset at regular intervals and if a valid value is found, issues the deleteRecords call to the ReplicaManager via the PartitionWriter. In this way the size of the partitions is kept manageable.

Reviewers: Jun Rao <[email protected]>, David Jacot <[email protected]>, Andrew Schofield <[email protected]>
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
In this PR, we've added a class ShareCoordinatorOffsetsManager, which tracks the last redundant offset for each share group state topic partition. We have also added a periodic timer job in ShareCoordinatorService which queries for the redundant offset at regular intervals and if a valid value is found, issues the deleteRecords call to the ReplicaManager via the PartitionWriter. In this way the size of the partitions is kept manageable.

Reviewers: Jun Rao <[email protected]>, David Jacot <[email protected]>, Andrew Schofield <[email protected]>