Tips for slow compact on a large bucket with large blocks #4434
-
Hello, I use Thanos with a rather large bucket (Ceph object store) - 10TB total. I store metrics at the raw resolution with 30 days retention, downsampling disabled. Here's my systemd daemon for it:
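A minimal sketch of roughly what that unit's ExecStart would run, with paths and flag values assumed from the description above (30-day raw retention, downsampling disabled, continual mode) rather than taken from the actual unit file:

```bash
# Illustrative compactor invocation; paths and values are assumptions.
# --wait keeps the compactor running continually; --downsampling.disable and
# --retention.resolution-raw=30d match the retention policy described above.
/usr/local/bin/thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --downsampling.disable
```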
The daemon runs in continual mode, but recently it has been slow to complete its compaction runs: it doesn't reach the "delete blocks" part of the run until it's too late and the storage bucket is overflowing (related issue: #2605). I recently ran a compaction without the daemon: I have a locally compiled binary of Thanos which only runs block deletions on blocks marked for deletion, described here: #2605 (comment). One thing I can do is:
What I'm looking for are tips or solutions for making this better.
Here are the metrics from the compact instance:

Any tips would be appreciated, thanks!
-
The correct link is now: https://thanos.io/tip/thanos/sharding.md/#compactor

About scalability: https://thanos.io/tip/components/compact.md/#scalability

There are various features for improving compactor performance; this umbrella issue tracks them: #4233. Your first issue link is/should be resolved by #3115.

That said, I'm not sure you are really hitting limits that couldn't already be resolved by tweaking your setup. So I'm curious which Thanos version you are using and whether you can share some stats for it (i.e. CPU & memory usage). Did you also set limits on those resources? Could you also give us a rough number of series per 2-hour block? …

If you want to run multiple compactors, you could look into the external labels. As per the docs: "This allows to assign multiple streams to each instance of compactor." For example, for the store component one could use a relabel config along the lines of the sketch below (don't use this for the compactor!).
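A minimal sketch of such a store-sharding config, assuming the store gateway's `--selector.relabel-config` flag and the special `__block_id` label (label names, modulus, and paths here are illustrative):

```bash
# Illustrative only: shard a store gateway by hashing each block ID and
# keeping one of two shards. Blocks of the same stream end up on different
# shards, which is exactly why this must not be used for the compactor.
thanos store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --selector.relabel-config='
    - action: hashmod
      source_labels: ["__block_id"]
      target_label: shard
      modulus: 2
    - action: keep
      source_labels: ["shard"]
      regex: "0"
    '
```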
Yet this should not be used for the compactor, as it is not 'pinned' to a specific stream; it merely splits all the data over multiple shards. For the compactor you want some form of relabel config that does a regex match on the streams (i.e. on their external labels), for instance as sketched below.
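As a sketch, assuming an external label named `cluster` (the label name and regex are made up for this example), each compactor instance could keep only its own streams:

```bash
# Illustrative only: pin this compactor instance to the streams whose external
# "cluster" label matches the regex; other instances use complementary regexes
# so every stream is owned by exactly one compactor.
thanos compact \
  --wait \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --selector.relabel-config='
    - action: keep
      source_labels: ["cluster"]
      regex: "eu1-.*"
    '
```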
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
-
I am considering the same thing and I think you already gave the answer @sevagh 😄.
-
Thanks for the replies. @yeya24 I'm reading this PR you recently got merged, and I think it might help me: https://github.com/thanos-io/thanos/pull/4239/files/15acd8c8683c8ecc785ec71e4c16f89738e839b6#diff-59764a4da653d4464eac20465390033ab8abbd8b54688979727065cb389e848d

One of my issues with Ceph + Thanos is that I have 2x Prometheus pollers in a typical HA setup, so I store 2x copies of each TSDB block (slightly different due to natural differences between the two pollers). It looks like the offline deduplication you added, with the "penalty" mode intended for HA Prometheus, would shrink my Ceph bucket by roughly 50% by combining these 2x HA blocks?
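For reference, a rough sketch of how that penalty-based offline deduplication could be enabled on the compactor, assuming Thanos v0.22+ and that the two HA pollers differ only by a `replica` external label (the label name is an assumption):

```bash
# Illustrative only: vertical compaction plus penalty-based offline dedup,
# so the two HA copies of each block get merged into one.
# Replace "replica" with whatever label distinguishes the two pollers.
thanos compact \
  --wait \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --compact.enable-vertical-compaction \
  --deduplication.replica-label=replica \
  --deduplication.func=penalty
```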
-
That
-
@sevagh Let me move this to discussion as it is generally a question, not an issue.
-
So, I finally had the chance to upgrade Thanos from 0.16.0-rc0 (I installed this almost a year ago, I think) to 0.22.0-rc0. Huge benefits! I'm pleased.
All of my Thanos components use the