Prometheus iptables rules metric only counts programmed rules. #9374

aaaaaaaalex · 2024-10-22T10:57:23Z

Description

Related issues/PRs

Todos

Tests
Documentation
Release note

Release Note

The `felix_iptables_rules` Prometheus metric now counts only rules within referenced iptables chains; it no longer counts candidate rules.
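To illustrate the accounting described in the release note, here is a minimal sketch of a gauge that moves only when a chain becomes referenced or unreferenced. The `ruleCounter` type, its methods, and the chain name are hypothetical and are not Felix's actual data structures; the `gauge` field stands in for the real Prometheus gauge.

```go
package main

import "fmt"

// ruleCounter is a hypothetical stand-in for the table's bookkeeping.
// It is NOT Felix's real implementation.
type ruleCounter struct {
	rules map[string]int // chain name -> number of rules in the candidate chain
	refs  map[string]int // chain name -> number of references to the chain
	gauge int            // stand-in for the Prometheus gauge value
}

func newRuleCounter() *ruleCounter {
	return &ruleCounter{rules: map[string]int{}, refs: map[string]int{}}
}

// setChain records or replaces a candidate chain. The gauge changes only
// if the chain is currently referenced.
func (c *ruleCounter) setChain(name string, numRules int) {
	if c.refs[name] > 0 {
		c.gauge += numRules - c.rules[name]
	}
	c.rules[name] = numRules
}

// incRef adds a reference; the first reference makes the chain's rules count.
func (c *ruleCounter) incRef(name string) {
	c.refs[name]++
	if c.refs[name] == 1 {
		c.gauge += c.rules[name]
	}
}

// decRef drops a reference; dropping the last one removes the chain's rules
// from the gauge.
func (c *ruleCounter) decRef(name string) {
	if c.refs[name] == 0 {
		return
	}
	c.refs[name]--
	if c.refs[name] == 0 {
		c.gauge -= c.rules[name]
	}
}

func main() {
	c := newRuleCounter()
	c.setChain("cali-example", 3) // candidate only, not referenced yet
	fmt.Println(c.gauge)          // 0
	c.incRef("cali-example")
	fmt.Println(c.gauge) // 3
	c.decRef("cali-example")
	fmt.Println(c.gauge) // 0
}
```

The point of the sketch is that a candidate chain contributes nothing until something references it, which is the behaviour the metric is meant to have after this change.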

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

docs-pr-required: This change requires a change to the documentation that has not been completed yet.
docs-completed: This change has all necessary documentation completed.
docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

release-note-required: This PR has user-facing changes. Most PRs should have this label.
release-note-not-required: This PR has no user-facing changes.

Other optional labels:

cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.


fasaxc

The ref counting looks correct 🎉 some comments on the test.

Thinking about it, you could have tested this at UT level pretty effectively if there's a way to get the prom gauge. Then you'd be in control of the exact number of rules and wouldn't need approximates. Not bad to have a bit more coverage of the real gauges though so I'm OK with FV too.

fasaxc · 2024-11-08T09:43:04Z

felix/fv/pre_dnat_test.go

@@ -200,6 +206,87 @@ var _ = infrastructure.DatastoreDescribe("pre-dnat with initialized Felix, 2 wor
 				})

 				Context("with pre-DNAT policy to open pinhole to 8055", func() {
+					// Exec iptables-save, or iptables-save -t [table], if table is not "".
+					dumpIPTables := func(table string) []string {


I think we have utility methods for getting chains on the Felix object.

For example felix.IPTablesChains(table) ...

fasaxc · 2024-11-08T10:07:13Z

felix/fv/pre_dnat_test.go

+						// The number of cali rules in the dumped tables should increase by the same delta as the metrics
+						Eventually(dumpMangleTable).Should(HaveLen(len(baselineMangleTableIptablesSave) + mangleRulesMetricDelta))
+						Eventually(dumpFilterTable).Should(HaveLen(len(baselineFilterTableIptablesSave) + filterRulesMetricDelta))
+					})


I recommend doing some more steps after this:

Change the type of the policy from pre-DNAT to normal. That's the most fiddly change to handle since it goes from referenced in one table to referenced in the other.

Delete the policy, should go back to the count with the hep.

Delete the hep, should get back to the base count.

fasaxc · 2024-11-08T10:14:00Z

felix/fv/pre_dnat_test.go

+					// Returns nil when the value of f() hasn't changed for the duration settledPeriod.
+					// Returns error if f() never settles up to the timeout.
+					// Immediately returns error if timeout is lower than settledPeriod.
+					waitToSettle := func(f func() int, settledPeriod, timeout time.Duration) error {


As discussed on Slack, I would prefer not to rely on settling if there's something explicit that we can wait for. It's not a bad approach, it's just a bit slow (and if it flakes, you have to bump the settled timeout, so it gets slower...)

If we do go down this path, I suggest making it a general utility method. Kubernetes has a wait package with similar utility funcs in it.

fasaxc · 2024-11-08T10:16:08Z

felix/fv/pre_dnat_test.go

+								return errors.New("Timed out waiting for IPTables to settle.")
+							default:
+								last = cur
+								time.Sleep(1 * time.Second)


I typically reach for 100ms for a sleep loop. We don't want to tight loop but probing 10x per second isn't normally a big deal.

fasaxc · 2024-11-08T10:33:07Z

felix/fv/pre_dnat_test.go

+									settledTimer = nil
+								} else {
+									select {
+									case <-settledTimer.C:


With the way you're using the settledTimer, you're not actually using it as a timer at all! You only select on it in a non-blocking way. I think this code would be clearer if you just recorded the time since the last change and did

    last := f()
    lastChange := time.Now()
    for {
        cur := f()
        if cur != last {
            lastChange = time.Now() // assign with =, so the outer variable is not shadowed
        }
        last = cur
        if time.Since(lastChange) > settledPeriod {
            // we're done
        }
        ...
    }

fasaxc · 2024-11-08T10:39:53Z

felix/fv/pre_dnat_test.go

+								}
+							}
+							select {
+							case <-timeoutTimer.C:


Using a sleep and a timeout channel is slightly mixing paradigms. While you're asleep you're not checking the timeout channel, which could be a bug (or not matter at all). In this case, it means that, if you just miss a timeout, it'll be >1s before you check the timeout again and you'll do an extra poll of f(); not clear if that was intended (it probably doesn't matter for this use case).

Best to jump one way or the other to make your intent clear. Either:

Use Sleep() and record the start time and do if time.Since(startTime) > timeoutPeriod { ... timeout ...}

Use channels for the sleep. case <-time.After(delay): ... so that the sleep can be interrupted.

If you also wanted the settled timer to interrupt the sleep (not clear that you do, since you probably want to run f() one more time before declaring it settled), then... A powerful technique is to mask channels in your main select block. To mask a channel, set it to nil; that prevents its case from firing. So you could do:

    var settledC <-chan time.Time // nil channel: its case never fires
    for {
        select {
        case <-time.After(time.Second):
            cur := f()
            if cur == last {
                if settledC == nil {
                    settledC = time.After(...)
                }
            } else {
                last = cur
                settledC = nil
            }
        case <-settledC:
            // done!
        case <-timeoutC:
            // failed
        }
    }

aaaaaaaalex requested a review from a team as a code owner October 22, 2024 10:57

marvin-tigera added this to the Calico v3.30.0 milestone Oct 22, 2024

marvin-tigera added the release-note-required and docs-pr-required labels Oct 22, 2024

aaaaaaaalex force-pushed the bug/iptables-overaccounting branch from 55e5cb8 to 0efef73 Compare October 22, 2024 13:19

amend gaugeNumRules value when a chain becomes referenced/unreferenced

582fb99

aaaaaaaalex force-pushed the bug/iptables-overaccounting branch from 0efef73 to 582fb99 Compare October 29, 2024 14:15

aaaaaaaalex added 2 commits November 7, 2024 15:05

add failing FV

ef6a457

measure the correct table.gp map for determining number of rules prog…

dd6824f

…rammed

aaaaaaaalex changed the title ~~only amend iptables-rules prom gauge when a chain is inc/dec-ref'd~~ Prometheus iptables rules metric only counts programmed rules. Nov 7, 2024

fasaxc requested changes Nov 8, 2024
