AWS CNI failed to add chain rule for each CIDR in VPC with nf_tables mode #2373

Closed
HenryXie1 opened this issue May 8, 2023 · 10 comments

HenryXie1 commented May 8, 2023

What happened:
After adopting the Red Hat Enterprise Linux 8.7 (Ootpa) 4.18.0-425.19.2.el8_7.x86_64 AMI (08d4770d37a0b97a1), which enables nf_tables, all AWS CNI pods are stuck pending with:

time="2023-04-17T03:17:29Z" level=info msg="Updating iptables mode to nft"
Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-v4-cni
time="2023-04-17T03:17:29Z" level=info msg="Starting IPAM daemon... "
time="2023-04-17T03:17:29Z" level=info msg="Checking for IPAM connectivity... "
time="2023-04-17T03:17:30Z" level=info msg="Copying config file... "
time="2023-04-17T03:17:30Z" level=info msg="Successfully copied CNI plugin binary and config file."
time="2023-04-17T03:17:30Z" level=error msg="Failed to wait for IPAM daemon to complete" error="exit status 1"

The node's ipamd log shows this error:

{"level":"error","ts":"2023-04-27T08:27:59.957Z","caller":"networkutils/network.go:368",
"msg":"host network setup: failed to add nat/AWS-SNAT-CHAIN-14 rule [14] AWS-SNAT-CHAIN shouldExist true rule [
! -d 100.64.0.0/16 -m comment --comment AWS SNAT CHAIN -j AWS-SNAT-CHAIN-15], running [/usr/sbin/iptables -t nat -A AWS-SNAT-CHAIN-14 ! -d 100.64.0.0/16 -m comment --comment AWS SNAT CHAIN -j AWS-SNAT-CHAIN-15 --wait]: exit status 4: iptables v1.8.4 (nf_tables): RULE_APPEND failed (Too many links): rule in chain AWS-SNAT-CHAIN-14\n"}

AWS CNI adds a chain and a jump rule for each CIDR in the VPC, and nf_tables has a hard jump-stack limit, NFT_JUMP_STACK_SIZE = 16, defined in the kernel (e.g. https://codebrowser.dev/linux/linux/include/net/netfilter/nf_tables.h.html#22). So for VPCs with 15 or more CIDRs, the CNI cannot attach all of its rules under nf_tables, which causes the error above.
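For illustration, the layout ipamd builds can be approximated with the standalone sketch below; it mirrors the command in the error message above, but the TEST-SNAT-CHAIN-* names and 100.64.N.0/24 CIDRs are placeholders, not the CNI's real values. Because every chain holds a single rule that jumps to the next chain, the jump nesting depth grows with the number of VPC CIDRs, and on an nf_tables-backed iptables the append eventually fails with the same "Too many links" error:

#!/usr/bin/env bash
# Hedged sketch of the daisy-chained SNAT layout. Run as root on a throwaway
# test host only, since it modifies the nat table; chain names and CIDRs are
# placeholders.
set -u
iptables -t nat -N TEST-SNAT-CHAIN-0
iptables -t nat -A POSTROUTING -m comment --comment "TEST SNAT CHAIN" -j TEST-SNAT-CHAIN-0
for i in $(seq 0 16); do
  next=$((i + 1))
  iptables -t nat -N "TEST-SNAT-CHAIN-${next}"
  # One chain per CIDR, one jump per chain: the nesting depth reachable from
  # POSTROUTING grows by one each iteration, so this append is expected to
  # start failing once the kernel's jump-stack limit is reached.
  iptables -t nat -A "TEST-SNAT-CHAIN-${i}" ! -d "100.64.${i}.0/24" \
    -m comment --comment "TEST SNAT CHAIN" -j "TEST-SNAT-CHAIN-${next}" \
    || echo "append in TEST-SNAT-CHAIN-${i} failed (exit $?)"
done
# Clean up by deleting the POSTROUTING jump, then flushing and deleting every
# TEST-SNAT-CHAIN-* chain.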

Attach logs

What you expected to happen:
AWS CNI in nf_tables mode should be aware of the hard NFT_JUMP_STACK_SIZE = 16 limit and avoid creating such a long chain of jumps.

How to reproduce it (as minimally and precisely as possible):

  1. Associate 16 or more CIDR blocks with the VPC (see the sketch after this list)
  2. Use RHEL 8 as the AMI
  3. Install the VPC CNI DaemonSet and set nf_tables mode
  4. The aws-node pods fail to start
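A minimal sketch of steps 1 and 4, assuming an existing VPC and an EKS cluster whose worker nodes run the RHEL 8 AMI; the VPC ID and extra CIDR blocks below are placeholders:

# Step 1: associate additional IPv4 CIDR blocks with the VPC until it has 16+
# (vpc-0123456789abcdef0 and the CIDR values are placeholders).
for cidr in 100.66.0.0/16 100.67.0.0/16 100.68.0.0/16; do
  aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --cidr-block "$cidr"
done

# Step 4: the aws-node pods fail to become ready; the container log shows the
# entrypoint failure, and /var/log/aws-routed-eni/ipamd.log on the node shows
# the RULE_APPEND "Too many links" error.
kubectl -n kube-system get pods -l k8s-app=aws-node
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-node --tail=50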

Anything else we need to know?:
[Redacted]

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.9", GitCommit:"a1a87a0a2bcd605820920c6b0e618a8ab7d117d4", GitTreeState:"clean", BuildDate:"2023-04-12T12:16:51Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"darwin/amd64"}
    Kustomize Version: v4.5.7
    Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.17-eks-ec5523e", GitCommit:"d9e9b09276855a532739ef8cb728194aa145430b", GitTreeState:"clean", BuildDate:"2023-03-20T18:46:36Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

  • CNI Version: v1.12.2

  • OS (e.g: cat /etc/os-release):
    [root@ANL10230842 ~]# cat /etc/os-release
    NAME="Red Hat Enterprise Linux"
    VERSION="8.7 (Ootpa)"
    ID="rhel"
    ID_LIKE="fedora"
    VERSION_ID="8.7"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux 8.7 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
    HOME_URL="https://www.redhat.com/"
    DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"

    REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
    REDHAT_BUGZILLA_PRODUCT_VERSION=8.7
    REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="8.7"

  • Kernel (e.g. uname -a): 4.18.0-425.19.2.el8_7.x86_64
@HenryXie1 HenryXie1 added the bug label May 8, 2023
jdn5126 (Contributor) commented May 8, 2023

Hi @HenryXie1, thanks for raising this issue. Our support for RHEL is best-effort, and attaching 15 CIDRs to a VPC is definitely not a recommended design pattern, so this is a case we have not seen before. To support this, we can restructure this code (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/networkutils/network.go#L427) to have a chain per CIDR and one jump per chain.
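As a sketch of what that restructuring could look like (an illustration only, not the actual patch; the CIDRs and the --to-source address are placeholders), the same "skip SNAT for any in-VPC destination" logic can be expressed with a constant jump depth, for example by keeping one rule per CIDR in a single chain that RETURNs early:

# Flattened layout sketch: the jump depth from POSTROUTING stays at 1 no matter
# how many CIDRs the VPC has. CIDR values and --to-source are placeholders.
iptables -t nat -N AWS-SNAT-CHAIN-0
iptables -t nat -A POSTROUTING -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-0
for cidr in 10.0.0.0/16 100.64.0.0/16 100.65.0.0/16; do
  # Traffic destined to any VPC CIDR returns to POSTROUTING without being SNATed.
  iptables -t nat -A AWS-SNAT-CHAIN-0 -d "$cidr" -m comment --comment "AWS SNAT CHAIN" -j RETURN
done
# Everything else is SNATed to the node's primary IP (placeholder address).
iptables -t nat -A AWS-SNAT-CHAIN-0 -m comment --comment "AWS, SNAT" -j SNAT --to-source 10.0.1.5

The per-CIDR negated jumps in the current layout and the per-CIDR RETURNs above are logically equivalent, but the latter never nests user chains, so it stays well below NFT_JUMP_STACK_SIZE.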

Also, I redacted the account information from your post above. You definitely do not want to share that information on a public forum. I filed a support request to have it permanently removed.

@aws aws deleted a comment from HenryXie1 May 8, 2023
HenryXie1 (Author) commented:

Hey @jdn5126
Thanks for your prompt help. Can you provide an ETA for this fix?

jdn5126 (Contributor) commented May 9, 2023

> Hey @jdn5126, thanks for your prompt help. Can you provide an ETA for this fix?

Sorry, I can't provide any ETA on this, as we will have to figure out internally how to prioritize it. In the meantime, we are happy to take PRs if anyone is interested in taking up the fix.

github-actions bot commented Jul 9, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jul 9, 2023
@jdn5126 jdn5126 removed the stale Issue or PR is stale label Jul 11, 2023
OverStruck commented:

ETA plz

jdn5126 (Contributor) commented Nov 21, 2023

> ETA plz

We are currently defining a solution for this. No ECD (estimated completion date) at this time.

jdn5126 (Contributor) commented Dec 26, 2023

Closing as fixed by #2697. This will ship in release v1.16.1 or v1.17.0.

@jdn5126 jdn5126 closed this as completed Dec 26, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
