Pivots to SCOS fail due to newer ext4 features enabled in FCOS #2041
Comments
Hi, Not sure if you saw either of these... https://okd.io/blog/2024/06/01/okd-future-statement/ So, you'll want to try an install of 4.16. Are you writing the ignition yourself? As the error notes, there is a mismatch of versions. |
Hi @JaimeMagiera, I've not seen this. The ignition files are written by |
I found releases here: https://github.com/okd-project/okd-scos/releases
How do I download the v4.16 SCOS disk image for a bare-metal installation? |
Sorry for the confusion. That repository for OKD-SCOS is not relevant here. The current state of OKD is that the nodes start as FCOS, and boot into SCOS using rpm-ostree after the installer runs. You can use fedora-coreos-39.20231101.3.0 for your nodes. Currently, there are only nightly builds of OKD SCOS. We haven’t signed off on a GM release yet. We’re getting close and actually could use your help testing. Our ability to test bare-metal has been limited. A nightly that has passed E2E testing is here… Let us know how it goes. Thanks. |
Thank you for the link. But getting the installer and client looks very weird, no direct links to archives:
Imagine I have a new host and no Back to the actual problems, I managed to install a bootstrap node and 1 master node on bare-metal.
On the master node
Additionally, I was not able to install the second master node because all required containers (like
I came across the same behaviour on v4.15. |
In terms of the chicken/egg situation, we have a new Community Testing page with a link to the oc binaries. https://okd.io/docs/community/community-testing/#getting-started |
Can you walk me through the process you're following to install OKD? I feel like there's something missing. Also, what is your bare-metal configuration? |
Seems I found something:
i.e. Bare metal config: |
Checking my servers, SCOS has:
But FCOS (I have another bare-metal server installed with v4.15 & Fedora CoreOS):
That's the reason why I didn't have any issues with |
The same results on v4.17. |
Just to confirm, you're starting with fedora-coreos-39.20231101.3.0 as I suggested above? |
Yes, I am. |
This may be an issue specific to RAID devices. We're looking into it. Can you try on a cluster with non-RAID storage? |
This is software RAID configured by OKD during the installation like:
Let me see whether I can try a single-node, non-RAID installation somewhere. |
Following this procedure https://docs.okd.io/4.16/installing/installing_sno/install-sno-installing-sno.html (single node installation, no RAID configured), I can't even install OKD 4.16.0-0.okd-scos-2024-09-24-151747.
|
@alrf, it is not your fault, the SCOS transition is in progress. I did a little research to clarify the current status of OKD-SCOS, and it is (historically) related to:
In short, since OKD 4.16, during the initial bootstrap, the installer uses FCOS as the base system and rebases it to SCOS using rpm-ostree (because the OKD Working Group does not have enough capacity to build and ship an SCOS live ISO directly). OpenShift, on the other hand, uses RHCOS as its base system and does not need to pivot the image as we do. This step is critical because FCOS does not include oc/podman/kubelet, but SCOS and RHCOS do. The blocking issue here is the FCOS->SCOS pivot: release-image-pivot.service. Reviewing the Until OKD 4.15, the installer pulled the RPM files for oc, kubectl and kubelet and installed them in the classic way
Since OKD 4.16, the RPM files have been replaced by the stream-coreos / stream-coreos-extensions ostree:
|
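A minimal sketch of how one might check which side of the FCOS->SCOS pivot a node is on. It assumes FCOS reports `ID=fedora` and SCOS reports `ID=centos` in `/etc/os-release` (the `centos` value mirrors the check in the patched bootstrap-pivot.sh posted later in this thread). The function name `node_os` is hypothetical and not part of any OKD tooling.

```shell
#!/usr/bin/env bash
# Sketch only: classify a node by the ID= field of an os-release file.
# Assumption: ID=fedora means the node is still on FCOS, ID=centos means SCOS.

node_os() {
  local file="${1:-/etc/os-release}"
  local id
  # Take the first ID= line and strip optional double quotes.
  id="$(sed -n 's/^ID=//p' "$file" | tr -d '"' | head -n 1)"
  case "$id" in
    fedora) echo "fcos" ;;
    centos) echo "scos" ;;
    *)      echo "unknown" ;;
  esac
}

# Usage on a node (reads /etc/os-release by default):
#   node_os
```

Running it before and after the first reboot would show whether release-image-pivot.service actually moved the node to SCOS.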
@BeardOverflow thank you for the clarifications, but currently it is absolutely unclear how to deal with all these issues and what OKD version to install if needed. |
We do not currently have a final 4.16 release. Several nightly builds were dropped into the stable channel a few weeks ago because they passed AWS and vSphere IPI end-to-end tests (which is how OKD used to be released). We need more testing from the community to find issues outside of IPI installs on the above platforms. |
@alrf Use the following patch to fix cat original.ign | jq 'walk(if type == "object" and .path == "/usr/local/bin/bootstrap-pivot.sh" then .contents.source = "data:text/plain;charset=utf-8;base64,IyEvdXNyL2Jpbi9lbnYgYmFzaApzZXQgLWV1byBwaXBlZmFpbAoKIyBFeGl0IGVhcmx5IGlmIHBpdm90IGlzIGF0dGVtcHRlZCBvbiBTQ09TIExpdmUgSVNPCnNvdXJjZSAvZXRjL29zLXJlbGVhc2UKaWYgW1sgISAkKHRvdWNoIC91c3IvLnRlc3QpIF1dICYmIFtbICR7SUR9ID1+IF4oY2VudG9zKSQgXV07IHRoZW4KICB0b3VjaCAvb3B0L29wZW5zaGlmdC8ucGl2b3QtZG9uZQogIGV4aXQgMApmaQoKIyBSZWJhc2UgdG8gT0tEJ3MgT1NUcmVlIGNvbnRhaW5lciBpbWFnZS4KIyBUaGlzIGlzIHJlcXVpcmVkIGluIE9LRCBhcyB0aGUgbm9kZSBpcyBmaXJzdCBwcm92aXNpb25lZCB3aXRoIHBsYWluIEZlZG9yYSBDb3JlT1MuCgojIHNoZWxsY2hlY2sgZGlzYWJsZT1TQzEwOTEKLiAvdXNyL2xvY2FsL2Jpbi9ib290c3RyYXAtc2VydmljZS1yZWNvcmQuc2gKLiAvdXNyL2xvY2FsL2Jpbi9yZWxlYXNlLWltYWdlLnNoCgojIFBpdm90IGJvb3RzdHJhcCBub2RlIHRvIE9LRCdzIE9TVHJlZSBpbWFnZQppZiBbICEgLWYgL29wdC9vcGVuc2hpZnQvLnBpdm90LWRvbmUgXTsgdGhlbgpNQUNISU5FX09TX0lNQUdFPSQoaW1hZ2VfZm9yIHN0cmVhbS1jb3Jlb3MpCmVjaG8gIlB1bGxpbmcgJHtNQUNISU5FX09TX0lNQUdFfS4uLiIKICB3aGlsZSB0cnVlCiAgZG8KICAgIHJlY29yZF9zZXJ2aWNlX3N0YWdlX3N0YXJ0ICJwdWxsLW9rZC1vcy1pbWFnZSIKICAgIGlmIHBvZG1hbiBwdWxsIC0tcXVpZXQgIiR7TUFDSElORV9PU19JTUFHRX0iCiAgICB0aGVuCiAgICAgICAgcmVjb3JkX3NlcnZpY2Vfc3RhZ2Vfc3VjY2VzcwogICAgICAgIGJyZWFrCiAgICBlbHNlCiAgICAgICAgcmVjb3JkX3NlcnZpY2Vfc3RhZ2VfZmFpbHVyZQogICAgICAgIGVjaG8gIlB1bGwgZmFpbGVkLiBSZXRyeWluZyAke01BQ0hJTkVfT1NfSU1BR0V9Li4uIgogICAgZmkKICBkb25lCgogIHJlY29yZF9zZXJ2aWNlX3N0YWdlX3N0YXJ0ICJyZWJhc2UtdG8tb2tkLW9zLWltYWdlIgogIG1udD0iJChwb2RtYW4gaW1hZ2UgbW91bnQgIiR7TUFDSElORV9PU19JTUFHRX0iKSIKICAjIFNOTyBzZXR1cCBib290cyBpbnRvIExpdmUgSVNPIHdoaWNoIGNhbm5vdCBiZSByZWJhc2VkCiAgIyBodHRwczovL2dpdGh1Yi5jb20vY29yZW9zL3JwbS1vc3RyZWUvaXNzdWVzLzQ1NDcKICAjbWtkaXIgL3Zhci9tbnQve3VwcGVyLHdvcmtlcn0KICAjbW91bnQgLXQgb3ZlcmxheSBvdmVybGF5IC1vICJsb3dlcmRpcj0vdXNyOiRtbnQvdXNyIiAvdXNyCiAgI21vdW50IC10IG92ZXJsYXkgb3ZlcmxheSAtbyAibG93ZXJkaXI9L2V0YzokbW50L2V0Yyx1cHBlcmRpcj0vdmFyL21udC91cHBlcix3b3JrZGlyPS92YXIvbW50L3dvcmtlciIgL2V0
YwogIHJzeW5jIC1ybHR1ICRtbnQvZXRjLyAvZXRjLwogIGNwIC1hIC91c3IgL2xpYiAvbGliNjQgL3J1bi9lcGhlbWVyYWwKICByc3luYyAtcmx0IC0taWdub3JlLWV4aXN0aW5nICRtbnQvdXNyLyAvcnVuL2VwaGVtZXJhbC91c3IvCiAgcnN5bmMgLXJsdCAtLWlnbm9yZS1leGlzdGluZyAkbW50L2xpYi8gL3J1bi9lcGhlbWVyYWwvbGliLwogIHJzeW5jIC1ybHQgLS1pZ25vcmUtZXhpc3RpbmcgJG1udC9saWI2NC8gL3J1bi9lcGhlbWVyYWwvbGliNjQvCiAgbW91bnQgLS1iaW5kIC9ydW4vZXBoZW1lcmFsL3VzciAvdXNyCiAgbW91bnQgLS1iaW5kIC9ydW4vZXBoZW1lcmFsL2xpYiAvbGliCiAgbW91bnQgLS1iaW5kIC9ydW4vZXBoZW1lcmFsL2xpYjY0IC9saWI2NAogIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCgogICMgQXBwbHkgcHJlc2V0cyBmcm9tIE9LRCBNYWNoaW5lIE9TCiAgc3lzdGVtY3RsIHByZXNldC1hbGwKCiAgIyBXb3JrYXJvdW5kIGZvciBTRUxpbnV4IGRlbmlhbHMgd2hlbiBsYXVuY2hpbmcgY3Jpby5zZXJ2aWNlIGZyb20gb3ZlcmxheWZzCiAgc2V0ZW5mb3JjZSBQZXJtaXNzaXZlCgogICMgY3Jpby5zZXJ2aWNlIGlzIG5vdCBwYXJ0IG9mIEZDT1MgYnV0IG9mIE9LRCBNYWNoaW5lIE9TLiBJdCB3aWxsIGxvYWRlZCBhZnRlciBzeXN0ZW1jdGwgZGFlbW9uLXJlbG9hZCBhYm92ZSBidXQgaGFzIHRvIGJlIHN0YXJ0ZWQgbWFudWFsbHkKICBzeXN0ZW1jdGwgcmVzdGFydCBjcmlvLWNvbmZpZ3VyZS5zZXJ2aWNlCiAgc3lzdGVtY3RsIHN0YXJ0IGNyaW8uc2VydmljZQoKICB0b3VjaCAvb3B0L29wZW5zaGlmdC8ucGl2b3QtZG9uZQogIHBvZG1hbiBpbWFnZSB1bW91bnQgIiR7TUFDSElORV9PU19JTUFHRX0iCiAgcmVjb3JkX3NlcnZpY2Vfc3RhZ2Vfc3VjY2VzcwpmaQo=" else . end)' > fix.ign This patch works for most OKD-SCOS versions I tested until now (4.16, 4.17 and 4.18) uploaded on quay.io and openshift ci repositories. In detail, it mounts stream-coreos image as a container and extracts the content on a FCOS live image to avoid rpm-ostree. And please, do not consider the above patch as a final solution, it is a temporal fix until OKD workgroup release its own SCOS image. Also, @alrf , do not use openshift ci repository (registry.ci.openshift.org/origin/release-scos), because the artifacts are pruned periodically and your cluster will not be able to pull them; instead, use quay.io repository (quay.io/okd/scos-release). |
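Since the patch embeds the replacement script as a base64 `data:` URL, it can be useful to decode it and read it before applying it. The following is a small sketch of that idea; `decode_data_url` is a hypothetical helper name, and the example payload is only the first 52 characters of the base64 string from the jq command above.

```shell
#!/usr/bin/env bash
# Sketch only: decode the payload of a base64 "data:" URL, such as the value
# the jq patch writes into .contents.source, so the script can be reviewed.

decode_data_url() {
  local url="$1"
  # Everything after the first comma is the base64 payload.
  printf '%s' "${url#*,}" | base64 -d
}

# Decodes the start of the patch payload: the shebang line and "set -euo pipefail".
decode_data_url "data:text/plain;charset=utf-8;base64,IyEvdXNyL2Jpbi9lbnYgYmFzaApzZXQgLWV1byBwaXBlZmFpbAoK"
```

To review the full script from a patched file, the source value could be pulled out first, for example with `jq -r '.. | objects | select(.path == "/usr/local/bin/bootstrap-pivot.sh") | .contents.source' fix.ign`, and then passed to the function.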
@BeardOverflow thank you for the patch and information, I will try it.
Does this mean that an already installed with |
@alrf Yes, you should reinstall your cluster using Check the release tags here (not "officially" released yet, but usable):
And get the installation artifacts from the same release path to avoid surprises (e.g.: 4.17.0-0.okd-scos-2024-09-30-101739):
I spent plenty of hours to figure it out! For nightly/experimental builds, you can get the installation artifacts from A last note (easiest path): also, you can get them using ocp's oc binary (e.g.: 4.17.0-0.okd-scos-2024-09-30-101739):
curl -RL https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.17.0/openshift-client-linux-4.17.0.tar.gz -o- | tar zx
./oc adm release extract --tools quay.io/okd/scos-release:4.17.0-0.okd-scos-2024-09-30-101739
tar zxf openshift-client-linux-4.17.0-0.okd-scos-2024-09-30-101739.tar.gz # replaces previous ocp's oc binary to new okd's oc binary
tar zxf openshift-install-linux-4.17.0-0.okd-scos-2024-09-30-101739.tar.gz
curl -JLOR $(./openshift-install coreos print-stream-json | grep location | grep x86_64 | grep iso | cut -d\" -f4) |
Nightlies are pruned every 72 hours. Any installs based on nightlies are for testing only. These details are noted in our Community Testing document. |
@BeardOverflow I managed to install version 4.17.0-0.okd-scos-2024-09-30-101739 as a single node installation, no RAID configured, by using some tricks.
I got these issues again:
The installation failed in this case.
I got:
The fix in this case was:
Success, I got the server installed. In both cases I used the patched ignition config and cleared in advance with
The single-node
I will also try the regular RAID-configured installation. |
With normal, multi-host installation, I can't install the bootstrap node (version 4.17.0-0.okd-scos-2024-09-30-101739):
|
@alrf I updated the previous patch with a special condition to verify /run/ephemeral is writable:
I have not tested it yet, but in theory it should fix the failure point. |
@BeardOverflow I managed to install the bootstrap node with the new patch, however the master node (with RAID1) is partially installed:
|
@alrf I have not worked enough with software RAID-1 (aka mirror devices) to tell you how to fix your installation failures, but I can see some differences between your machine config and the reference docs. In your initial manifest files, you are using
Even if
If none of the above helps, consider using a RAID data volume instead (for the /var or /var/lib/containers mount points only): |
@BeardOverflow I checked the journald more carefully and found
The same as previously in my comments: |
@alrf
However, the SCOS image does implement a fix that ensures a version equal to or greater than 1.47.0 (coreos/coreos-assembler@1c280e7), but it is never reached because you are using coreos-installer from the FCOS image instead. Ideas? A new patch that removes the orphan_file feature from the /etc/mke2fs.conf defaults
|
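The idea of dropping orphan_file from the mke2fs defaults can be sketched as below. This is not the patch from the comment above (its content did not survive in this thread); it is an illustration that prints a modified copy of a mke2fs.conf-style file to stdout without editing anything in place. The function name `strip_orphan_file` and the abbreviated sample feature list are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: print a mke2fs.conf-style file with orphan_file removed from
# any comma-separated feature list. The input file is not modified.

strip_orphan_file() {
  local conf="$1"
  # First expression: orphan_file in the middle or at the end of a list.
  # Second expression: orphan_file at the start of a list.
  # Not handled: a list whose only entry is orphan_file.
  sed -e 's/,orphan_file//g' -e 's/orphan_file,//g' "$conf"
}

# Example on an abbreviated, made-up ext4 stanza:
sample="$(mktemp)"
printf '[fs_types]\n\text4 = {\n\t\tfeatures = has_journal,extent,orphan_file\n\t}\n' > "$sample"
strip_orphan_file "$sample"
rm -f "$sample"
```

As later comments in this thread point out, changing the defaults only affects filesystems created afterwards; a filesystem that was already created with orphan_file keeps the feature.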
Trying to install a bare-metal single-node cluster, I am also running into the same issue. For new installations this stream URL can be found within the openshift-installer |
@AlexStorm1313 please read this issue carefully, it was already discussed: |
@BeardOverflow Is it possible to download the SCOS raw image somewhere? Or should it be built with the CoreOS Assembler (COSA)? |
@BeardOverflow I've tried the latest patch, no changes,
|
@BeardOverflow @JaimeMagiera is it expected that
|
About the SCOS build question: yes, you can build your own SCOS image using COSA (see scos-content:xxx-stream-coreos); unfortunately, no public builds are available for download. It is not excessively difficult to do, but you have to build an SCOS image every time a new tag is released. About the 4.16.0-okd-scos.0 version question: yes, in theory both FCOS and SCOS are currently supported for the installation process, but FCOS is more prone to failure because it is not the target OS. In any case, FCOS still has some advantages, such as a more generic setup. About the latest patch, I think you are showing us an incorrect log here #2041 (comment) Please check which OS is booting, because FCOS does have the FEATURE_C12/orphan_file feature (and the patch must be applied to FCOS). Are you reading the logs after coreos-installer has finished? (in other words, after the first reboot) |
Thank you for clarifications.
Yes, it fails after the first reboot. I also provided the
|
@BeardOverflow the |
@JaimeMagiera @BeardOverflow the trick with /etc/mke2fs.conf doesn't work.
|
@alrf Sorry, I did not see your previous question. Yes, my patch only works before the first reboot. I am curious, are you executing your workaround before or after the first reboot? |
@BeardOverflow After the first reboot, already on SCOS. It doesn't make sense to execute it before (on FCOS). |
Uhm... I feel like there is a piece missing from this puzzle. Shouldn't the RAID (partition table included) be built before the first reboot? If you use tune2fs in the second boot, you are assuming an earlier mdadm/mkfs run. Let me turn that into a question: who built the RAID (FCOS or SCOS) and when (first or second boot)? To my knowledge, the answer is FCOS on the first boot, and I also think the FCOS installer is not wiping the devices before using them [1] [1] https://docs.fedoraproject.org/en-US/fedora-coreos/storage/ |
Yes, the raid is built before the first reboot (i.e. on FCOS), you can check the time from the outputs below.
From the logs above, it is FCOS before the first boot. How I see the issue: the raid is built before the first boot on FCOS, but BEFORE the
That explains why the raid has FEATURE_C12/orphan_file feature enabled on the second boot (on SCOS) - the raid was already built with it. |
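To confirm whether an existing filesystem already carries the feature, the feature line of `tune2fs -l` (or `dumpe2fs -h`) output can be checked. Below is a minimal sketch that reads such output on stdin and looks for either spelling mentioned in this thread, `orphan_file` or `FEATURE_C12`. The function name `has_orphan_file`, the device path in the usage comment, and the sample feature list are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: succeed (exit 0) if the "Filesystem features:" line read from
# stdin lists orphan_file, or FEATURE_C12 as tools that do not know the
# feature name print it.

has_orphan_file() {
  grep -E '^Filesystem features:' | grep -Eqw 'orphan_file|FEATURE_C12'
}

# Example with a made-up feature line; on a real node the input would come
# from something like: tune2fs -l /dev/md/md-root   (hypothetical device path)
if printf 'Filesystem features:      has_journal ext_attr extent FEATURE_C12\n' | has_orphan_file; then
  echo "orphan_file is enabled"
else
  echo "orphan_file is not enabled"
fi
```

Running the same check on FCOS right after the RAID is created and again on SCOS after the reboot would show at which point the feature was introduced.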
@alrf Sorry for the late answer. Please, take the following patch:
It writes |
@BeardOverflow The system doesn’t boot, I've attached the screenshot. |
@alrf I have tested a fresh SNO install of 4.17.0-okd-scos.0 plus the patch on a bare-metal machine with no problems. Also, in your screenshot, I see a hybrid system (BIOS+EFI). As a test, could you set your server to EFI only? (disable legacy/CSM support). |
@BeardOverflow I tend to think this is the first boot (but unsure). This is OKD 4.16.0-okd-scos.1
Have you tested software raid or just single disk installation?
No, not really. |
I am sorry, I did a single disk installation. I am not familiar with software raid installations, can you share your However, your screenshot is enigmatic, it looks like a detection error between EFI-SYSTEM vs BIOS-BOOT on FCOS. I find it hard to imagine that the latest patch caused it (which it writes |
@BeardOverflow My install-config.yaml:
But RAID configuration is not in install-config.yaml, you can find it here: #2041 (comment). It is created from butane config (#2041 (comment)) as described in the documentation.
Could you please share an ignition example? Another question: what is the right way to install additional packages on OKD4 servers? (e.g. automation tools/configuration management tools) How can it impact OKD4 updates? |
Sorry, I have not had time to test your setup. I reworked my patch to adapt it to the above pull request, and it works better than the previous one. Please try it out on your setup.
Create a second ignition file and prepare an installation service for the e2fsprogs package [1]:
Best path may be a machine config object, which will install a service [1][2] with [1] https://upstreamwithoutapaddle.com/blog%20post/2023/05/21/Pull-Youself-Up-By-Your-Bootstraps.html |
This will be included as a known issue for the release of OKD 4.16/4.17. The recommended workaround in the short term is to start on older versions of FCOS (<39) if you need the RAID config. In the medium term, SCOS boot artifacts should be available shortly. |
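A small sketch of how the "older than FCOS 39" condition could be checked against a version string such as the one reported in this issue. It assumes the major version is the number before the first dot; `fcos_older_than_39` is a hypothetical helper name, and 38.20231027.3.2 is used only as a sample value.

```shell
#!/usr/bin/env bash
# Sketch only: succeed (exit 0) if an FCOS version string has a major
# version below 39. Returns 2 if the major part is not a plain number.

fcos_older_than_39() {
  local version="$1"
  local major="${version%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$major" -lt 39 ]
}

# Example with the FCOS version from this issue and a sample older one:
for v in 39.20240210.3.0 38.20231027.3.2; do
  if fcos_older_than_39 "$v"; then
    echo "$v: older than 39"
  else
    echo "$v: 39 or newer"
  fi
done
```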
@BeardOverflow the patch itself works (I mean the system is bootable); however, it doesn't fix the FEATURE_C12/orphan_file issue. |
@BeardOverflow |
@BeardOverflow
|
OKD Version: 4.15.0-0.okd-2024-03-10-010116
FCOS Version: 39.20240210.3.0 (CoreOS)
https://api-int.mydomain.com:22623/config/master shows HTTP ERROR 500
In the logs of the machine-config-server container I see: