OL8: LVM volume not mounted on reboot after systemd-239-78.0.3 #136

Open
mvelikikh opened this issue Apr 29, 2024 · 8 comments

@mvelikikh

After updating to systemd-239-78.0.3 or later, LVM volumes are not always mounted after a reboot.
I constructed several test cases to demonstrate the issue using the Oracle-provided AMI ami-076b18946a12c27d6 on AWS.
Here is a sample CloudFormation template that demonstrates the issue:
non-working-standard.yml.txt
User data:

yum install -y lvm2
yum update -y systemd
systemctl disable multipathd
nvme=$(lsblk -o NAME,SIZE | awk '/ 1G/ {print $1}')
pvcreate /dev/$nvme
vgcreate testvg /dev/$nvme
lvcreate -l 100%FREE -n u01 testvg
mkfs.xfs -f /dev/testvg/u01
echo '/dev/testvg/u01 /u01 xfs defaults 0 0' >> /etc/fstab
mkdir -p /u01
mount /u01
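The awk pattern above picks every lsblk line containing " 1G", so it would return more than one name if a second 1G device were attached. A slightly more defensive selection could look like the sketch below. The helper name pick_disk_by_size is hypothetical and not part of the attached templates. It assumes that lsblk prints the test disk size exactly as "1G", as in the output shown later in this report.

```shell
# Hypothetical helper, not part of the original user data.
# Reads "NAME SIZE" pairs on stdin (for example from: lsblk -dn -o NAME,SIZE)
# and prints the device whose size equals $1. Fails unless exactly one matches.
pick_disk_by_size() {
    want=$1
    matches=$(awk -v want="$want" '$2 == want { print $1 }')
    # Count non-empty result lines without relying on grep's exit status.
    count=$(printf '%s\n' "$matches" | awk 'NF { n++ } END { print n + 0 }')
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one $want device, found $count" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}
```

In the user data it could replace the assignment like this: nvme=$(lsblk -dn -o NAME,SIZE | pick_disk_by_size 1G) || exit 1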

Once the template is deployed, confirm that cloud-init completed without errors and /u01 is mounted. Then reboot the EC2 instance, e.g. via reboot.
When it comes back, /u01 is not mounted anymore:

[ec2-user@ip-10-100-101-225 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        799M     0  799M   0% /dev
tmpfs           818M     0  818M   0% /dev/shm
tmpfs           818M   17M  802M   2% /run
tmpfs           818M     0  818M   0% /sys/fs/cgroup
/dev/nvme0n1p1   32G  2.4G   30G   8% /
tmpfs           164M     0  164M   0% /run/user/1000

/var/log/messages contains:

systemd[1]: dev-testvg-u01.device: Job dev-testvg-u01.device/start timed out.
systemd[1]: Timed out waiting for device dev-testvg-u01.device.
systemd[1]: Dependency failed for /u01.
systemd[1]: Dependency failed for Remote File Systems.
systemd[1]: remote-fs.target: Job remote-fs.target/start failed with result 'dependency'.
systemd[1]: u01.mount: Job u01.mount/start failed with result 'dependency'.
systemd[1]: dev-testvg-u01.device: Job dev-testvg-u01.device/start failed with result 'timeout'.
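To check other hosts for the same symptom, the timed-out device units can be pulled out of the log. This is only a sketch: the function name timed_out_devices is made up, and it assumes the message wording is exactly the one shown above ("Timed out waiting for device <unit>.").

```shell
# Hypothetical helper: reads syslog or journal lines on stdin and prints
# each device unit that systemd reported as timed out, without duplicates.
# Assumes the message format shown in the excerpt above.
timed_out_devices() {
    sed -n 's/.*Timed out waiting for device \(.*\.device\)\.$/\1/p' | sort -u
}
```

For example: sudo cat /var/log/messages | timed_out_devices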

I created several CloudFormation templates: test-cases.zip

  • non-working-standard: the deployment where systemd is updated to the latest available version, 239-78.0.4, and multipathd is disabled. /u01 is not mounted on reboot.
  • non-working-systemd: the deployment that demonstrates that /u01 is not mounted on reboot if systemd is updated to 239-78.0.3, the version that introduced this problem.
  • working-fstab-generator-reload-targets-disabled: the deployment where systemd-fstab-generator-reload-targets.service is disabled. This is the service that Oracle introduced in systemd-239-78.0.3. There is no such service upstream. /u01 is mounted after reboot.
  • working-multipathd-enabled: the deployment where multipathd.service is enabled. /u01 is mounted after reboot.
  • working-systemd: the deployment that uses systemd-239-78.0.1, the version that is shipped with the AMI and does not have the issue. /u01 is mounted on reboot.

For each of the deployments above, I ran the following commands:

after deployment

date
sudo cloud-init status
df -h
rpm -q systemd
systemctl status multipathd
systemctl status systemd-fstab-generator-reload-targets
sudo reboot

after reboot

date
uptime
df -h
journalctl -b -o short-precise > /tmp/journalctl.txt
sudo cp /var/log/messages /tmp/messages.txt
sudo chmod o+r /tmp/messages.txt

The logs of the command executions are in the commands.txt files inside the archive along with journalctl.txt and messages.txt.

Thus, the issue happens when all of the following conditions are true:

  • systemd >= 239-78.0.3
  • multipathd disabled
  • there is a mount on top of LVM
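
The first two conditions can be checked with a short script. The following is a rough sketch only: both function names are hypothetical, the ".el" dist-tag format of the release string is an assumption, the comparison relies on GNU sort -V, and the third condition (a mount on top of LVM) is not checked.

```shell
# Hypothetical helpers for the conditions listed above (OL8 only).

# Succeeds when the given systemd version-release is 239-78.0.3 or newer.
# A trailing dist tag such as ".el8_9" (assumed format) is stripped first.
systemd_at_least_239_78_0_3() {
    ver=${1%%.el*}
    first=$(printf '%s\n' "239-78.0.3" "$ver" | sort -V | head -n 1)
    [ "$first" = "239-78.0.3" ]
}

# $1: systemd version-release
# $2: state as printed by "systemctl is-enabled multipathd"
# The "mount on top of LVM" condition is NOT checked here.
matches_reported_conditions() {
    systemd_at_least_239_78_0_3 "$1" && [ "$2" != "enabled" ]
}
```

Possible usage: matches_reported_conditions "$(rpm -q --qf '%{VERSION}-%{RELEASE}' systemd)" "$(systemctl is-enabled multipathd)" && echo "host matches the reported conditions"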

The following workarounds are known to prevent the issue, so that an LVM volume /u01 is mounted after reboot:

  • use systemd < 239-78.0.3
  • enable multipathd
  • disable systemd-fstab-generator-reload-targets

I have been able to reproduce this issue only on AWS, with different instance types (AMD and Intel based). I was not able to reproduce it on Azure with either NVMe or non-NVMe based VM sizes.
What actually happens is that lvm2-pvscan@.service is sometimes not invoked after systemd-239-78.0.3 is applied, so LVM auto-activation is not performed. If an LVM volume is not mounted after the EC2 instance reboots, I can manually activate the affected volume groups via vgchange -a y. Alternatively, I can run the command used by lvm2-pvscan@.service against the affected device, sudo /usr/sbin/lvm pvscan --cache --activate ay 259:1, as demonstrated below:

[ec2-user@ip-10-100-101-125 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        799M     0  799M   0% /dev
tmpfs           818M     0  818M   0% /dev/shm
tmpfs           818M   17M  802M   2% /run
tmpfs           818M     0  818M   0% /sys/fs/cgroup
/dev/nvme0n1p1   32G  2.4G   30G   8% /
tmpfs           164M     0  164M   0% /run/user/1000
[ec2-user@ip-10-100-101-125 ~]$ lsblk
NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  32G  0 disk
└─nvme0n1p1 259:2    0  32G  0 part /
nvme1n1     259:1    0   1G  0 disk
[ec2-user@ip-10-100-101-125 ~]$ sudo /usr/sbin/lvm pvscan --cache --activate ay 259:1
  pvscan[905] PV /dev/nvme1n1 online, VG testvg is complete.
  pvscan[905] VG testvg run autoactivation.
  1 logical volume(s) in volume group "testvg" now active
[ec2-user@ip-10-100-101-125 ~]$ df -h
Filesystem              Size  Used Avail Use% Mounted on
devtmpfs                799M     0  799M   0% /dev
tmpfs                   818M     0  818M   0% /dev/shm
tmpfs                   818M   17M  802M   3% /run
tmpfs                   818M     0  818M   0% /sys/fs/cgroup
/dev/nvme0n1p1           32G  2.4G   30G   8% /
tmpfs                   164M     0  164M   0% /run/user/1000
/dev/mapper/testvg-u01 1016M   40M  977M   4% /u01
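The manual recovery can be wrapped in a small function for use after an affected boot. This is a sketch, not a fix. The names run and recover_lvm_mounts and the DRY_RUN variable are hypothetical. The mount -a step is an added safety net that is not part of the session above, where /u01 appeared without an explicit mount command.

```shell
# Hypothetical wrapper: prints the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Activates all volume groups, then mounts any fstab entries that are
# still unmounted. "mount -a" is an assumption added for illustration.
recover_lvm_mounts() {
    run vgchange -a y && run mount -a
}
```

Run it as root. With DRY_RUN=1 it only lists the commands it would execute.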
@YoderExMachina
Member

Thanks for submitting this issue and providing the comprehensive info. We will take a look at this internally.

@mvelikikh
Author

Would it be possible to provide an update on this issue? If any additional information is needed, please let me know.

@YoderExMachina
Member

Could you tell me whether the following modifications temporarily address the issue? On my end, it worked to activate the LVM volume groups within the service file before "systemctl daemon-reload" is invoked.

With OL8 using systemd-239-78.0.3 or greater

(as root)
A)
systemctl disable systemd-fstab-generator-reload-targets.service

B) Remember to back up the service file (just in case), then change /usr/lib/systemd/system/systemd-fstab-generator-reload-targets.service to the following:

[Unit]
Description=systemd-fstab-generator-reload-targets.service
Documentation=man:systemd-fstab-generator
DefaultDependencies=no
Wants=local-fs-pre.target
After=local-fs-pre.target
Before=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=-/bin/sh -c "/sbin/vgscan"
ExecStart=-/bin/sh -c "/sbin/vgchange -ay"
ExecStart=-/bin/sh -c "/bin/systemctl daemon-reload"

[Install]
WantedBy=local-fs.target

C) systemctl enable systemd-fstab-generator-reload-targets.service

D) Try rebooting multiple times to ensure it works
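
For step D, a quick check after each reboot is to compare /etc/fstab with the currently mounted file systems. The sketch below is one way to do that. The function name missing_fstab_mounts is hypothetical, and it deliberately ignores swap and noauto entries.

```shell
# Hypothetical helper: prints every fstab mount point that is not currently
# mounted. $1 defaults to /etc/fstab and $2 to /proc/mounts.
missing_fstab_mounts() {
    fstab=${1:-/etc/fstab}
    mounts=${2:-/proc/mounts}
    awk '
        FILENAME == ARGV[1]     { mounted[$2] = 1; next }  # first file: active mounts
        /^[ \t]*(#|$)/          { next }                   # skip comments and blank lines
        $3 == "swap"            { next }                   # swap has no mount point
        $4 ~ /(^|,)noauto(,|$)/ { next }                   # not expected at boot
        !($2 in mounted)        { print $2 }
    ' "$mounts" "$fstab"
}
```

On an affected boot of the test instance this would be expected to print /u01. No output means every fstab entry it considers is mounted.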

@mvelikikh
Author

> Yes, we have discussed this and will be opening an internal ticket to track the issue. Can you let me know what versions of OL you have tested this on?

It is Oracle Linux 8.9. ami-076b18946a12c27d6 provided by Oracle on AWS.

> Could you tell me whether the following modifications temporarily address the issue? On my end, it worked to activate the LVM volume groups within the service file before "systemctl daemon-reload" is invoked.

Yes, it works fine. I rebooted the system 6 times and the LV was mounted after each reboot:

[ec2-user@ip-172-31-34-92 ~]$ last reboot
reboot   system boot  5.15.0-200.131.2 Tue May 21 17:55   still running
reboot   system boot  5.15.0-200.131.2 Tue May 21 17:54 - 17:55  (00:00)
reboot   system boot  5.15.0-200.131.2 Tue May 21 17:54 - 17:54  (00:00)
reboot   system boot  5.15.0-200.131.2 Tue May 21 17:53 - 17:53  (00:00)
reboot   system boot  5.15.0-200.131.2 Tue May 21 17:50 - 17:52  (00:02)
reboot   system boot  5.15.0-200.131.2 Tue May 21 17:46 - 17:50  (00:03)

wtmp begins Tue May 21 17:46:26 2024
[ec2-user@ip-172-31-34-92 ~]$ journalctl -u systemd-fstab-generator-reload-targets.service
-- Logs begin at Tue 2024-05-21 17:55:45 GMT, end at Tue 2024-05-21 17:56:19 GMT. --
May 21 17:55:45 ip-172-31-34-92.ec2.internal systemd[1]: Starting systemd-fstab-generator-reload-targets.service...
May 21 17:55:45 ip-172-31-34-92.ec2.internal sh[479]:   Found volume group "testvg" using metadata type lvm2
May 21 17:55:45 ip-172-31-34-92.ec2.internal sh[491]:   1 logical volume(s) in volume group "testvg" now active
May 21 17:55:46 ip-172-31-34-92.ec2.internal systemd[1]: Started systemd-fstab-generator-reload-targets.service.

@mvelikikh
Author

Could you please provide an update about the fix and its timeline? We are still hitting the same issue even with Oracle Linux 8.10.

> working-fstab-generator-reload-targets-disabled: the deployment where systemd-fstab-generator-reload-targets.service is disabled. This is the service that Oracle introduced in systemd-239-78.0.3. There is no such service upstream. /u01 is mounted after reboot.

We found that although the problem did not occur in simple scenarios, it still occurred in more complex tests even when systemd-fstab-generator-reload-targets.service was masked. Therefore, disabling or masking systemd-fstab-generator-reload-targets.service is not a reliable workaround.

@YoderExMachina
Member

Hi, I'm afraid we can't provide an ETA, but I have followed up with the developer to see if they need anything else.

@shaunmcevoy

Hi, are there any updates on this issue? We appear to have encountered the same issue on a few OL 8.10 servers today.

@Bouncy-Handrail

We encountered the same problem on some servers after the upgrade. Are there any updates on the issue?
