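(Actually caused by running this project nested inside a VM created by this project; non-nested installs should be unaffected) Host: Debian 12 with PVE installed. After the host restarts its network, the VMs' tap devices are lost and are not recreated or reattached; the VM itself has to be shut down and started again / use OVS instead of the bridge to implement NAT #21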

Open
spiritLHLS opened this issue Aug 22, 2024 · 33 comments
Labels: bug, enhancement, help wanted

Comments

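@spiritLHLS
Collaborator

spiritLHLS commented Aug 22, 2024

I installed this virtualization project on a Debian 12 system. The NAT KVM VMs it created lose their network after running for a while. When I enter a NAT KVM VM through the PVE console, ping 172.16.1.1 fails, and running reboot inside the VM does not help either. All NAT KVM VMs lose connectivity at the same time. The only fix is to select the VM in the PVE web console, open the shutdown menu at the top right and choose reboot; after that restart the VM's network recovers. This does not appear while there is no network, though, and I don't know where the problem is. The Debian 12 host's own network is fine.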

auto lo
iface lo inet loopback
auto vmbr0
iface vmbr0 inet static
    address 110.xx.xx.xx/24
    gateway 110.xx.xx.1
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0

iface vmbr0 inet6 static
    address 240e:x:x:x::x:x/128
    gateway 240e:x:x:x::5064:1
    up ip addr del fe80::be24:11ff:feb6:c5c2/64 dev eth0
auto vmbr1
iface vmbr1 inet static
    address 172.16.1.1
    netmask 255.255.255.0
    bridge_ports none
    bridge_stp off
    bridge_fd 0
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up echo 1 > /proc/sys/net/ipv4/conf/vmbr1/proxy_arp
    post-up iptables -t nat -A POSTROUTING -s '172.16.1.0/24' -o vmbr0 -j MASQUERADE
    post-down iptables -t nat -D POSTROUTING -s '172.16.1.0/24' -o vmbr0 -j MASQUERADE

iface vmbr1 inet6 static
    address 2001:db8:1::1/64
    post-up sysctl -w net.ipv6.conf.all.forwarding=1
    post-up ip6tables -t nat -A POSTROUTING -s 2001:db8:1::/64 -o vmbr0 -j MASQUERADE
    post-down sysctl -w net.ipv6.conf.all.forwarding=0
    post-down ip6tables -t nat -D POSTROUTING -s 2001:db8:1::/64 -o vmbr0 -j MASQUERADE

Originally posted by @wbews in #11 (comment)

@spiritLHLS
Collaborator Author

spiritLHLS commented Aug 22, 2024

I suspect your gateway went down, knocked offline by something.

When this happens, have you tried pinging 172.16.1.1 from the host?

Also run:

brctl show

@wbews
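
A quick way to read the `brctl show` output is to look for a bridge whose interfaces column is empty, which would mean no tap device is attached to it. A minimal sketch, assuming the usual four-column layout that `brctl show` prints; the helper name `empty_bridges` is made up for illustration:

```shell
#!/bin/bash
# List bridges that have no attached interfaces.
# Reads the output of a single `brctl show` run on stdin.
# A bridge row has 4 fields when a port is attached and 3 when it has none;
# continuation rows (further ports of the same bridge) have 1 field.
empty_bridges() {
    awk 'NR > 1 && NF == 3 { print $1 }'
}

# Usage on a real host (needs bridge-utils):
#   brctl show | empty_bridges
```

If the NAT bridge shows up in that list while VMs are running, the VMs are not plugged into it.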

@spiritLHLS
Collaborator Author

spiritLHLS commented Aug 22, 2024

Inside the VM, run the following commands and send me the last 20 lines of each:

cat /var/log/messages
cat /var/log/syslog

@wbews

This comment was marked as resolved.

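@spiritLHLS
Collaborator Author

> Hello, I just ran pve_delete.sh 107 to delete a VM that was not running. After that, all NAT KVM VMs lost their network. Is this normal?

Yes, that is expected.

https://github.com/oneclickvirt/pve/blob/main/scripts/pve_delete.sh#L63C1-L67C37

After a deletion, the script restarts the entire host network to reload the NAT mappings.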

@spiritLHLS
Collaborator Author

Aug 22 14:45:00 VM102 systemd[1]: cloud-config.service: Failed with result 'exit-code'.
Aug 22 14:45:00 VM102 systemd[1]: Failed to start Apply the settings specified in cloud-config.

Aug 22 14:45:00 VM102 systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Aug 22 14:45:00 VM102 systemd[1]: Failed to start Execute cloud user/final scripts.

The logs show that cloud-init is having some trouble. I don't know whether that is the cause.

@spiritLHLS
Collaborator Author

> Aug 22 14:45:00 VM102 systemd[1]: cloud-config.service: Failed with result 'exit-code'.
> Aug 22 14:45:00 VM102 systemd[1]: Failed to start Apply the settings specified in cloud-config.
>
> Aug 22 14:45:00 VM102 systemd[1]: cloud-final.service: Failed with result 'exit-code'.
> Aug 22 14:45:00 VM102 systemd[1]: Failed to start Execute cloud user/final scripts.
>
> The logs show that cloud-init is having some trouble. I don't know whether that is the cause.

cat /etc/cloud/cloud.cfg

Take a look at this configuration inside the VM.

@spiritLHLS spiritLHLS added the bug Something isn't working label Aug 22, 2024
@spiritLHLS
Collaborator Author

If the gateway answers ping from the host, the network configuration on the outside is fine; the problem is in the configuration inside the VM.

@spiritLHLS
Collaborator Author

If you can, try creating VMs with different operating systems to see whether only one type of system is affected.

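@wbews

wbews commented Aug 22, 2024

service networking restart
systemctl restart networking.service

So the delete script restarts the network, the VMs do not recover on their own, and they can only be restarted from the console, right?

The VM's cloud.cfg: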

root@VM102:~# cat /etc/cloud/cloud.cfg
# The top level settings are used as module
# and system configuration.

# A set of users which may be applied and/or used by various modules
# when a 'default' entry is found it will reference the 'default_user'
# from the distro configuration specified below
users:
   - default

# If this is set, 'root' will not be able to ssh in and they
# will get a message to login instead as the above $user (debian)
disable_root: true

# This will cause the set+update hostname module to not operate (if true)
preserve_hostname: false

# This prevents cloud-init from rewriting apt's sources.list file,
# which has been a source of surprise.
apt_preserve_sources_list: true

# Example datasource config
# datasource:
#    Ec2:
#      metadata_urls: [ 'blah.com' ]
#      timeout: 5 # (defaults to 50 seconds)
#      max_wait: 10 # (defaults to 120 seconds)

# The modules that run in the 'init' stage
cloud_init_modules:
 - migrator
 - seed_random
 - bootcmd
 - write-files
 - growpart
 - resizefs
 - disk_setup
 - mounts
 - set_hostname
 - update_hostname
 - update_etc_hosts
 - ca-certs
 - rsyslog
 - users-groups
 - ssh

# The modules that run in the 'config' stage
cloud_config_modules:
# Emit the cloud config ready event
# this can be used by upstart jobs for 'start on cloud-config'.
 - emit_upstart
 - ssh-import-id
 - locale
 - set-passwords
 - grub-dpkg
 - apt-pipelining
 - apt-configure
 - ntp
 - timezone
 - disable-ec2-metadata
 - runcmd
 - byobu

# The modules that run in the 'final' stage
cloud_final_modules:
 - package-update-upgrade-install
 - fan
 - puppet
 - chef
 - salt-minion
 - mcollective
 - rightscale_userdata
 - scripts-vendor
 - scripts-per-once
 - scripts-per-boot
 - scripts-per-instance
 - scripts-user
 - ssh-authkey-fingerprints
 - keys-to-console
 - phone-home
 - final-message
 - power-state-change

# System and/or distro specific settings
# (not accessible to handlers/transforms)
system_info:
   # This will affect which distro class gets used
   distro: debian
   # Default user name + that default users groups (if added/used)
   default_user:
     name: debian
     lock_passwd: True
     gecos: Debian
     groups: [adm, audio, cdrom, dialout, dip, floppy, netdev, plugdev, sudo, video]
     sudo: ["ALL=(ALL) NOPASSWD:ALL"]
     shell: /bin/bash
   # Other config here will be given to the distro class and/or path classes
   paths:
      cloud_dir: /var/lib/cloud/
      templates_dir: /etc/cloud/templates/
      upstart_dir: /etc/init/
   package_mirrors:
     - arches: [default]
       failsafe:
         primary: http://deb.debian.org/debian
         security: http://security.debian.org/
   ssh_svcname: ssh
root@VM102:~#

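@spiritLHLS
Collaborator Author

> service networking restart
> systemctl restart networking.service
> So the delete script restarts the network, the VMs do not recover on their own, and they can only be restarted from the console, right?

That is possible. The next time it happens, you can try:

systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd

Restart the PVE services and see whether the VM network comes back.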

@wbews

wbews commented Aug 22, 2024

I restarted the PVE services; the VM network did not come back.

@spiritLHLS
Collaborator Author

spiritLHLS commented Aug 22, 2024

ifdown vmbr0 && ifup vmbr0

ifdown vmbr1 && ifup vmbr1

What about taking the bridges down and bringing them back up?

@wbews

wbews commented Aug 22, 2024

That doesn't work either.
root@pve:~# ifdown vmbr1 && ifup vmbr1
root@pve:~# brctl show
bridge name     bridge id               STP enabled     interfaces
vmbr0           8000.bc24111480bd       no              eth0
vmbr1           8000.000000000000       no
root@pve:~# brctl show
bridge name     bridge id               STP enabled     interfaces
vmbr0           8000.bc24111480bd       no              eth0
vmbr1           8000.000000000000       no
root@pve:~# brctl show
bridge name     bridge id               STP enabled     interfaces
vmbr0           8000.bc24111480bd       no              eth0
vmbr1           8000.000000000000       no
root@pve:~# ifdown vmbr0 && ifup vmbr0
warning: vmbr0: up cmd 'ip addr del fe80::be24:11ff:feb6:c5c2/64 dev eth0' failed: returned 2 (Error: ipv6: address not found.
)
root@pve:~# ifdown vmbr0 && ifup vmbr0
warning: vmbr0: up cmd 'ip addr del fe80::be24:11ff:feb6:c5c2/64 dev eth0' failed: returned 2 (Error: ipv6: address not found.
)
root@pve:~# brctl show
bridge name     bridge id               STP enabled     interfaces
vmbr0           8000.bc24111480bd       no              eth0
vmbr1           8000.000000000000       no

@spiritLHLS
Collaborator Author

Huh, that is really odd. So restarting the network on the host side has no effect on the VMs' networking.

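@wbews

wbews commented Aug 22, 2024

service networking restart
systemctl restart networking.service
Once the host restarts its network, the VMs can only get their network back through a cold restart; running reboot inside the VM has no effect either.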

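@spiritLHLS
Collaborator Author

spiritLHLS commented Aug 22, 2024

qm agent $vmid network-interfaces-flush

I don't know whether the image you are using has the QEMU Guest Agent installed. If it does, refreshing the guest's network interfaces this way might help, though I'm not sure.

Replace $vmid with your VM's ID, e.g. 102 or 103.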

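@spiritLHLS
Collaborator Author

Your original problem is probably the same kind of fault: the host network restarted on its own, the VMs' tap devices were lost, and the VMs could only get their network back through a cold restart.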
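
If the tap devices still exist on the host and were only detached when the bridge was recreated, it may be possible to plug them back in without restarting the VM. This is an untested sketch, not something confirmed in this thread. It assumes PVE's `tap<vmid>i<N>` naming for NIC `netN`, reads the bridge name from `qm config`, and skips NICs with `firewall=1` because those sit behind an extra firewall bridge. `reattach_cmds` is a hypothetical helper that only prints the commands:

```shell
#!/bin/bash
# Print the `ip link` commands that would re-attach a VM's tap devices.
#   $1    : the VM id
#   stdin : output of `qm config <vmid>`
# Assumption: NIC netN of VM <vmid> is backed by a tap device named tap<vmid>iN.
reattach_cmds() {
    # Keep only netN lines without firewall=1 and extract "N bridge".
    sed -n '/firewall=1/!s/^net\([0-9][0-9]*\):.*bridge=\([^,]*\).*/\1 \2/p' |
    while read -r idx bridge; do
        echo "ip link set tap${1}i${idx} master ${bridge}"
    done
}

# Dry run on a host:    qm config 102 | reattach_cmds 102
# Apply after review:   qm config 102 | reattach_cmds 102 | sh
```

Printing first and piping to `sh` only after a look keeps the helper harmless if the naming assumption does not hold on a given host.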

@spiritLHLS spiritLHLS added the help wanted Extra attention is needed label Aug 22, 2024
@wbews

wbews commented Aug 22, 2024

Yes, that is probably the cause. But this command cannot do the refresh!
root@pve:~# qm agent 102 network-interfaces-flush
400 Parameter verification failed.
command: value 'network-interfaces-flush' does not have a value in the enumeration 'fsfreeze-freeze, fsfreeze-status, fsfreeze-thaw, fstrim, get-fsinfo, get-host-name, get-memory-block-info, get-memory-blocks, get-osinfo, get-time, get-timezone, get-users, get-vcpus, info, network-get-interfaces, ping, shutdown, suspend-disk, suspend-hybrid, suspend-ram'
qm guest cmd
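
The error message lists every command the guest agent interface accepts, so `network-interfaces-flush` is simply not one of them; the closest entry is `network-get-interfaces`, which only reads the guest's interface state. A small sketch for checking a command name against that list before calling `qm agent` (the list is copied from the error output above; `is_agent_cmd` is a hypothetical helper):

```shell
#!/bin/bash
# Commands accepted by `qm agent`, copied from the error message in this thread.
AGENT_CMDS="fsfreeze-freeze fsfreeze-status fsfreeze-thaw fstrim get-fsinfo \
get-host-name get-memory-block-info get-memory-blocks get-osinfo get-time \
get-timezone get-users get-vcpus info network-get-interfaces ping shutdown \
suspend-disk suspend-hybrid suspend-ram"

# Return 0 if $1 is one of the listed commands, 1 otherwise.
is_agent_cmd() {
    case " $AGENT_CMDS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a host:
#   is_agent_cmd network-get-interfaces && qm agent 102 network-get-interfaces
```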

@spiritLHLS
Collaborator Author

> Yes, that is probably the cause. But this command cannot do the refresh!
> root@pve:~# qm agent 102 network-interfaces-flush
> 400 Parameter verification failed.
> command: value 'network-interfaces-flush' does not have a value in the enumeration 'fsfreeze-freeze, fsfreeze-status, fsfreeze-thaw, fstrim, get-fsinfo, get-host-name, get-memory-block-info, get-memory-blocks, get-osinfo, get-time, get-timezone, get-users, get-vcpus, info, network-get-interfaces, ping, shutdown, suspend-disk, suspend-hybrid, suspend-ram'
> qm guest cmd

A cold restart from the web panel is presumably just shutting the VM down and starting it again, right?

qm shutdown 102
qm start 102

Does restarting it directly with these commands work too? Give it a try.

@wbews

wbews commented Aug 22, 2024

That works. Is it really necessary to restart the host network when deleting a VM?

@spiritLHLS
Collaborator Author

> That works. Is it really necessary to restart the host network when deleting a VM?

No, it is not. I have just removed that part of the script.

@spiritLHLS
Collaborator Author

> A cold restart from the web panel is presumably just shutting the VM down and starting it again, right?
>
> qm shutdown 102
> qm start 102
>
> Does restarting it directly with these commands work too? Give it a try.

ifreload -a

This is the command that reloads the interfaces file.

I doubt it will help either, though; the tap devices would probably still not be recreated and attached to the bridge.

Fixing this for good would probably take OVS, an enhanced alternative to the Linux bridge.
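
On the `ifreload -a` suggestion: `ifreload` is part of ifupdown2, so it is only available when that package is installed. Whether it keeps the tap devices attached in this nested setup has not been tested in this thread. A sketch that prefers `ifreload -a` and falls back to a full restart (the function name `reload_cmd` is made up):

```shell
#!/bin/bash
# Print the command to use for applying /etc/network/interfaces changes.
# Prefers ifreload (shipped with ifupdown2) when it is on PATH; otherwise
# falls back to restarting the networking service, which is the step that
# broke VM connectivity in this thread.
reload_cmd() {
    if command -v ifreload >/dev/null 2>&1; then
        echo "ifreload -a"
    else
        echo "systemctl restart networking"
    fi
}

# Usage: inspect first, then run it.
#   reload_cmd
#   sh -c "$(reload_cmd)"
```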

@spiritLHLS spiritLHLS added the enhancement New feature or request label Aug 22, 2024
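@spiritLHLS spiritLHLS changed the title from "Debian 12 with the virtualization project installed: NAT KVM VMs lose their network after running for a while; from the PVE console inside a NAT KVM VM, ping 172.16.1.1 fails and reboot does not help; all NAT KVM VMs fail at the same time; only the shutdown/reboot menu at the top right of the PVE web console restores the VM network; the Debian 12 host's own network is fine." to "Host: Debian 12 with PVE installed. After the host restarts its network, the VMs' tap devices are lost and are not recreated or reattached; the VM itself has to be shut down and started again / use OVS instead of the bridge to implement NAT" Aug 22, 2024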
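@spiritLHLS
Collaborator Author

> A cold restart from the web panel is presumably just shutting the VM down and starting it again, right?
>
> qm shutdown 102
> qm start 102
>
> Does restarting it directly with these commands work too? Give it a try.

Here is an automated version that restarts the VMs:

#!/bin/bash
# Cold-restart every running VM so that its tap device is created again.
# In `qm list` output, column 1 is the VMID and column 3 is the status.
running_vms=$(qm list | awk '$3 == "running" {print $1}')
if [ -z "$running_vms" ]; then
    echo "No running VMs."
    exit 0
fi
echo "The following VMs will be shut down and started again:"
echo "$running_vms"
for vm in $running_vms; do
    qm shutdown "$vm"
    # Wait until the VM has actually stopped.
    while qm status "$vm" | grep -q running; do
        sleep 5
    done
    qm start "$vm"
    # Wait until the VM reports running again.
    while ! qm status "$vm" | grep -q running; do
        sleep 5
    done
    sleep 1
done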
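
The loop above waits forever if a guest ignores the shutdown request. A variant with a bounded wait might look like this. It is a sketch under assumptions: `qm shutdown` accepts `--timeout <seconds>` and is assumed to exit non-zero when the timeout elapses, after which `qm stop` powers the VM off hard. `restart_vm` and the 120-second value are illustrative:

```shell
#!/bin/bash
# Seconds to wait for a clean guest shutdown (illustrative value).
SHUTDOWN_TIMEOUT=120

# Cold-restart one VM: try a clean shutdown first, force a stop if that fails.
#   $1 : the VM id
restart_vm() {
    if ! qm shutdown "$1" --timeout "$SHUTDOWN_TIMEOUT"; then
        # Assumption: a non-zero exit means the guest did not power off in time.
        echo "VM $1 did not shut down within ${SHUTDOWN_TIMEOUT}s, forcing stop" >&2
        qm stop "$1"
    fi
    qm start "$1"
}

# Usage on a host:
#   for vm in $(qm list | awk '$3 == "running" {print $1}'); do restart_vm "$vm"; done
```

`qm stop` is the equivalent of pulling the plug, so it is only used as the fallback.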

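@wbews

wbews commented Aug 22, 2024

I've noticed that sometimes a cold restart of a VM from the host or from PVE gets no response and PVE reports a timeout. When that happens I first have to run reboot in the console and then immediately shut down again 🤣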

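@spiritLHLS
Collaborator Author

> I've noticed that sometimes a cold restart of a VM from the host or from PVE gets no response and PVE reports a timeout. When that happens I first have to run reboot in the console and then immediately shut down again 🤣

Which provider is this server from? It has so many problems.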

@wbews

wbews commented Aug 22, 2024

鸡仔云. With other providers, do the VMs not lose their network when the host network is restarted?

@spiritLHLS
Collaborator Author

Damn, this provider again. See

#20

This project does not support nesting inside nesting.

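@spiritLHLS spiritLHLS changed the title from "Host: Debian 12 with PVE installed. After the host restarts its network, the VMs' tap devices are lost and are not recreated or reattached; the VM itself has to be shut down and started again / use OVS instead of the bridge to implement NAT" to "Host: Debian 12 with PVE installed. After the host restarts its network, the VMs' tap devices are lost and are not recreated or reattached; the VM itself has to be shut down and started again / use OVS instead of the bridge to implement NAT (actually caused by running this project nested inside a VM created by this project)" Aug 22, 2024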
@wbews

wbews commented Aug 22, 2024

I just looked at #20. That provider is also using this project 🤣

@spiritLHLS
Collaborator Author

Unbelievable. It looks like the nesting is what breaks things, but I don't know exactly where the fault is; that is beyond my knowledge.

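@spiritLHLS spiritLHLS changed the title from "Host: Debian 12 with PVE installed. After the host restarts its network, the VMs' tap devices are lost and are not recreated or reattached; the VM itself has to be shut down and started again / use OVS instead of the bridge to implement NAT (actually caused by running this project nested inside a VM created by this project)" to "(Actually caused by running this project nested inside a VM created by this project; non-nested installs should be unaffected) Host: Debian 12 with PVE installed. After the host restarts its network, the VMs' tap devices are lost and are not recreated or reattached; the VM itself has to be shut down and started again / use OVS instead of the bridge to implement NAT" Aug 22, 2024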
@spiritLHLS
Collaborator Author

Let's leave it like this for now. I'll keep the issue open and close it once someone tracks down the root cause.

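@spiritLHLS
Collaborator Author

spiritLHLS commented Aug 22, 2024

> 鸡仔云. With other providers, do the VMs not lose their network when the host network is restarted?

I have never run into it, and no other user has reported this problem.

Running PVE nested inside a PVE VM that was itself created with this project is a very uncommon setup.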

@wbews

wbews commented Aug 22, 2024

OK, thanks.

@spiritLHLS
Collaborator Author

> OK, thanks.

If you don't need KVM, using LXD/Incus would probably avoid this kind of problem; the configuration should not conflict that way.
