[Help] probe-isolated-devices: Send error request process timeout #21739

Closed
chenjacken opened this issue Dec 4, 2024 · 3 comments
Labels
question Further information is requested

Comments

chenjacken commented Dec 4, 2024

Version: 3.11.8
OS: CentOS 7.9


Viewing the passthrough devices in the web UI reports an error:
[image]

Error Info:

kubectl logs default-host-qksqz -n onecloud -c host --tail 100 -f

[info 2024-12-04 22:34:58 isolated_device.getPassthroughGPUs(gpu.go:86)] filter address [], enableWhiteList: false
[warning 2024-12-04 22:35:04 isolated_device.NewPCIDevice2(gpu.go:241)] fillPCIEInfo for line: "00:16.0 \"Communication controller [0780]\" \"Intel Corporation [8086]\" \"C620 Series Chipset Family MEI Controller #1 [a1ba]\" -r09 \"ASUSTeK Computer Inc. [1043]\" \"Device [871e]\"", device: {}, error: device address is empty: {}
[info 2024-12-04 22:35:04 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address  is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[warning 2024-12-04 22:35:04 isolated_device.NewPCIDevice2(gpu.go:241)] fillPCIEInfo for line: "00:16.1 \"Communication controller [0780]\" \"Intel Corporation [8086]\" \"C620 Series Chipset Family MEI Controller #2 [a1bb]\" -r09 \"ASUSTeK Computer Inc. [1043]\" \"Device [871e]\"", device: {}, error: device address is empty: {}
[info 2024-12-04 22:35:04 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address  is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[warning 2024-12-04 22:35:04 isolated_device.NewPCIDevice2(gpu.go:241)] fillPCIEInfo for line: "00:16.4 \"Communication controller [0780]\" \"Intel Corporation [8086]\" \"C620 Series Chipset Family MEI Controller #3 [a1be]\" -r09 \"ASUSTeK Computer Inc. [1043]\" \"Device [871e]\"", device: {}, error: device address is empty: {}
[info 2024-12-04 22:35:04 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address  is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[warning 2024-12-04 22:35:05 isolated_device.NewPCIDevice2(gpu.go:241)] fillPCIEInfo for line: "00:1c.0 \"PCI bridge [0604]\" \"Intel Corporation [8086]\" \"C620 Series Chipset Family PCI Express Root Port #1 [a190]\" -rf9 \"\" \"\"", device: {}, error: device address is empty: {}
[info 2024-12-04 22:35:05 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address  is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[warning 2024-12-04 22:35:05 isolated_device.NewPCIDevice2(gpu.go:241)] fillPCIEInfo for line: "00:1c.3 \"PCI bridge [0604]\" \"Intel Corporation [8086]\" \"C620 Series Chipset Family PCI Express Root Port #4 [a193]\" -rf9 \"\" \"\"", device: {}, error: device address is empty: {}
[info 2024-12-04 22:35:05 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address  is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-12-04 22:35:07 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address 02:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-12-04 22:35:07 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:397)] PCI address 03:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-12-04 22:35:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"1d:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[info 2024-12-04 22:35:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"20:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[info 2024-12-04 22:35:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"21:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[warning 2024-12-04 22:35:41 appsrv.do_worker_watchdog(workers_watchdog.go:64)] WorkerManager HttpRequestWorkerManager has been busy for 2 cycles...
[info 2024-12-04 22:35:41 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"24:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[warning 2024-12-04 22:35:58 appsrv.(*SWorker).Detach(workers.go:125)] detach worker #24(0xc001f41890, detach) POST /hosts/32bc16ab-0e62-422c-8a79-2b9bcfe27094/probe-isolated-devices(POST /hosts/32bc16ab-0e62-422c-8a79-2b9bcfe27094/probe-isolated-devices) due to reason timeout after 1m0.000226909s
[error 2024-12-04 22:35:58 httperrors.HTTPError(httperrors.go:110)] Send error request process timeout
goroutine 417628 [running]:
runtime/debug.Stack()
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x13
yunion.io/x/onecloud/pkg/httperrors.HTTPError({0x3b72be8?, 0xc001f41a10?}, {0x3b62250?, 0xc001f41710?}, {0x36307e7, 0x17}, 0x1f8, {0x35ece4a, 0xc}, {{0x36307e7, ...}, ...})
        /root/go/src/yunion.io/x/onecloud/pkg/httperrors/httperrors.go:112 +0x3e5
yunion.io/x/onecloud/pkg/httperrors.JsonClientError(...)
        /root/go/src/yunion.io/x/onecloud/pkg/httperrors/httperrors.go:117
yunion.io/x/onecloud/pkg/httperrors.GeneralServerError({0x3b72be8, 0xc001f41a10}, {0x3b62250, 0xc001f41710}, {0x3b50a80?, 0xc000e85ef0?})
        /root/go/src/yunion.io/x/onecloud/pkg/httperrors/httperrors.go:122 +0xcd
yunion.io/x/onecloud/pkg/appsrv.(*Application).defaultHandle(0xc000944000, {0x3b62250?, 0xc001f41710}, 0xc000b36b00, {0xc00089ec78, 0x14})
        /root/go/src/yunion.io/x/onecloud/pkg/appsrv/appsrv.go:425 +0xcb6
yunion.io/x/onecloud/pkg/appsrv.(*Application).ServeHTTP(0xc000944000, {0x3b62940, 0xc001af67e0}, 0xc000b36b00)
        /root/go/src/yunion.io/x/onecloud/pkg/appsrv/appsrv.go:258 +0x20b
net/http.serverHandler.ServeHTTP({0x3b5a758?}, {0x3b62940?, 0xc001af67e0?}, 0x6?)
        /usr/lib/go/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc0015986c0, {0x3b72be8, 0xc000fdcc00})
        /usr/lib/go/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 1
        /usr/lib/go/src/net/http/server.go:3086 +0x5cb
[info 2024-12-04 22:35:58 appsrv.(*Application).ServeHTTP(appsrv.go:289)] eRQl5K-V0hyE4u5NgID8BcyaLww= 504 926596-f3b957-34b980 POST /hosts/32bc16ab-0e62-422c-8a79-2b9bcfe27094/probe-isolated-devices (172.16.0.13:39552:compute_v2) 60000.59ms
[warning 2024-12-04 22:36:11 appsrv.do_worker_watchdog(workers_watchdog.go:64)] WorkerManager HttpRequestWorkerManager has been busy for 3 cycles...
[warning 2024-12-04 22:36:41 appsrv.do_worker_watchdog(workers_watchdog.go:64)] WorkerManager HttpRequestWorkerManager has been busy for 4 cycles...
[info 2024-12-04 22:36:43 isolated_device.(*isolatedDeviceManager).probeGPUS(isolated_device.go:167)] Add GPU device: 0 => &isolated_device.PCIDevice{Addr:"1d:00.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"NVIDIA Corporation", VendorId:"10de", DeviceName:"Device", DeviceId:"2684", SubvendorName:"NVIDIA Corporation", SubvendorId:"10de", SubdeviceName:"Device", SubdeviceId:"167c", ModelName:"", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice{(*isolated_device.PCIDevice)(0xc000249260)}, PCIEInfo:(*compute.IsolatedDevicePCIEInfo)(0xc00198c1c0)}
[info 2024-12-04 22:36:43 isolated_device.(*isolatedDeviceManager).probeGPUS(isolated_device.go:167)] Add GPU device: 1 => &isolated_device.PCIDevice{Addr:"20:00.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"NVIDIA Corporation", VendorId:"10de", DeviceName:"Device", DeviceId:"2684", SubvendorName:"NVIDIA Corporation", SubvendorId:"10de", SubdeviceName:"Device", SubdeviceId:"167c", ModelName:"", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice{(*isolated_device.PCIDevice)(0xc000852460)}, PCIEInfo:(*compute.IsolatedDevicePCIEInfo)(0xc001c913c0)}
[info 2024-12-04 22:36:43 isolated_device.(*isolatedDeviceManager).probeGPUS(isolated_device.go:167)] Add GPU device: 2 => &isolated_device.PCIDevice{Addr:"21:00.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"NVIDIA Corporation", VendorId:"10de", DeviceName:"Device", DeviceId:"2684", SubvendorName:"NVIDIA Corporation", SubvendorId:"10de", SubdeviceName:"Device", SubdeviceId:"167c", ModelName:"", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice{(*isolated_device.PCIDevice)(0xc002b507e0)}, PCIEInfo:(*compute.IsolatedDevicePCIEInfo)(0xc0023b8580)}
[info 2024-12-04 22:36:43 isolated_device.(*isolatedDeviceManager).probeGPUS(isolated_device.go:167)] Add GPU device: 3 => &isolated_device.PCIDevice{Addr:"24:00.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"NVIDIA Corporation", VendorId:"10de", DeviceName:"Device", DeviceId:"2684", SubvendorName:"NVIDIA Corporation", SubvendorId:"10de", SubdeviceName:"Device", SubdeviceId:"167c", ModelName:"", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice{(*isolated_device.PCIDevice)(0xc00299a700)}, PCIEInfo:(*compute.IsolatedDevicePCIEInfo)(0xc0028b9140)}
[info 2024-12-04 22:36:46 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"24:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[info 2024-12-04 22:36:46 isolated_device.SyncDeviceInfo(isolated_device.go:478)] Update 7d2e69e1-e19c-4e0e-8d4c-01f61f9ac443 isolated_device: {"addr":"24:00.0","detected_on_host":true,"dev_type":"GPU-HPC","host_id":"32bc16ab-0e62-422c-8a79-2b9bcfe27094","id":"7d2e69e1-e19c-4e0e-8d4c-01f61f9ac443","model":"Device","numa_node":0,"pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"vendor_device_id":"10de:2684"}
[info 2024-12-04 22:36:46 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"21:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[info 2024-12-04 22:36:46 isolated_device.SyncDeviceInfo(isolated_device.go:478)] Update 7c119893-5082-43e9-80e5-0332923fe051 isolated_device: {"addr":"21:00.0","detected_on_host":true,"dev_type":"GPU-HPC","host_id":"32bc16ab-0e62-422c-8a79-2b9bcfe27094","id":"7c119893-5082-43e9-80e5-0332923fe051","model":"Device","numa_node":0,"pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"vendor_device_id":"10de:2684"}
[info 2024-12-04 22:36:46 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"20:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[info 2024-12-04 22:36:46 isolated_device.SyncDeviceInfo(isolated_device.go:478)] Update eb381ab8-efbf-419d-8a98-02221ab172b8 isolated_device: {"addr":"20:00.0","detected_on_host":true,"dev_type":"GPU-HPC","host_id":"32bc16ab-0e62-422c-8a79-2b9bcfe27094","id":"eb381ab8-efbf-419d-8a98-02221ab172b8","model":"Device","numa_node":0,"pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"vendor_device_id":"10de:2684"}
[info 2024-12-04 22:36:46 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:428)] {"bus_id":"1d:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"subdevice_id":"167c","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver
[info 2024-12-04 22:36:46 isolated_device.SyncDeviceInfo(isolated_device.go:478)] Update f5b40d5a-acc5-4b60-8445-49850afb6b9e isolated_device: {"addr":"1d:00.0","detected_on_host":true,"dev_type":"GPU-HPC","host_id":"32bc16ab-0e62-422c-8a79-2b9bcfe27094","id":"f5b40d5a-acc5-4b60-8445-49850afb6b9e","model":"Device","numa_node":0,"pcie_info":{"lane_width":16,"throughput":"31.50 GB/s","transfer_rate_per_lane":"16GT/s","version":"4.0"},"vendor_device_id":"10de:2684"}

Thanks!!

chenjacken added the question (Further information is requested) label Dec 4, 2024
Member

wanyaoqi commented Dec 9, 2024

@chenjacken Every time this page is opened, the host's passthrough devices are probed again. This endpoint can indeed time out; we will optimize it.

@chenjacken
Author

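> @chenjacken Every time this page is opened, the host's passthrough devices are probed again. This endpoint can indeed time out; we will optimize it.

OK, understood, thanks!!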

@wanyaoqi
Member

fixed: #21815
