[question]: sriov CreateEndpoint failure #4

Open
joaomsoares opened this issue Mar 8, 2018 · 13 comments

@joaomsoares

Hi,
I have a Mellanox Innova IPsec card and I am trying to set up docker with SR-IOV.
I managed to start a container via passthrough; however, I get the following error when trying to start one with SR-IOV:

"docker: Error response from daemon: failed to create endpoint kind_heisenberg on network mynet-sriov: NetworkDriver.CreateEndpoint: All devices in use [ f53229e321b1a7fdce364b6e8b7c749f34000b40075cd13839dc7d6eb98326ab ].."

Any help to overcome this would be appreciated. I tried to understand the problem, and according to the code it seems to be related to the MAC address assignment. Below is the plugin log:

time="2018-03-08T16:57:16Z" level=debug msg="CreateNetwork IPv4Data len : [ 1 ]\n"
time="2018-03-08T16:57:16Z" level=debug msg="parseNetworkGenericOptions map[mode:sriov netdevice:enp4s0]"
max_vfs = 4
cur_vfs = 0
max_vfs = 4
time="2018-03-08T16:57:25Z" level=debug msg="DiscoverVF vfDev list length : [4]"
time="2018-03-08T16:57:25Z" level=debug msg="SRIOV CreateNetwork : [f53229e321b1a7fdce364b6e8b7c749f34000b40075cd13839dc7d6eb98326ab] IPv4Data : [ &{AddressSpace:LocalDefault Pool:194.168.1.0/24 Gateway:194.168.1.1/24 AuxAddresses:map[]} ]\n"
time="2018-03-08T16:57:38Z" level=debug msg="CreateEndpoint Called: [ &{NetworkID:f53229e321b1a7fdce364b6e8b7c749f34000b40075cd13839dc7d6eb98326ab EndpointID:ebb3c7d220ade467b8174e70ebe39232faecb98ce0bee7369e48851896173d5c Interface:0xc4201b20c0 Options:map[com.docker.network.endpoint.exposedports:[] com.docker.network.portmap:[]]} ]"
time="2018-03-08T16:57:38Z" level=debug msg="r.Interface: [ &{Address:194.168.1.2/24 AddressIPv6: MacAddress:} ]"

And here is the output of "ip link show":

6: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 24:8a:07:ad:54:f2 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto
vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

@paravmellanox
Collaborator

Hi,

I need some more information.
Did you assign the MAC address using the #ip link set vf command before starting the container?
00:22:33:44:55:66 looks like a human-assigned MAC address.

Did you start the container with the --mac-address= option?
Currently the plugin considers the MAC address of the VF's netdevice, not the one assigned using the ip link set command.

Something looks wrong with the rest of the VF MAC addresses being zero.

Some notes:
When using the plugin, the user should not modify the MAC addresses of the VFs at any time.
The plugin does the assignment of MAC addresses.
If you wish to pick a specific VF by MAC address, then you should run #ip link show.
This lists the netdevs of all the VFs; pick the MAC address of one of those netdevs.
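
To spot VFs whose MAC was set by hand on the PF, the "vf N MAC ..." lines of ip link show can be scanned. The following is a minimal sketch, not part of the plugin: the function name admin_vf_macs is hypothetical, and it assumes the line format shown earlier in this thread (other iproute2 versions may print VF lines differently).

```python
import re

# Matches lines such as:
#   "vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto"
# Assumption: this is the format printed in this thread.
VF_LINE = re.compile(r"^\s*vf\s+(\d+)\s+MAC\s+([0-9a-fA-F:]{17})")

ZERO_MAC = "00:00:00:00:00:00"


def admin_vf_macs(ip_link_output):
    """Return {vf_index: mac} for VFs whose MAC on the PF is non-zero."""
    result = {}
    for line in ip_link_output.splitlines():
        m = VF_LINE.match(line)
        if m and m.group(2) != ZERO_MAC:
            result[int(m.group(1))] = m.group(2).lower()
    return result


sample = """\
6: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
link/ether 24:8a:07:ad:54:f2 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto
vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
"""

print(admin_vf_macs(sample))  # {0: '00:22:33:44:55:66'}
```

An empty result means no VF carries a hand-assigned MAC on the PF side.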

To avoid such hassle, you can use this support script:
https://github.com/Mellanox/container_scripts/blob/master/docker_sriov_roce_mgmt

For example:
docker network create -d passthrough --subnet=194.168.1.0/24 -o netdevice=enp4s0 -o mode=sriov nw1 (you already successfully did this)
Now,
./docker_sriov_roce_mgmt list_netdevs enp4s0

Now that you know which netdev you want to use,
./docker_sriov_roce_mgmt netdev2mac
This gives you the MAC address of the netdev you want to use.
docker run --mac-address= --net=nw1 <other_options>

Or you can skip the steps above and use this wrapper instead:
./docker_sriov_roce_mgmt run --netdev= --net=nw1 <other_arguments>

If you are not choosy about which VF to use, then you can rely entirely on the plugin to find a free VF for you. In simpler configurations,
you can just do
docker run --net=nw1 <other_options>
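
Listing the netdevs of the VFs comes down to reading sysfs under /sys/class/net/<PF>/device/virtfnN/net. The sketch below illustrates that lookup; it is an assumption-labeled example, not the support script's or the plugin's code. The function name vf_netdevs and the sysfs_root parameter (added so the function can be pointed at a test directory) are hypothetical.

```python
import os


def vf_netdevs(pf, sysfs_root="/sys"):
    """Map each virtfnN entry of a PF to the netdev names found under it.

    An empty list means the VF exists as a PCI function but no netdev
    was created for it (for example, because no driver is bound).
    """
    dev_dir = os.path.join(sysfs_root, "class", "net", pf, "device")
    found = {}
    for entry in sorted(os.listdir(dev_dir)):
        if not entry.startswith("virtfn"):
            continue
        net_dir = os.path.join(dev_dir, entry, "net")
        # isdir() follows the virtfnN symlink that real sysfs uses.
        if os.path.isdir(net_dir):
            found[entry] = sorted(os.listdir(net_dir))
        else:
            found[entry] = []
    return found
```

On a host, vf_netdevs("enp4s0") would return one entry per VF. If every list is empty, the VFs have no netdevices and a free VF cannot be found by MAC address.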

@joaomsoares
Author

joaomsoares commented Mar 9, 2018

Hi,
Thanks for the super quick answer. But I still can't make it work.
You were right, the first address had been assigned manually via the #ip link set, but this was a mistake of mine. I tried to set it back to 0 (#ip link set ... mac 0), and do #docker run, but the error is still there. In fact, I even removed the VFs, and brought them back up, but still no luck.

docker run --net=mynet-sriov -it a1a3b055c1f9 bin/bash
docker: Error response from daemon: failed to create endpoint relaxed_almeida on network mynet-sriov: NetworkDriver.CreateEndpoint: All devices in use [ e4df4be7ea439460dc16a9952e4b3e508482abd49700603d5ae307da3b918769 ]..

I even tried to run the manual script, and found another "error", which makes me think there might be some other issue (?)

./docker_sriov_roce_mgmt list_netdevs enp4s0
list_netdevs enp4s0
ls: cannot access '/sys/class/net/enp4s0/device/virtfn0/net': No such file or directory
ls: cannot access '/sys/class/net/enp4s0/device/virtfn1/net': No such file or directory
ls: cannot access '/sys/class/net/enp4s0/device/virtfn2/net': No such file or directory
ls: cannot access '/sys/class/net/enp4s0/device/virtfn3/net': No such file or directory

Any further hints to overcome this are most welcome!

@paravmellanox
Collaborator

This seems like an issue that is not related to this plugin.
Can you please share the output of

  1. uname -a
  2. ls -l /sys/class/net/enp4s0/
  3. ls -l /sys/class/net/enp4s0/virtfn0/
  4. ip link show

@joaomsoares
Author

Here they are:

  1. uname -a

Linux ct-analytcis-2 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  2. ls -l /sys/class/net/enp4s0/

total 0
-r--r--r-- 1 root root 4096 Mar 9 16:40 addr_assign_type
-r--r--r-- 1 root root 4096 Mar 9 16:40 address
-r--r--r-- 1 root root 4096 Mar 9 16:40 addr_len
-r--r--r-- 1 root root 4096 Mar 9 16:40 broadcast
-rw-r--r-- 1 root root 4096 Mar 9 16:40 carrier
-r--r--r-- 1 root root 4096 Mar 9 16:40 carrier_changes
drwxr-xr-x 2 root root 0 Mar 9 16:40 debug
lrwxrwxrwx 1 root root 0 Mar 9 16:40 device -> ../../../0000:04:00.0
-r--r--r-- 1 root root 4096 Mar 9 16:40 dev_id
-r--r--r-- 1 root root 4096 Mar 9 16:40 dev_port
-r--r--r-- 1 root root 4096 Mar 9 16:40 dormant
-r--r--r-- 1 root root 4096 Mar 9 16:40 duplex
drwxr-xr-x 4 root root 0 Mar 9 16:40 ecn
-rw-r--r-- 1 root root 4096 Mar 9 16:40 flags
-rw-r--r-- 1 root root 4096 Mar 9 16:40 gro_flush_timeout
-rw-r--r-- 1 root root 4096 Mar 9 16:40 ifalias
-r--r--r-- 1 root root 4096 Mar 9 16:40 ifindex
-r--r--r-- 1 root root 4096 Mar 9 16:40 iflink
-r--r--r-- 1 root root 4096 Mar 9 16:40 link_mode
-rw-r--r-- 1 root root 4096 Mar 9 16:40 mtu
-r--r--r-- 1 root root 4096 Mar 9 16:40 name_assign_type
-rw-r--r-- 1 root root 4096 Mar 9 16:40 netdev_group
-r--r--r-- 1 root root 4096 Mar 9 16:40 operstate
-r--r--r-- 1 root root 4096 Mar 9 16:40 phys_port_id
-r--r--r-- 1 root root 4096 Mar 9 16:40 phys_port_name
-r--r--r-- 1 root root 4096 Mar 9 16:40 phys_switch_id
drwxr-xr-x 2 root root 0 Mar 9 16:40 power
-rw-r--r-- 1 root root 4096 Mar 9 16:40 proto_down
drwxr-xr-x 2 root root 0 Mar 9 16:40 qos
drwxr-xr-x 66 root root 0 Mar 9 16:40 queues
drwxr-xr-x 2 root root 0 Mar 9 16:40 settings
-r--r--r-- 1 root root 4096 Mar 9 16:40 speed
drwxr-xr-x 2 root root 0 Mar 9 16:40 statistics
lrwxrwxrwx 1 root root 0 Mar 9 16:40 subsystem -> ../../../../../../class/net
-rw-r--r-- 1 root root 4096 Mar 9 16:40 tx_queue_len
-r--r--r-- 1 root root 4096 Mar 9 16:40 type
-rw-r--r-- 1 root root 4096 Mar 9 16:40 uevent

  3. I assume you meant ls -l /sys/class/net/enp4s0/device/virtfn0/

total 0

-rw-r--r-- 1 root root 4096 Mar 9 16:42 broken_parity_status
-r--r--r-- 1 root root 4096 Mar 9 16:42 class
-rw-r--r-- 1 root root 4096 Mar 9 16:42 config
-r--r--r-- 1 root root 4096 Mar 9 16:42 consistent_dma_mask_bits
-rw-r--r-- 1 root root 4096 Mar 9 16:42 d3cold_allowed
-r--r--r-- 1 root root 4096 Mar 9 16:42 device
-r--r--r-- 1 root root 4096 Mar 9 16:42 dma_mask_bits
-rw-r--r-- 1 root root 4096 Mar 9 16:42 driver_override
-rw-r--r-- 1 root root 4096 Mar 9 16:42 enable
-r--r--r-- 1 root root 4096 Mar 9 16:42 irq
-r--r--r-- 1 root root 4096 Mar 9 16:42 local_cpulist
-r--r--r-- 1 root root 4096 Mar 9 16:42 local_cpus
-r--r--r-- 1 root root 4096 Mar 9 16:42 modalias
-rw-r--r-- 1 root root 4096 Mar 9 16:42 msi_bus
-rw-r--r-- 1 root root 4096 Mar 9 16:42 numa_node
lrwxrwxrwx 1 root root 0 Mar 9 16:42 physfn -> ../0000:04:00.0
drwxr-xr-x 2 root root 0 Mar 9 16:42 power
--w------- 1 root root 4096 Mar 9 16:42 reset
-r--r--r-- 1 root root 4096 Mar 9 16:42 resource
-rw------- 1 root root 2097152 Mar 9 16:42 resource0
-rw------- 1 root root 2097152 Mar 9 16:42 resource0_wc
lrwxrwxrwx 1 root root 0 Mar 9 11:08 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root 4096 Mar 9 16:42 subsystem_device
-r--r--r-- 1 root root 4096 Mar 9 16:42 subsystem_vendor
-rw-r--r-- 1 root root 4096 Mar 9 11:08 uevent
-r--r--r-- 1 root root 4096 Mar 9 11:08 vendor
-rw------- 1 root root 32768 Mar 9 16:42 vpd

  4. ip link show

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 54:9f:35:20:8f:f8 brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 54:9f:35:20:8f:f9 brd ff:ff:ff:ff:ff:ff
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 54:9f:35:20:8f:fa brd ff:ff:ff:ff:ff:ff
5: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 54:9f:35:20:8f:fb brd ff:ff:ff:ff:ff:ff
6: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 24:8a:07:ad:54:f2 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
7: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:e4:81:98:33 brd ff:ff:ff:ff:ff:ff

@paravmellanox
Collaborator

From the output of command 4, it appears that the netdevices for the VFs were not created for some reason. I suggest that you talk to Mellanox tech support first to make sure these netdevices show up.
You should share /var/log/messages along with the output of
ls -l /sys/class/net/enp4s0/device/

@joaomsoares
Author

Thanks for the reply. You mean output of command 4 or command 3? What should be the expected outcome of the command?
In the meantime I'll reach out to Mellanox tech support as well.

@paravmellanox
Collaborator

The 4th command: ip link show
It needs to show the list of netdevices that belong to these VFs.
Sometimes the vfio driver takes over the VFs if a past KVM setup/configuration exists. In that case the netdevices may not be created.
So let us first make sure that the netdevices of the VFs are created. If you can share /var/log/messages, it will give some quick hints.

@paravmellanox
Collaborator

If you share the (pretty long) output of lspci -vvv, it will show which driver (mlx5_core or vfio) owns the VFs, which might shed light on why the netdevices are not created.
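
Because the lspci -vvv output is long, the driver binding can be extracted from it mechanically. The sketch below is a hedged illustration only: the function name drivers_in_use is hypothetical, and it relies on lspci printing a "Kernel driver in use:" line only for devices that have a driver bound. A device that shows "Kernel modules:" but no "Kernel driver in use:" is reported as None.

```python
import re

# Device header lines start with a PCI address such as "04:00.1",
# optionally prefixed with a domain ("0000:04:00.1").
HEADER = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)
DRIVER = re.compile(r"^\s*Kernel driver in use:\s*(\S+)")


def drivers_in_use(lspci_output):
    """Return {pci_address: driver_name or None} for each listed device."""
    drivers = {}
    current = None
    for line in lspci_output.splitlines():
        h = HEADER.match(line)
        if h:
            current = h.group(1)
            drivers[current] = None  # stays None unless a driver line follows
            continue
        d = DRIVER.match(line)
        if d and current is not None:
            drivers[current] = d.group(1)
    return drivers
```

Feeding it the saved output (for example, the text of lspci -vvv redirected to a file) gives a compact view of which functions have mlx5_core or vfio-pci bound and which have nothing bound at all.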

@joaomsoares
Author

Trying to shorten the output (including the native card and one VF; it seems mlx5_core owns both):

04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 66
Region 0: Memory at 33ffc000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: Innova IPsec 4 Lx EN Adapter, single-port QSFP, 10/40GbE, PCIe3.0 x8, HHHL, tall bracket, ROHS R6
Read-only fields:
[PN] Part number: MNV101511A-BCIT
[EC] Engineering changes: A6
[V2] Vendor specific: MNV101511A-BCIT
[SN] Serial number: MT1712X01617
[V3] Vendor specific: bef70ecc3f0fe7118000248a07ad54f2
[VA] Vendor specific: MLX:MODL=CX4732A:MN=MLNX:CSKU=V2:UUID=V3:PCI=0
[V0] Vendor specific: PCIeGen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
IOVSta: Migration-
Initial VFs: 4, Total VFs: 4, Number of VFs: 4, Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1016
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000033ffe000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Region 0: [virtual] Memory at 33ffe000000 (64-bit, prefetchable) [size=2M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: Innova IPsec 4 Lx EN Adapter, single-port QSFP, 10/40GbE, PCIe3.0 x8, HHHL, tall bracket, ROHS R6
Read-only fields:
[PN] Part number: MNV101511A-BCIT
[EC] Engineering changes: A6
[V2] Vendor specific: MNV101511A-BCIT
[SN] Serial number: MT1712X01617
[V3] Vendor specific: bef70ecc3f0fe7118000248a07ad54f2
[VA] Vendor specific: MLX:MODL=CX4732A:MN=MLNX:CSKU=V2:UUID=V3:PCI=0
[V0] Vendor specific: PCIeGen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Kernel modules: mlx5_core

@joaomsoares
Author

As for /var/log/messages ... which one are we talking about?

ls -l /var/log/
total 42472
-rw-r--r-- 1 root root 0 Feb 1 07:35 alternatives.log
-rw-r--r-- 1 root root 49904 Jan 31 15:22 alternatives.log.1
-rw-r--r-- 1 root root 1731 Jun 1 2015 alternatives.log.2.gz
-rw-r--r-- 1 root root 3401 May 27 2015 alternatives.log.3.gz
-rw-r----- 1 root adm 0 Mar 10 07:35 apport.log
-rw-r----- 1 root adm 113 Mar 9 11:05 apport.log.1
-rw-r----- 1 root adm 354 Jul 2 2015 apport.log.2.gz
-rw-r----- 1 root adm 341 Jun 23 2015 apport.log.3.gz
-rw-r----- 1 root adm 305 Jun 17 2015 apport.log.4.gz
-rw-r----- 1 root adm 339 Jun 16 2015 apport.log.5.gz
-rw-r----- 1 root adm 270 Jun 15 2015 apport.log.6.gz
-rw-r----- 1 root adm 430 Jun 3 2015 apport.log.7.gz
drwxr-xr-x 2 root root 4096 Mar 1 07:35 apt
-rw-r----- 1 syslog adm 91118 Mar 14 18:17 auth.log
-rw-r----- 1 syslog adm 97196 Mar 11 07:30 auth.log.1
-rw-r----- 1 syslog adm 8516 Mar 5 07:30 auth.log.2.gz
-rw-r----- 1 syslog adm 1662 Feb 25 07:30 auth.log.3.gz
-rw-r----- 1 syslog adm 2167 Feb 19 07:30 auth.log.4.gz
-rw-r--r-- 1 root root 141 Mar 14 18:03 boot.log
-rw-r--r-- 1 root root 61499 Feb 18 2015 bootstrap.log
-rw------- 1 root utmp 4992 Mar 14 17:49 btmp
-rw-rw---- 1 root utmp 768 Feb 27 12:29 btmp.1
drwxr-xr-x 2 root root 4096 Mar 14 18:35 containers
drwxr-xr-x 2 root root 4096 Mar 14 07:35 cups
drwxr-xr-x 3 root root 4096 Jan 31 11:37 dist-upgrade
-rw-r----- 1 root adm 107486 Jan 31 10:52 dmesg
-rw-r----- 1 root adm 109481 Jan 31 09:37 dmesg.0
-rw-r----- 1 root adm 20574 Dec 16 13:52 dmesg.1.gz
-rw-r----- 1 root adm 20393 Sep 21 09:13 dmesg.2.gz
-rw-r----- 1 root adm 20927 Sep 4 2017 dmesg.3.gz
-rw-r----- 1 root adm 20854 Mar 2 2017 dmesg.4.gz
-rw-r--r-- 1 root root 507015 Mar 13 11:47 dpkg.log
-rw-r--r-- 1 root root 12259 Feb 27 14:46 dpkg.log.1
-rw-r--r-- 1 root root 216904 Jan 31 15:27 dpkg.log.2.gz
-rw-r--r-- 1 root root 431 Sep 17 2015 dpkg.log.3.gz
-rw-r--r-- 1 root root 17023 Jun 1 2015 dpkg.log.4.gz
-rw-r--r-- 1 root root 117158 May 27 2015 dpkg.log.5.gz
-rw-r--r-- 1 root root 32288 Jan 31 11:28 faillog
-rw-r--r-- 1 root root 4303 Jan 31 11:37 fontconfig.log
drwxr-xr-x 2 root root 4096 Feb 18 2015 fsck
-rw-r--r-- 1 root root 1860 Mar 14 18:03 gpu-manager.log
drwxr-xr-x 3 root root 4096 Feb 18 2015 hp
drwxrwxr-x 2 root root 4096 May 26 2015 installer
-rw-r----- 1 syslog adm 2518583 Mar 14 18:35 kern.log
-rw-r----- 1 syslog adm 2399042 Mar 11 07:33 kern.log.1
-rw-r----- 1 syslog adm 101856 Mar 4 07:29 kern.log.2.gz
-rw-r----- 1 syslog adm 2766 Feb 27 15:00 kern.log.3.gz
-rw-r----- 1 syslog adm 900 Feb 22 15:39 kern.log.4.gz
-rw-rw-r-- 1 root utmp 294628 Mar 14 18:04 lastlog
drwxr-xr-x 2 root root 4096 Mar 14 07:35 lightdm
-rw-r--r-- 1 root root 0 Feb 1 07:35 pm-powersave.log
-rw-r--r-- 1 root root 16078 Jan 31 10:52 pm-powersave.log.1
-rw-r--r-- 1 root root 870 Dec 16 13:52 pm-powersave.log.2.gz
-rw-r--r-- 1 root root 870 Sep 21 09:13 pm-powersave.log.3.gz
-rw-r--r-- 1 root root 841 Sep 4 2017 pm-powersave.log.4.gz
drwxr-xr-x 9 root root 4096 Mar 14 16:43 pods
drwxr-xr-x 2 root root 4096 Jun 1 2015 rstudio-server
drwxr-x--- 2 root adm 4096 Jan 29 2015 samba
drwx------ 2 speech-dispatcher root 4096 Feb 19 2014 speech-dispatcher
-rw-r----- 1 syslog adm 10109449 Mar 14 18:36 syslog
-rw-r----- 1 syslog adm 20227869 Mar 14 07:35 syslog.1
-rw-r----- 1 syslog adm 1032663 Mar 13 07:35 syslog.2.gz
-rw-r----- 1 syslog adm 868848 Mar 12 07:35 syslog.3.gz
-rw-r----- 1 syslog adm 872004 Mar 11 07:35 syslog.4.gz
-rw-r----- 1 syslog adm 1055977 Mar 10 07:35 syslog.5.gz
-rw-r----- 1 syslog adm 947589 Mar 9 07:35 syslog.6.gz
-rw-r----- 1 syslog adm 765619 Mar 8 07:35 syslog.7.gz
-rw-r--r-- 1 root root 631464 Jan 31 10:51 udev
drwxr-x--- 2 root adm 4096 May 26 2015 unattended-upgrades
drwxr-xr-x 2 root root 12288 Feb 2 07:35 upstart
-rw-rw-r-- 1 root utmp 95616 Mar 14 18:04 wtmp
-rw-rw-r-- 1 root utmp 5376 Feb 27 18:30 wtmp.1
-rw-r--r-- 1 root root 24489 Mar 14 18:03 Xorg.0.log
-rw-r--r-- 1 root root 24721 Mar 14 17:53 Xorg.0.log.old

@paravmellanox
Collaborator

/var/log/syslog and /var/log/dmesg should have some driver failure logs for the VFs.

@joaomsoares
Author

Right...dmesg shows some errors:

Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.310139] (0000:04:00.0): E-Switch: E-Switch enable SRIOV: nvfs(4) mode (1)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.458403] (0000:04:00.0): E-Switch: SRIOV enabled: active vports(5)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.562856] pci 0000:04:00.1: [15b3:1016] type 00 class 0x020000
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.563383] pci 0000:04:00.1: Max Payload Size set to 256 (was 128, max 512)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.563963] iommu: Adding device 0000:04:00.1 to group 48
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564165] mlx5_core 0000:04:00.1: enabling device (0000 -> 0002)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564634] mlx5_core 0000:04:00.1: firmware version: 14.98.3410
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564674] mlx5_core 0000:04:00.1: mlx5_pcie_print_link_status:411:(pid 143482): PCIe width is lower than device's capability
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564678] mlx5_core 0000:04:00.1: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564681] mlx5_core 0000:04:00.1: PCIe link width is x0, device supports x8
Mar 15 14:21:58 ct-analytcis-2 kernel: [73144.564751] DMAR: 64bit 0000:04:00.1 uses identity mapping
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.237041] mlx5_core 0000:04:00.1: mlx5_cmd_check:731:(pid 143482): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x5a98c0)
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.237048] mlx5_core 0000:04:00.1: FPGA: mlx5_fpga_device_load_check:152:(pid 143482): Failed to query status: -22
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.237051] mlx5_core 0000:04:00.1: fpga device start failed -22
Mar 15 14:21:58 ct-analytcis-2 kernel: [73145.259140] mlx5_core 0000:04:00.1: tools char device 243:2 destroyed
Mar 15 14:21:59 ct-analytcis-2 kernel: [73145.637372] mlx5_core 0000:04:00.1: mlx5_load_one failed with error code -22
Mar 15 14:21:59 ct-analytcis-2 kernel: [73145.637538] mlx5_core: probe of 0000:04:00.1 failed with error -22

@paravmellanox
Collaborator

Now it makes sense. It seems the driver fails to load on the VF with the given error. This is helpful. I suggest you contact tech support to get this error resolved, leaving the plugin and containers out of the picture to get faster results.
Once that is done, it is likely that the plugin will work. I do not have access to Innova cards; this piece of software is not in my domain.

I will add more checks at the plugin level to make sure that network creation fails if it encounters this kind of unexpected error (instead of failing at container creation time).
Thanks for the logs.
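
For anyone hitting the same symptom, the kernel log can be scanned for failed driver probes before involving the plugin. This is a minimal sketch and not the plugin-level check mentioned above (that check is not shown in this thread). The function name probe_failures is hypothetical, and the pattern assumes the "probe of ... failed with error ..." wording seen in the dmesg lines posted here.

```python
import re

# Matches e.g. "mlx5_core: probe of 0000:04:00.1 failed with error -22"
PROBE_FAIL = re.compile(r"(\S+): probe of (\S+) failed with error (-?\d+)")


def probe_failures(kernel_log):
    """Return a list of (driver, pci_address, error_code) tuples."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in PROBE_FAIL.finditer(kernel_log)
    ]


line = (
    "Mar 15 14:21:59 ct-analytcis-2 kernel: [73145.637538] "
    "mlx5_core: probe of 0000:04:00.1 failed with error -22"
)
print(probe_failures(line))  # [('mlx5_core', '0000:04:00.1', -22)]
```

Error -22 corresponds to EINVAL on Linux, which matches the "bad parameter" status reported by the firmware command earlier in the log.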
