io-engine crashes - ERROR metrics_exporter_io_engine::client::grpc_client: Grpc connection timeout, retrying after 10s #1763

Open
innotecsol opened this issue Oct 27, 2024 · 1 comment

@innotecsol

Describe the bug
I am trying to upgrade the underlying Linux distribution from Talos 1.6.7 to 1.7.7.
After upgrading one node to the new Talos version, the io-engine pod goes into CrashLoopBackOff with the error:

2024-10-27T12:29:59.854755Z ERROR metrics_exporter_io_engine::client::grpc_client: Grpc connection timeout, retrying after 10s
    at metrics-exporter/src/bin/io_engine/client/grpc_client.rs:86

As I read in openebs/openebs#3769, I tried to disable the metrics:

helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --version 2.7.1 --set base.metrics.enabled=true

After restarting the pod I get another error:

kubectl logs -n mayastor             mayastor-io-engine-xrdsg
Defaulted container "io-engine" out of: io-engine, agent-core-grpc-probe (init), etcd-probe (init)
[2024-10-27T12:40:47.388554894+00:00  INFO io_engine:io-engine.rs:253] Engine responsible for managing I/Os version 1.0.0, revision 40572c92e3d5 (v2.7.1+0)
[2024-10-27T12:40:47.388692355+00:00  WARN io_engine:io-engine.rs:172] Failed to read the number of pages at /sys/kernel/mm/hugepages/hugepages-1048576kB error=No such file or directory (os error 2)
[2024-10-27T12:40:47.388718884+00:00  WARN io_engine:io-engine.rs:189] Failed to read the number of free pages at /sys/kernel/mm/hugepages/hugepages-1048576kB error=No such file or directory (os error 2)
[2024-10-27T12:40:47.388730070+00:00  INFO io_engine:io-engine.rs:232] free_pages 2MB: 2048 nr_pages 2MB: 2048
[2024-10-27T12:40:47.388736423+00:00  INFO io_engine:io-engine.rs:233] free_pages 1GB: 0 nr_pages 1GB: 0
[2024-10-27T12:40:47.388840199+00:00  INFO io_engine:io-engine.rs:285] kernel io_uring support: yes
[2024-10-27T12:40:47.388850902+00:00  INFO io_engine:io-engine.rs:289] kernel nvme initiator multipath support: disabled
[2024-10-27T12:40:47.388875069+00:00  INFO io_engine::core::env:env.rs:945] loading mayastor config YAML file /var/local/mayastor/io-engine/config.yaml
[2024-10-27T12:40:47.388900357+00:00  INFO io_engine::subsys::config:mod.rs:189] Config file /var/local/mayastor/io-engine/config.yaml is empty, reverting to default config
[2024-10-27T12:40:47.388915294+00:00  INFO io_engine::subsys::config::opts:opts.rs:169] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2024-10-27T12:40:47.388922562+00:00  INFO io_engine::subsys::config::opts:opts.rs:169] Overriding NVMF_TCP_MAX_QPAIRS_PER_CTRL value to '32'
[2024-10-27T12:40:47.388932887+00:00  INFO io_engine::subsys::config::opts:opts.rs:236] Overriding NVME_TIMEOUT value to '110s'
[2024-10-27T12:40:47.388939737+00:00  INFO io_engine::subsys::config::opts:opts.rs:236] Overriding NVME_TIMEOUT_ADMIN value to '30s'
[2024-10-27T12:40:47.388945909+00:00  INFO io_engine::subsys::config::opts:opts.rs:236] Overriding NVME_KATO value to '10s'
[2024-10-27T12:40:47.388960997+00:00  INFO io_engine::subsys::config:mod.rs:240] Applying Mayastor configuration settings
[2024-10-27T12:40:47.388972279+00:00  INFO io_engine::subsys::config::opts:opts.rs:363] NVMe Bdev options successfully applied
[2024-10-27T12:40:47.388979464+00:00  INFO io_engine::subsys::config::opts:opts.rs:500] Bdev options successfully applied
[2024-10-27T12:40:47.388986860+00:00  INFO io_engine::subsys::config::opts:opts.rs:637] Socket options successfully applied
[2024-10-27T12:40:47.388992751+00:00  INFO io_engine::subsys::config::opts:opts.rs:668] I/O buffer options successfully applied
[2024-10-27T12:40:47.388997934+00:00  INFO io_engine::subsys::config:mod.rs:246] Config {
    source: Some(
        "/var/local/mayastor/io-engine/config.yaml",
    ),
    nvmf_tcp_tgt_conf: NvmfTgtConfig {
        name: "mayastor_target",
        max_namespaces: 2048,
        crdt: [
            30,
            0,
            0,
        ],
        opts: NvmfTcpTransportOpts {
            max_queue_depth: 32,
            max_qpairs_per_ctrl: 32,
            in_capsule_data_size: 4096,
            max_io_size: 131072,
            io_unit_size: 131072,
            max_aq_depth: 32,
            num_shared_buf: 2047,
            buf_cache_size: 64,
            dif_insert_or_strip: false,
            abort_timeout_sec: 1,
            acceptor_poll_rate: 10000,
            zcopy: true,
        },
        interface: None,
    },
    nvme_bdev_opts: NvmeBdevOpts {
        action_on_timeout: 4,
        timeout_us: 110000000,
        timeout_admin_us: 30000000,
        keep_alive_timeout_ms: 10000,
        transport_retry_count: 0,
        arbitration_burst: 0,
        low_priority_weight: 0,
        medium_priority_weight: 0,
        high_priority_weight: 0,
        nvme_adminq_poll_period_us: 1000,
        nvme_ioq_poll_period_us: 0,
        io_queue_requests: 0,
        delay_cmd_submit: true,
        bdev_retry_count: 0,
        transport_ack_timeout: 0,
        ctrlr_loss_timeout_sec: 0,
        reconnect_delay_sec: 0,
        fast_io_fail_timeout_sec: 0,
        disable_auto_failback: false,
        generate_uuids: true,
    },
    bdev_opts: BdevOpts {
        bdev_io_pool_size: 65535,
        bdev_io_cache_size: 512,
    },
    nexus_opts: NexusOpts {
        nvmf_enable: true,
        nvmf_discovery_enable: true,
        nvmf_nexus_port: 4421,
        nvmf_replica_port: 8420,
    },
    socket_opts: PosixSocketOpts {
        recv_buf_size: 2097152,
        send_buf_size: 2097152,
        enable_recv_pipe: true,
        enable_zero_copy_send: true,
        enable_quickack: true,
        enable_placement_id: 0,
        enable_zerocopy_send_server: true,
        enable_zerocopy_send_client: false,
        zerocopy_threshold: 0,
    },
    iobuf_opts: IoBufOpts {
        small_pool_count: 8192,
        large_pool_count: 2048,
        small_bufsize: 8192,
        large_bufsize: 135168,
    },
    eal_opts: EalOpts {
        reactor_mask: None,
        core_list: None,
        developer_delay: None,
    },
}
EAL: alloc_pages_on_heap(): couldn't allocate memory due to IOVA exceeding limits of current DMA mask
EAL: alloc_pages_on_heap(): Please try initializing EAL with --iova-mode=pa parameter
EAL: error allocating rte services array
EAL: FATAL: rte_service_init() failed
EAL: rte_service_init() failed
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:786:13
stack backtrace:
   0: std::panicking::begin_panic
   1: io_engine::core::env::MayastorEnvironment::init
   2: io_engine::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I am not sure whether the first error, from the metrics exporter, is just hiding the latter one.
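
To compare runs before and after the metrics change, it may help to pull only the hugepage counters and the EAL lines out of the io-engine log. The following is a minimal sketch, assuming the log format shown above; `summarize_log` is a hypothetical helper written for illustration and is not part of Mayastor.

```python
import re

# Matches io-engine summary lines such as:
#   free_pages 2MB: 2048 nr_pages 2MB: 2048
HUGEPAGE_RE = re.compile(r"free_pages (\w+): (\d+) nr_pages (\w+): (\d+)")


def summarize_log(text):
    """Return (hugepage counters, EAL lines) found in an io-engine log.

    Hypothetical helper: it only understands the two line shapes
    that appear in the log pasted in this issue.
    """
    hugepages = {}
    eal_lines = []
    for line in text.splitlines():
        match = HUGEPAGE_RE.search(line)
        if match:
            size, free, _, total = match.groups()
            hugepages[size] = {"free": int(free), "total": int(total)}
        elif line.startswith("EAL:"):
            eal_lines.append(line)
    return hugepages, eal_lines


# Excerpt copied from the log in this issue.
sample = """\
[2024-10-27T12:40:47.388730070+00:00  INFO io_engine:io-engine.rs:232] free_pages 2MB: 2048 nr_pages 2MB: 2048
[2024-10-27T12:40:47.388736423+00:00  INFO io_engine:io-engine.rs:233] free_pages 1GB: 0 nr_pages 1GB: 0
EAL: alloc_pages_on_heap(): couldn't allocate memory due to IOVA exceeding limits of current DMA mask
EAL: FATAL: rte_service_init() failed
"""

hugepages, eal_lines = summarize_log(sample)
print(hugepages)
print(eal_lines[-1])
```

For the excerpt above this reports the 2MB pool as fully free and the 1GB pool as empty, followed by the EAL failure. The metrics exporter timeout does not appear in the io-engine container log, which suggests it is reported by a different container.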

To Reproduce
Upgrade Talos from version 1.6.7 to 1.7.7.
Talos 1.7.7 starts up and all other pods run fine:

talos1.7.7.log

Expected behavior
io-engine pod to start up

OS info

  • Distro:
    Talos 1.7.7
  • Kernel version
    Linux version 6.6.52-talos (@buildkitsandbox) (gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP Tue Sep 24 15:57:34 UTC 2024
  • MayaStor revision or container image
    2.7.1

Additional context

@tiagolobocastro
Contributor

Hmm, have you modified the hugepages recently?
May I suggest deleting the io-engine pod which is hitting this error?
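
One way to check the hugepage state on the node is to read the kernel's sysfs counters directly. This is a rough sketch, assuming the script runs somewhere with access to the node's `/sys` (for example a privileged debug pod; Talos itself has no shell). `read_hugepages` is a hypothetical helper name. The `nr_hugepages` and `free_hugepages` files are the standard Linux sysfs entries for each pool.

```python
from pathlib import Path


def read_hugepages(base="/sys/kernel/mm/hugepages"):
    """Return {pool_name: (nr_hugepages, free_hugepages)} for each pool under base.

    Pools that are missing or unreadable are skipped. A missing 1GB pool
    matches the "No such file or directory" warning in the io-engine log.
    """
    pools = {}
    root = Path(base)
    if not root.is_dir():
        return pools
    for pool in sorted(root.glob("hugepages-*kB")):
        try:
            total = int((pool / "nr_hugepages").read_text().strip())
            free = int((pool / "free_hugepages").read_text().strip())
        except (OSError, ValueError):
            continue
        pools[pool.name] = (total, free)
    return pools


if __name__ == "__main__":
    for name, (total, free) in read_hugepages().items():
        print(f"{name}: total={total} free={free}")
```

If the 2MB pool reports fewer pages than it did before the upgrade, that would point at the node configuration and not at io-engine itself.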
