  1. Lustre
  2. LU-18260

o2iblnd: graceful handling of unexpected RDMA_CM_EVENT_REJECTED

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.15.5
    • Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • 3
    • 9223372036854775807

    Description

      The Lustre client and server are deployed in VMs, and the VMs use the network card in PF pass-through mode.

      【OS】
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      【Network Card】
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      【BUG Info】

      Steps to reproduce:

      • Mount Lustre on a RoCE network
      • Restart the Lustre server
      • The client crashes

      Client call trace:

      [97665.762774] Workqueue: ib_cm cm_work_handler [ib_cm]
      [97665.762977] Call Trace:
      [97665.763182]  <TASK>
      [97665.763374]  dump_stack_lvl+0x34/0x48
      [97665.763586]  panic+0x100/0x2d2
      [97665.763756]  lbug_with_loc.cold+0x18/0x18 [libcfs]
      [97665.763945]  kiblnd_cm_callback+0x108d/0x10b0 [ko2iblnd]
      [97665.764116]  cma_cm_event_handler+0x1e/0xb0 [rdma_cm]
      [97665.764279]  cma_ib_handler+0x8d/0x2e0 [rdma_cm]
      [97665.764439]  cm_process_work+0x22/0x190 [ib_cm]
      [97665.764597]  ? cm_queue_work_unlock+0x2a/0xd0 [ib_cm]
      [97665.764751]  cm_rej_handler+0xdf/0x260 [ib_cm]
      [97665.764909]  cm_work_handler+0x47f/0x4d0 [ib_cm]
      [97665.765059]  process_one_work+0x1e8/0x390
      [97665.765203]  worker_thread+0x53/0x3d0
      [97665.765350]  ? process_one_work+0x390/0x390
      [97665.765488]  kthread+0x124/0x150
      [97665.765626]  ? set_kthread_struct+0x50/0x50
      [97665.765761]  ret_from_fork+0x1f/0x30
      [97665.765901]  </TASK>

      Client kernel log:

      [94741.917887] Lustre: 3442:0:(client.c:2289:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1726811270/real 1726811270]  req@0000000017c656bd x1810605440940608/t0(0) o400->lustre-MDT0011-mdc-ff395c3864984000@10.255.153.131@o2ib:12/10 lens 224/224 e 1 to 1 dl 1726811302 ref 1 fl Rpc:XQr/c0/ffffffff rc 0/-1 job:''
      [94869.917879] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.124@o2ib: 3 seconds
      [94869.918229] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 195 previous similar messages
      [95091.934608] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
      [95091.935234] LustreError: Skipped 178 previous similar messages
      [95471.901885] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.123@o2ib: 1 seconds
      [95471.904650] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 205 previous similar messages
      [95694.879000] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
      [95694.879355] LustreError: Skipped 173 previous similar messages
      [96075.869883] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.123@o2ib: 0 seconds
      [96075.870246] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 198 previous similar messages
      [96294.942490] LustreError: 11-0: lustre-OST0022-osc-ff395c3864984000: operation ost_connect to node 10.255.153.128@o2ib failed: rc = -19
      [96294.942795] LustreError: Skipped 175 previous similar messages
      [96681.885884] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.124@o2ib: 1 seconds
      [96681.886250] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 197 previous similar messages
      [96899.871468] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
      [96899.871784] LustreError: Skipped 176 previous similar messages
      [97283.869887] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.123@o2ib: 1 seconds
      [97283.870258] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 199 previous similar messages
      [97501.918467] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30

            People

              xiyan Rongyao Peng