  1. Lustre
  2. LU-18755

Landing LU-17480 does not fix an LBUG in the CM event handler.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.0
    • Labels: None
    • Environment: RHEL 9.5 + MOFED
    • Severity: 3

    Description

      During testing on an IB fabric, a client hits a panic:

      [36321.312999] LNetError: 1098489:0:(o2iblnd_cb.c:3312:kiblnd_cm_callback()) LBUG
      [36321.313013] Pid: 1098489, comm: kworker/15:1 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024
      [36321.313019] Call Trace TBD:
      [36321.313022] Kernel panic - not syncing: LBUG
      [36321.341599] CPU: 15 PID: 1098489 Comm: kworker/15:1 Kdump: loaded Tainted: G S         OE     -------  ---  5.14.0-503.14.1.el9_5.x86_64 #1
      [36321.355581] Hardware name: Intel Corporation S2600JF/S2600JF, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
      [36321.367042] Workqueue: ib_cm cm_work_handler [ib_cm]
      [36321.372616] Call Trace:
      [36321.375349]  <TASK>
      [36321.377695]  dump_stack_lvl+0x34/0x48
      [36321.381795]  panic+0x107/0x2bb
      [36321.385205]  lbug_with_loc.cold+0x18/0x18 [libcfs]
      [36321.390587]  kiblnd_cm_callback+0x1305/0x1310 [ko2iblnd]
      [36321.396553]  cma_cm_event_handler+0x1e/0xd0 [rdma_cm]
      [36321.402211]  cma_ib_handler+0x8d/0x2f0 [rdma_cm]
      [36321.407383]  cm_process_work+0x25/0x1a0 [ib_cm]
      [36321.412459]  ? cm_queue_work_unlock+0x2f/0xd0 [ib_cm]
      [36321.418121]  cm_rej_handler+0xe5/0x290 [ib_cm]
      [36321.423106]  cm_work_handler+0x493/0x500 [ib_cm]
      [36321.428280]  process_one_work+0x194/0x380
      [36321.432766]  worker_thread+0x2fe/0x410
      [36321.436960]  ? __pfx_worker_thread+0x10/0x10
      [36321.441729]  kthread+0xdd/0x100
      [36321.445238]  ? __pfx_kthread+0x10/0x10
      [36321.449426]  ret_from_fork+0x29/0x50
      [36321.453424]  </TASK>
      

      Looking into the crash dump, I see an o2ib connection in the disconnected state, while the IB stack still sees it in the connecting state.
      This looks like a bug in the timeout handling introduced by LU-17480.
      On a connection timeout, it calls

      void kiblnd_abort_connreq(struct kib_conn *conn)
      {
              /* ignore, if already handled by the CM */
              if (kiblnd_deregister_connreq(conn))
                      return;
      
              kiblnd_connreq_done(conn, -ENETDOWN);
              kiblnd_conn_decref(conn);
      }
      

      which moves the o2ib connection into the disconnected state, but rdma_disconnect is not called and the CM connection is not closed. This leaves a window in which the peer may still respond with RDMA_CM_EVENT_REJECTED after kiblnd has changed the state to disconnected via the path

      kiblnd_abort_connreq -> 
        kiblnd_connreq_done -> 
         kiblnd_finalise_conn -> 
           kiblnd_set_conn_state(conn, IBLND_CONN_DISCONNECTED);  
      

      while the CM connection is still open. The late event then hits the LBUG.
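
      The window described above can be sketched with a small user-space model. This is only an illustration under stated assumptions: every name in it (model_conn, CONN_ACTIVE_CONNECT, model_abort_connreq and so on) is hypothetical and is not the ko2iblnd API, and it assumes the REJECTED handler only expects a connection that is still in a connecting state. The second abort variant is there only for contrast and is not a proposed patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified model. Names are illustrative, not ko2iblnd code. */
enum conn_state {
	CONN_ACTIVE_CONNECT,	/* connect request sent, waiting for the peer */
	CONN_DISCONNECTED,	/* the LND considers the connection finished */
};

struct model_conn {
	enum conn_state lnd_state;	/* state tracked by the LND */
	bool cm_open;			/* CM side can still deliver events */
};

/* Timeout path as described in the report: the LND state changes,
 * but the CM side is left open. */
static void model_abort_connreq(struct model_conn *c)
{
	c->lnd_state = CONN_DISCONNECTED;
}

/* Contrast case (an assumption, not a patch): the CM side is closed
 * together with the state change, so no later event can be delivered. */
static void model_abort_connreq_and_close(struct model_conn *c)
{
	c->lnd_state = CONN_DISCONNECTED;
	c->cm_open = false;
}

/* Returns true when a REJECTED event from the peer would be delivered
 * to a connection whose LND state is no longer a connecting state,
 * which is the situation the report says ends in the LBUG. */
static bool model_reject_hits_unexpected_state(const struct model_conn *c)
{
	if (!c->cm_open)
		return false;	/* nothing can be delivered any more */
	return c->lnd_state != CONN_ACTIVE_CONNECT;
}
```

      In this model, a connection that goes through model_abort_connreq() and then receives a reject is reported as hitting an unexpected state, while one that goes through model_abort_connreq_and_close() is not.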

            People

              shadow Alexey Lyashkov
              shadow Alexey Lyashkov