LU-14488

Support rdma_connect_locked()


Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.12.7, Lustre 2.15.0
    • None
    • MOFED-5.2-2.2.0.0

    Description

      Hi,

      I'm testing the Lustre master branch with MOFED-5.2-2.2.0.0. I get the following error when mounting Lustre on the client:

       

      [Thu Mar  4 11:15:48 2021] INFO: task kworker/u8:2:10042 blocked for more than 120 seconds.
      [Thu Mar  4 11:15:48 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Mar  4 11:15:48 2021] kworker/u8:2    D ffff8895368e0000     0 10042      2 0x00000080
      [Thu Mar  4 11:15:48 2021] Workqueue: rdma_cm cma_work_handler [rdma_cm]
      [Thu Mar  4 11:15:48 2021] Call Trace:
      [Thu Mar  4 11:15:48 2021]  [<ffffffff86786ca9>] schedule_preempt_disabled+0x29/0x70
      [Thu Mar  4 11:15:48 2021]  [<ffffffff86784c37>] __mutex_lock_slowpath+0xc7/0x1d0
      [Thu Mar  4 11:15:48 2021]  [<ffffffff8678400f>] mutex_lock+0x1f/0x2f
      [Thu Mar  4 11:15:48 2021]  [<ffffffffc054e5d3>] rdma_connect+0x23/0x50 [rdma_cm]
      [Thu Mar  4 11:15:48 2021]  [<ffffffffc0971105>] kiblnd_cm_callback+0x1575/0x23d0 [ko2iblnd]
      [Thu Mar  4 11:15:48 2021]  [<ffffffffc054ebd1>] cma_work_handler+0xa1/0xe0 [rdma_cm]
      [Thu Mar  4 11:15:48 2021]  [<ffffffff860be6bf>] process_one_work+0x17f/0x440
      [Thu Mar  4 11:15:48 2021]  [<ffffffff860bf7d6>] worker_thread+0x126/0x3c0
      [Thu Mar  4 11:15:48 2021]  [<ffffffff860bf6b0>] ? manage_workers.isra.26+0x2a0/0x2a0
      [Thu Mar  4 11:15:48 2021]  [<ffffffff860c6691>] kthread+0xd1/0xe0
      [Thu Mar  4 11:15:48 2021]  [<ffffffff860c65c0>] ? insert_kthread_work+0x40/0x40
      [Thu Mar  4 11:15:48 2021]  [<ffffffff86792d37>] ret_from_fork_nospec_begin+0x21/0x21
      [Thu Mar  4 11:15:48 2021]  [<ffffffff860c65c0>] ? insert_kthread_work+0x40/0x40

       

      I investigated and found that the issue is related to a change that came into MOFED from the upstream 5.10 kernel:

      https://www.spinics.net/lists/linux-rdma/msg96986.html

       

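      After that patch, rdma_connect() must not be called from the RDMA_CM_EVENT_ROUTE_RESOLVED handler; rdma_connect_locked() must be used instead. The handler is invoked with the CM ID's handler mutex already held, and rdma_connect() now takes that same mutex, which matches the trace above: kiblnd_cm_callback() calls rdma_connect() and blocks in mutex_lock().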
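
The locking problem can be illustrated with a small user-space model. This is only a sketch, not rdma_cm or ko2iblnd code: every name in it (handler_mutex, model_connect, model_connect_locked, model_dispatch, model_init) is hypothetical, and an error-checking pthread mutex stands in for the kernel mutex so that the self-deadlock is reported as EDEADLK instead of hanging the task.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-ID handler mutex held by the CM core. */
static pthread_mutex_t handler_mutex;

/* Models rdma_connect_locked(): the caller must already hold handler_mutex. */
static int model_connect_locked(void)
{
    return 0; /* pretend the connect request was issued */
}

/* Models rdma_connect() after the change: it takes handler_mutex itself. */
static int model_connect(void)
{
    int rc = pthread_mutex_lock(&handler_mutex);

    if (rc != 0)
        return -rc; /* error-checking mutex reports the relock as EDEADLK */
    rc = model_connect_locked();
    pthread_mutex_unlock(&handler_mutex);
    return rc;
}

/* Models the CM core invoking an event handler with handler_mutex held. */
static int model_dispatch(int (*handler)(void))
{
    int rc;

    pthread_mutex_lock(&handler_mutex);
    rc = handler();
    pthread_mutex_unlock(&handler_mutex);
    return rc;
}

/* Initializes handler_mutex as an error-checking mutex. */
static void model_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&handler_mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);
}
```

After model_init(), model_dispatch(model_connect) returns -EDEADLK, which corresponds to the blocked kworker in the trace, while model_dispatch(model_connect_locked) returns 0. In the real kernel the mutex is not error-checking, so the first case blocks forever rather than returning an error.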

      I'm testing a patch for the issue. I'm going to push it for review soon.

            People

              sergeygo Sergey Gorenko