[LU-14488] Support rdma_connect_locked() Created: 04/Mar/21  Updated: 27/Apr/21  Resolved: 09/Mar/21

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.7, Lustre 2.15.0

Type: Improvement Priority: Major
Reporter: Sergey Gorenko Assignee: Sergey Gorenko
Resolution: Fixed Votes: 0
Labels: LTS12
Environment:

MOFED-5.2-2.2.0.0


Issue Links:
Related
is related to LU-14588 LNet: make config script aware of the... Resolved
Epic/Theme: lnet
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

I'm testing the Lustre master branch with MOFED-5.2-2.2.0.0. I get the following error when mounting Lustre on the client:

 

[Thu Mar  4 11:15:48 2021] INFO: task kworker/u8:2:10042 blocked for more than 120 seconds.
[Thu Mar  4 11:15:48 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar  4 11:15:48 2021] kworker/u8:2    D ffff8895368e0000     0 10042      2 0x00000080
[Thu Mar  4 11:15:48 2021] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[Thu Mar  4 11:15:48 2021] Call Trace:
[Thu Mar  4 11:15:48 2021]  [<ffffffff86786ca9>] schedule_preempt_disabled+0x29/0x70
[Thu Mar  4 11:15:48 2021]  [<ffffffff86784c37>] __mutex_lock_slowpath+0xc7/0x1d0
[Thu Mar  4 11:15:48 2021]  [<ffffffff8678400f>] mutex_lock+0x1f/0x2f
[Thu Mar  4 11:15:48 2021]  [<ffffffffc054e5d3>] rdma_connect+0x23/0x50 [rdma_cm]
[Thu Mar  4 11:15:48 2021]  [<ffffffffc0971105>] kiblnd_cm_callback+0x1575/0x23d0 [ko2iblnd]
[Thu Mar  4 11:15:48 2021]  [<ffffffffc054ebd1>] cma_work_handler+0xa1/0xe0 [rdma_cm]
[Thu Mar  4 11:15:48 2021]  [<ffffffff860be6bf>] process_one_work+0x17f/0x440
[Thu Mar  4 11:15:48 2021]  [<ffffffff860bf7d6>] worker_thread+0x126/0x3c0
[Thu Mar  4 11:15:48 2021]  [<ffffffff860bf6b0>] ? manage_workers.isra.26+0x2a0/0x2a0
[Thu Mar  4 11:15:48 2021]  [<ffffffff860c6691>] kthread+0xd1/0xe0
[Thu Mar  4 11:15:48 2021]  [<ffffffff860c65c0>] ? insert_kthread_work+0x40/0x40
[Thu Mar  4 11:15:48 2021]  [<ffffffff86792d37>] ret_from_fork_nospec_begin+0x21/0x21
[Thu Mar  4 11:15:48 2021]  [<ffffffff860c65c0>] ? insert_kthread_work+0x40/0x40

 

I investigated the issue and found that it is related to a change that came into MOFED from upstream kernel 5.10:

https://www.spinics.net/lists/linux-rdma/msg96986.html

 

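After that patch, rdma_connect() must not be called from the RDMA_CM_EVENT_ROUTE_RESOLVED handler; rdma_connect_locked() must be used instead. The trace above shows the result of calling it anyway: rdma_connect(), called from kiblnd_cm_callback(), blocks in mutex_lock().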

I'm testing a patch for the issue. I'm going to push it for review soon.
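
To illustrate why the handler hangs, here is a small user-space model of the locking rule. It is only a sketch and not the kernel or Lustre code: all mock_* names are hypothetical, the flag handler_locked stands in for the mutex that, judging by the trace, is already held when the CM event handler runs, and the model returns -EDEADLK where the real kernel would block forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an RDMA CM id and its handler mutex. */
struct mock_cm_id {
	bool handler_locked;	/* models the handler mutex being held */
	bool connected;		/* set once the connect request is issued */
};

/* Models rdma_connect_locked(): the caller must already hold the lock. */
static int mock_rdma_connect_locked(struct mock_cm_id *id)
{
	assert(id->handler_locked);
	id->connected = true;
	return 0;
}

/* Models rdma_connect() after the change: it takes the lock itself. */
static int mock_rdma_connect(struct mock_cm_id *id)
{
	int rc;

	if (id->handler_locked)
		return -EDEADLK;	/* the real kernel would hang here */

	id->handler_locked = true;
	rc = mock_rdma_connect_locked(id);
	id->handler_locked = false;
	return rc;
}

/*
 * Models the CM core delivering RDMA_CM_EVENT_ROUTE_RESOLVED: the event
 * handler runs with the lock held and then tries to connect.
 */
static int mock_route_resolved(struct mock_cm_id *id, bool use_locked)
{
	int rc;

	id->handler_locked = true;
	rc = use_locked ? mock_rdma_connect_locked(id)
			: mock_rdma_connect(id);
	id->handler_locked = false;
	return rc;
}
```

In this model, mock_route_resolved(&id, false) fails with -EDEADLK, which corresponds to the hung task above, while mock_route_resolved(&id, true) succeeds. A real driver would presumably choose between the two calls at build time, depending on whether the installed kernel or MOFED headers provide rdma_connect_locked().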



 Comments   
Comment by Gerrit Updater [ 04/Mar/21 ]

Sergey Gorenko (sergeygo@nvidia.com) uploaded a new patch: https://review.whamcloud.com/41887
Subject: LU-14488 o2ib: Use rdma_connect_locked if it is defined
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 937f7433cd80e72bee637eeb9647fb2739f67554

Comment by Gerrit Updater [ 09/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41887/
Subject: LU-14488 o2ib: Use rdma_connect_locked if it is defined
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 60d55e42ed9e043341790bf7624627c93cc99200

Comment by Peter Jones [ 09/Mar/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 10/Mar/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41977
Subject: LU-14488 o2ib: Use rdma_connect_locked if it is defined
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 687c6a882282afbf0e226440b80251a9bd34221d

Comment by Gerrit Updater [ 22/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41977/
Subject: LU-14488 o2ib: Use rdma_connect_locked if it is defined
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: d43375868aba4edcf0bc637256a9fb102709f14f
