[LU-15824] lnet not working with EL5.4 MOFED5.2 Created: 05/May/22  Updated: 13/Jul/22  Resolved: 13/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Mahmoud Hanafi
Assignee: Peter Jones
Resolution: Duplicate
Votes: 0
Labels: None

Attachments: out1.dk.gz, out2.dk.gz
Severity: 2

Description

LNet is not working on EL8.5 with MOFED 5.2 and Lustre 2.12.6.

I first see this error:

[Wed May  4 23:28:46 2022] alg: No test for adler32 (adler32-zlib)
[Wed May  4 23:28:46 2022] alg: hash: digest failed on test 1 for crc32-table: ret=126
  

Then these LNet timeouts and hung-task traces follow:

[Wed May  4 23:37:02 2022] LNetError: 7708:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.141.16.185@o2ib417: -125
[Wed May  4 23:37:02 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 924 seconds
[Wed May  4 23:37:59 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 981 seconds
[Wed May  4 23:38:49 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 1031 seconds
[Wed May  4 23:38:49 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Skipped 1 previous similar message
[Wed May  4 23:40:04 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 1106 seconds
[Wed May  4 23:40:04 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Skipped 1 previous similar message
[Wed May  4 23:40:04 2022] INFO: task kworker/u256:1:7922 blocked for more than 120 seconds.
[Wed May  4 23:40:04 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.15.1.1nas.el8.t4.x86_64 #1
[Wed May  4 23:40:04 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed May  4 23:40:04 2022] kworker/u256:1  D    0  7922      2 0x80004080
[Wed May  4 23:40:04 2022] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[Wed May  4 23:40:04 2022] Call Trace:
[Wed May  4 23:40:04 2022]  __schedule+0x2a9/0x710
[Wed May  4 23:40:04 2022]  schedule+0x4d/0xc0
[Wed May  4 23:40:04 2022]  schedule_preempt_disabled+0x11/0x20
[Wed May  4 23:40:04 2022]  __mutex_lock.isra.5+0x343/0x550
[Wed May  4 23:40:04 2022]  ? kiblnd_post_rx+0x1ff/0x520 [ko2iblnd]
[Wed May  4 23:40:04 2022]  rdma_connect+0x1e/0x40 [rdma_cm]
[Wed May  4 23:40:04 2022]  kiblnd_cm_callback+0x1476/0x2220 [ko2iblnd]
[Wed May  4 23:40:04 2022]  ? __switch_to_asm+0x41/0x70
[Wed May  4 23:40:04 2022]  cma_cm_event_handler+0x25/0xf0 [rdma_cm]
[Wed May  4 23:40:04 2022]  cma_work_handler+0x5a/0xb0 [rdma_cm]
[Wed May  4 23:40:04 2022]  process_one_work+0x1ae/0x3a0
[Wed May  4 23:40:04 2022]  worker_thread+0x3c/0x3c0
[Wed May  4 23:40:04 2022]  ? create_worker+0x1a0/0x1a0
[Wed May  4 23:40:04 2022]  kthread+0x11d/0x140
[Wed May  4 23:40:04 2022]  ? kthread_flush_work_fn+0x10/0x10
[Wed May  4 23:40:04 2022]  ret_from_fork+0x22/0x40
[Wed May  4 23:40:54 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 1156 seconds
[Wed May  4 23:40:54 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Skipped 1 previous similar message
[Wed May  4 23:42:07 2022] INFO: task kworker/u256:1:7922 blocked for more than 120 seconds.
[Wed May  4 23:42:07 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.15.1.1nas.el8.t4.x86_64 #1
[Wed May  4 23:42:07 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed May  4 23:42:07 2022] kworker/u256:1  D    0  7922      2 0x80004080
[Wed May  4 23:42:07 2022] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[Wed May  4 23:42:07 2022] Call Trace:
[Wed May  4 23:42:07 2022]  __schedule+0x2a9/0x710
[Wed May  4 23:42:07 2022]  schedule+0x4d/0xc0
[Wed May  4 23:42:07 2022]  schedule_preempt_disabled+0x11/0x20
[Wed May  4 23:42:07 2022]  __mutex_lock.isra.5+0x343/0x550
[Wed May  4 23:42:07 2022]  ? kiblnd_post_rx+0x1ff/0x520 [ko2iblnd]
[Wed May  4 23:42:07 2022]  rdma_connect+0x1e/0x40 [rdma_cm]
[Wed May  4 23:42:07 2022]  kiblnd_cm_callback+0x1476/0x2220 [ko2iblnd]
[Wed May  4 23:42:07 2022]  ? __switch_to_asm+0x41/0x70
[Wed May  4 23:42:07 2022]  cma_cm_event_handler+0x25/0xf0 [rdma_cm]
[Wed May  4 23:42:07 2022]  cma_work_handler+0x5a/0xb0 [rdma_cm]
[Wed May  4 23:42:07 2022]  process_one_work+0x1ae/0x3a0
[Wed May  4 23:42:07 2022]  worker_thread+0x3c/0x3c0
[Wed May  4 23:42:07 2022]  ? create_worker+0x1a0/0x1a0
[Wed May  4 23:42:07 2022]  kthread+0x11d/0x140
[Wed May  4 23:42:07 2022]  ? kthread_flush_work_fn+0x10/0x10
[Wed May  4 23:42:07 2022]  ret_from_fork+0x22/0x40
[Wed May  4 23:42:09 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 1231 seconds
[Wed May  4 23:42:09 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Skipped 1 previous similar message
[Wed May  4 23:42:59 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 1281 seconds
[Wed May  4 23:42:59 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Skipped 1 previous similar message 

See attached debug logs.
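
For triage, the console messages above can be summarized with a short script. The following is a minimal sketch, not part of the original report: the function name `summarize` and the two regular expressions are hypothetical and only match the message shapes quoted in this ticket. It collects the "Timed out tx" values per NID and decodes the errno from the "Error sending" line (on Linux, errno 125 is ECANCELED).

```python
import errno
import re

# Matches lines such as:
#   "... kiblnd_check_conns()) Timed out tx for <nid>: 924 seconds"
TIMEOUT_RE = re.compile(r"Timed out tx for (\S+): (\d+) seconds")
# Matches the trailing negative errno in:
#   "... Error sending GET to <id>: -125"
SEND_ERR_RE = re.compile(r"Error sending \w+ to \S+: -(\d+)")

# Two timeout lines and one send error, copied from the log in this ticket.
SAMPLE = """\
[Wed May  4 23:37:02 2022] LNetError: 7708:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.141.16.185@o2ib417: -125
[Wed May  4 23:37:02 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 924 seconds
[Wed May  4 23:37:59 2022] LNet: 7675:0:(o2iblnd_cb.c:3421:kiblnd_check_conns()) Timed out tx for 10.141.16.185@o2ib417: 981 seconds
"""


def summarize(log_text):
    """Return (timeouts per NID, errno names) found in log_text."""
    timeouts = {}
    errors = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            # Group the reported tx ages (in seconds) by peer NID.
            timeouts.setdefault(m.group(1), []).append(int(m.group(2)))
            continue
        m = SEND_ERR_RE.search(line)
        if m:
            code = int(m.group(1))
            # Fall back to the raw number if this platform has no name for it.
            errors.append(errno.errorcode.get(code, str(code)))
    return timeouts, errors


if __name__ == "__main__":
    timeouts, errors = summarize(SAMPLE)
    for nid, ages in timeouts.items():
        print(nid, "tx ages (s):", ages)
    print("send errors:", errors)
```

A steadily growing tx age for a single NID, as in the log above, suggests that the connection to that peer never completes rather than failing intermittently.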



Comments
Comment by Peter Jones [ 05/May/22 ]

Mahmoud

Do you have the patch LU-14488 in your distribution?

Peter

Comment by Mahmoud Hanafi [ 06/May/22 ]

I don't think we have that. I will get a build with that patch.

Thanks,

Comment by Mahmoud Hanafi [ 13/Jul/22 ]

Please close this.
