Details
Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.7
Environment:
3.10.0-1160.45.1.1chaos.ch6.x86_64
lustre-2.12.7_2.llnl
3.10.0-1160.53.1.1chaos.ch6.x86_64
lustre-2.12.8_6.llnl
RHEL7.9
zfs-0.7.11-9.8llnl
Severity: 3
Description
We upgraded a Lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, the servers began reporting soft lockups on the console, with stack traces like this:
2022-02-08 09:43:10 [1644342190.528916] Call Trace:
 queued_spin_lock_slowpath+0xb/0xf
 _raw_spin_lock+0x30/0x40
 cfs_percpt_lock+0xc1/0x110 [libcfs]
 lnet_discover_peer_locked+0xa0/0x450 [lnet]
 ? wake_up_atomic_t+0x30/0x30
 LNetPrimaryNID+0xd5/0x220 [lnet]
 ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 target_handle_connect+0x12f1/0x2b90 [ptlrpc]
 ? enqueue_task_fair+0x208/0x6c0
 ? check_preempt_curr+0x80/0xa0
 ? ttwu_do_wakeup+0x19/0x100
 tgt_request_handle+0x4fa/0x1570 [ptlrpc]
 ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
 ? __getnstimeofday64+0x3f/0xd0
 ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
 ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
 ? __wake_up_common_lock+0x91/0xc0
 ? sched_feat_set+0xf0/0xf0
 ptlrpc_main+0xc49/0x1c50 [ptlrpc]
 ? __switch_to+0xce/0x5a0
 ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
 kthread+0xd1/0xe0
 ? insert_kthread_work+0x40/0x40
 ret_from_fork_nospec_begin+0x21/0x21
 ? insert_kthread_work+0x40/0x40
Some servers never exit recovery; others do, but then appear unable to service requests.
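For context on where the threads are piling up: cfs_percpt_lock() is the libcfs per-CPU-partitioned spinlock, and every connect-handling thread in the trace reaches it through LNetPrimaryNID() -> lnet_discover_peer_locked(). The user-space sketch below is not the libcfs implementation; it only illustrates, assuming the contended path takes the lock in an exclusive (all-partitions) mode, how a pool of service threads sweeping every partition of one lock can spin long enough to trip the soft-lockup watchdog. The names in it (percpt_lock, PERCPT_EX, connect_thread, NPARTS) are invented for illustration.

/*
 * Minimal user-space sketch (NOT the libcfs code) of a per-CPU-partitioned
 * lock.  Normal callers lock only their own partition; an "exclusive"
 * caller must take every partition.  Assumption for illustration: the
 * LNetPrimaryNID()/discovery path seen in the trace contends on the lock
 * in a way equivalent to this exclusive sweep.
 */
#include <pthread.h>
#include <stdio.h>

#define NPARTS    4        /* stand-in for the number of CPU partitions */
#define PERCPT_EX (-1)     /* stand-in for CFS_PERCPT_LOCK_EX */
#define NTHREADS  32       /* stand-in for ptlrpc service threads */

struct percpt_lock {
	pthread_spinlock_t part[NPARTS];
};

static struct percpt_lock net_lock;   /* stand-in for the LNet net lock */

static void percpt_lock_init(struct percpt_lock *pcl)
{
	for (int i = 0; i < NPARTS; i++)
		pthread_spin_init(&pcl->part[i], PTHREAD_PROCESS_PRIVATE);
}

/* Lock one partition, or every partition for an exclusive caller. */
static void percpt_lock(struct percpt_lock *pcl, int index)
{
	if (index >= 0) {
		pthread_spin_lock(&pcl->part[index]);
		return;
	}
	for (int i = 0; i < NPARTS; i++)
		pthread_spin_lock(&pcl->part[i]);
}

static void percpt_unlock(struct percpt_lock *pcl, int index)
{
	if (index >= 0) {
		pthread_spin_unlock(&pcl->part[index]);
		return;
	}
	for (int i = NPARTS - 1; i >= 0; i--)
		pthread_spin_unlock(&pcl->part[i]);
}

/* Models one service thread whose connect handling hits the exclusive path. */
static void *connect_thread(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		percpt_lock(&net_lock, PERCPT_EX);
		percpt_unlock(&net_lock, PERCPT_EX);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	percpt_lock_init(&net_lock);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, connect_thread, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	puts("done");
	return 0;
}

Built with gcc -pthread, most of the runtime with this many threads goes to spinning in the exclusive sweep, which is the same shape as the queued_spin_lock_slowpath frames in the trace above.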
Seen during the same Lustre server update where we saw LU-15539, but this appears to be a separate issue.
Patch stacks are:
https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
Thank you, Serguei. We'll add them to our stack and do some testing. We haven't successfully reproduced the original issue, so we'll only be able to tell you whether we see any unexpected new symptoms with LNet, but that's a start.