Details
Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.7
Environment:
  kernel 3.10.0-1160.45.1.1chaos.ch6.x86_64 / lustre-2.12.7_2.llnl
  kernel 3.10.0-1160.53.1.1chaos.ch6.x86_64 / lustre-2.12.8_6.llnl
  RHEL7.9, zfs-0.7.11-9.8llnl
Severity: 3
Description
We upgraded a Lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, the servers begin reporting soft lockups on the console, with stacks like this:
2022-02-08 09:43:10 [1644342190.528916] Call Trace:
 queued_spin_lock_slowpath+0xb/0xf
 _raw_spin_lock+0x30/0x40
 cfs_percpt_lock+0xc1/0x110 [libcfs]
 lnet_discover_peer_locked+0xa0/0x450 [lnet]
 ? wake_up_atomic_t+0x30/0x30
 LNetPrimaryNID+0xd5/0x220 [lnet]
 ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 target_handle_connect+0x12f1/0x2b90 [ptlrpc]
 ? enqueue_task_fair+0x208/0x6c0
 ? check_preempt_curr+0x80/0xa0
 ? ttwu_do_wakeup+0x19/0x100
 tgt_request_handle+0x4fa/0x1570 [ptlrpc]
 ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
 ? __getnstimeofday64+0x3f/0xd0
 ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
 ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
 ? __wake_up_common_lock+0x91/0xc0
 ? sched_feat_set+0xf0/0xf0
 ptlrpc_main+0xc49/0x1c50 [ptlrpc]
 ? __switch_to+0xce/0x5a0
 ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
 kthread+0xd1/0xe0
 ? insert_kthread_work+0x40/0x40
 ret_from_fork_nospec_begin+0x21/0x21
 ? insert_kthread_work+0x40/0x40
Some servers never exit recovery; others complete recovery but appear unable to service requests.
This was seen during the same Lustre server update as LU-15539, but it appears to be a separate issue.
Patch stacks are:
https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl