Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0
- 3
- 16220
Description
In our testing, a Lustre client hangs forever after OST failover. The client had 64 GB of RAM and was running a parallel file test application over a set of 1G files. Right before the complete hang, the ptlrpcd_rcv thread was consuming 100% CPU.
A crashdump was taken. It shows the following stack trace for ptlrpcd_rcv:
crash> foreach ptlrpcd_rcv bt
PID: 17113  TASK: ffff8806323f8ae0  CPU: 22  COMMAND: "ptlrpcd_rcv"
 #0 [ffff880655547e90] crash_nmi_callback at ffffffff8102fee6
 #1 [ffff880655547ea0] notifier_call_chain at ffffffff8152a965
 #2 [ffff880655547ee0] atomic_notifier_call_chain at ffffffff8152a9ca
 #3 [ffff880655547ef0] notify_die at ffffffff810a12be
 #4 [ffff880655547f20] do_nmi at ffffffff8152862b
 #5 [ffff880655547f50] nmi at ffffffff81527ef0
    [exception RIP: cl_page_put+426]
    RIP: ffffffffa05f428a  RSP: ffff8805d85d9990  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff88047c921800  RCX: ffff8805d85d9ab8
    RDX: 0000000000000000  RSI: ffff88047c921800  RDI: ffff880bc515ee60
    RBP: ffff8805d85d99d0   R8: 0000000000000040   R9: ffff8805d85d99a0
    R10: 0000000000009500  R11: ffff880303aac8f0  R12: ffff880bc515ee60
    R13: 000000000000004a  R14: ffff880bc51d2e90  R15: ffff880bc515ee60
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff8805d85d9990] cl_page_put at ffffffffa05f428a [obdclass]
 #7 [ffff8805d85d99d8] osc_page_gang_lookup at ffffffffa0ce3fac [osc]
 #8 [ffff8805d85d9a78] osc_ldlm_weigh_ast at ffffffffa0cdb1b7 [osc]
 #9 [ffff8805d85d9af8] osc_cancel_weight at ffffffffa0cc1a3b [osc]
#10 [ffff8805d85d9b08] ldlm_cancel_no_wait_policy at ffffffffa07ecec1 [ptlrpc]
#11 [ffff8805d85d9b28] ldlm_prepare_lru_list at ffffffffa07f0f0b [ptlrpc]
#12 [ffff8805d85d9ba8] ldlm_cancel_lru_local at ffffffffa07f1324 [ptlrpc]
#13 [ffff8805d85d9bc8] ldlm_replay_locks at ffffffffa07f14ac [ptlrpc]
#14 [ffff8805d85d9c48] ptlrpc_import_recovery_state_machine at ffffffffa083cb67 [ptlrpc]
#15 [ffff8805d85d9ca8] ptlrpc_replay_interpret at ffffffffa0811a84 [ptlrpc]
#16 [ffff8805d85d9cd8] ptlrpc_check_set at ffffffffa08130ec [ptlrpc]
#17 [ffff8805d85d9d78] ptlrpcd_check at ffffffffa08404db [ptlrpc]
#18 [ffff8805d85d9dd8] ptlrpcd at ffffffffa0840b2b [ptlrpc]
#19 [ffff8805d85d9ee8] kthread at ffffffff8109ac66
#20 [ffff8805d85d9f48] kernel_thread at ffffffff8100c20a
crash>
The hang is caused by an endless loop in osc_lock_weight(): when osc_page_gang_lookup() returns CLP_GANG_RESCHED, the page scan is restarted from the very beginning. On the next round, CLP_GANG_RESCHED is likely to be returned again if the number of scanned pages is large enough, so the scan never completes.
This is confirmed by setting ldlm.cancel_unused_locks_before_replay=0 on the client before OST failover: with that setting, the hang does not occur.
Moreover, I have a patch that makes osc_lock_weight() restart from the last processed page index + 1, so an endless loop is impossible. The patch fixes the hang as well.
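
To illustrate the difference between the two restart strategies, here is a minimal user-space sketch. It is not the Lustre code and not the actual patch: gang_lookup(), SCAN_BUDGET, struct scan_state and both scan_* functions are hypothetical names, and the reschedule condition is modelled as a fixed page budget per call (an assumption made only so the behaviour is deterministic; in the kernel the condition depends on scheduling).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lookup results named in the report. */
enum gang_result { GANG_OKAY, GANG_RESCHED };

/* Assumption: the lookup gives up after visiting this many pages. */
#define SCAN_BUDGET 16UL

struct scan_state {
	unsigned long last_index; /* index of the last page visited */
	unsigned long visited;    /* total pages visited so far */
};

/* Toy lookup: visit pages in [start, end], stop early when the budget is spent. */
static enum gang_result gang_lookup(unsigned long start, unsigned long end,
				    struct scan_state *st)
{
	unsigned long seen = 0;
	unsigned long idx;

	for (idx = start; idx <= end; idx++) {
		if (seen == SCAN_BUDGET)
			return GANG_RESCHED; /* pages remain, ask caller to retry */
		st->last_index = idx;
		st->visited++;
		seen++;
	}
	return GANG_OKAY;
}

/*
 * Restart from the beginning on every RESCHED (the behaviour described in
 * the report). Returns the number of rounds needed, or -1 if max_rounds is
 * exhausted, which stands in for "loops forever".
 */
static long scan_restart_from_start(unsigned long start, unsigned long end,
				    long max_rounds)
{
	struct scan_state st = { 0, 0 };
	long rounds = 0;

	while (rounds < max_rounds) {
		rounds++;
		if (gang_lookup(start, end, &st) == GANG_OKAY)
			return rounds;
	}
	return -1;
}

/* Resume from last processed index + 1, so every round makes progress. */
static long scan_resume_after_last(unsigned long start, unsigned long end,
				   long max_rounds)
{
	struct scan_state st = { 0, 0 };
	unsigned long next = start;
	long rounds = 0;

	while (rounds < max_rounds) {
		rounds++;
		if (gang_lookup(next, end, &st) == GANG_OKAY)
			return rounds;
		assert(st.last_index >= next); /* at least one page was visited */
		next = st.last_index + 1;
	}
	return -1;
}
```

In this model, a range larger than the budget can never be finished by scan_restart_from_start(), because each round covers the same first SCAN_BUDGET pages, while scan_resume_after_last() finishes any finite range in a bounded number of rounds.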