[LU-8418] node fails to kdump after lbug crash Created: 20/Jul/16 Updated: 19/Nov/16 Resolved: 19/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A customer reported that hitting LBUG() does not always produce a crash dump.
The logs show that the panic was not clean: after the LBUG, the node kept logging messages, including CPU soft lockups. Eventually a crash dump was triggered manually through the serial console. It showed that the LBUG thread had hung in memory allocation while all CPUs were spinning, trying to take a spinlock held by the LBUG thread itself.
PID: 38035  TASK: ffff880c83804100  CPU: 7  COMMAND: "mdt_446"
 #0 [ffff880cbaf7d6b0] schedule at ffffffff814ea122
 #1 [ffff880cbaf7d778] __cond_resched at ffffffff81061b4a
 #2 [ffff880cbaf7d798] _cond_resched at ffffffff814eab30
 #3 [ffff880cbaf7d7a8] kmem_cache_alloc_notrace at ffffffff8115f385
 #4 [ffff880cbaf7d7d8] call_usermodehelper_setup at ffffffff81089e5d
 #5 [ffff880cbaf7d828] libcfs_run_upcall at ffffffffa04ee9c0 [libcfs]
 #6 [ffff880cbaf7d8a8] libcfs_run_lbug_upcall at ffffffffa04eed5d [libcfs]
 #7 [ffff880cbaf7d928] lbug_with_loc at ffffffffa04eee38 [libcfs]
 #8 [ffff880cbaf7d948] ldlm_export_flock_put at ffffffffa07e137a [ptlrpc]
 #9 [ffff880cbaf7d968] cfs_hash_bd_del_locked at ffffffffa04ffab1 [libcfs]
#10 [ffff880cbaf7d998] cfs_hash_del at ffffffffa0502811 [libcfs]
#11 [ffff880cbaf7d9e8] ldlm_flock_blocking_unlink at ffffffffa07e1b82 [ptlrpc]
#12 [ffff880cbaf7d9f8] ldlm_process_flock_lock at ffffffffa07e25a2 [ptlrpc]
#13 [ffff880cbaf7daf8] ldlm_reprocess_queue at ffffffffa07b6132 [ptlrpc]
#14 [ffff880cbaf7db48] ldlm_process_flock_lock at ffffffffa07e265f [ptlrpc]
#15 [ffff880cbaf7dc48] ldlm_lock_enqueue at ffffffffa07b7533 [ptlrpc]
#16 [ffff880cbaf7dca8] ldlm_handle_enqueue0 at ffffffffa07df0ef [ptlrpc]
#17 [ffff880cbaf7dd18] mdt_enqueue at ffffffffa0d18a16 [mdt]
#18 [ffff880cbaf7dd38] mdt_handle_common at ffffffffa0d0bffa [mdt]
#19 [ffff880cbaf7dd88] mdt_regular_handle at ffffffffa0d0ceb5 [mdt] |
| Comments |
| Comment by Alexander Zarochentsev [ 20/Jul/16 ] |
|
The proposed patch adds the ability to disable the lnet upcall. Currently there is no check for whether the lnet upcall is set to an empty string or to a nonexistent executable file, so the call is attempted regardless. I think the upcall is not used in most Lustre setups, but the attempt to call it reduces the chances of getting a crash dump. |
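The kind of guard described in this comment can be sketched in user-space C. This is only an illustration and is not taken from the patch: the helper name `upcall_is_usable` is hypothetical, and `access()` stands in for whatever check the kernel code would actually perform, since kernel code cannot call it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the Lustre tree): decide whether an
 * upcall path is worth invoking at all. An unset or empty path, or a
 * path that does not name an executable file, is treated as
 * "upcall disabled", so the caller can skip the upcall entirely.
 */
static bool upcall_is_usable(const char *path)
{
    /* Unset or empty string: the upcall is disabled. */
    if (path == NULL || path[0] == '\0')
        return false;

    /* Missing or non-executable file: nothing to run. */
    return access(path, X_OK) == 0;
}
```

A caller on the LBUG path would test `upcall_is_usable(path)` first and go straight to the panic when it returns false, which avoids the allocation seen in `call_usermodehelper_setup` in the backtrace above.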
| Comment by Gerrit Updater [ 20/Jul/16 ] |
|
Alexander Zarochentsev (alexander.zarochentsev@seagate.com) uploaded a new patch: http://review.whamcloud.com/21440 |
| Comment by Gerrit Updater [ 19/Nov/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21440/ |
| Comment by Peter Jones [ 19/Nov/16 ] |
|
Landed for 2.9 |