[LU-8418] node fails to kdump after lbug crash Created: 20/Jul/16  Updated: 19/Nov/16  Resolved: 19/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A customer reported that hitting an LBUG() does not always produce a crash dump.
Sometimes the system hangs after the LBUG message appears in the logs.

No kdump was generated from this crash.

The logs show that the panic was not clean: after the LBUG, log messages continued to arrive from the node, including CPU soft lockups.

Eventually a crash dump was triggered manually through the serial console.

It showed that the LBUG thread was stuck in a memory allocation while all CPUs were spinning, trying to acquire a spinlock held by the LBUG thread itself.

PID: 38035  TASK: ffff880c83804100  CPU: 7   COMMAND: "mdt_446"
 #0 [ffff880cbaf7d6b0] schedule at ffffffff814ea122
 #1 [ffff880cbaf7d778] __cond_resched at ffffffff81061b4a
 #2 [ffff880cbaf7d798] _cond_resched at ffffffff814eab30
 #3 [ffff880cbaf7d7a8] kmem_cache_alloc_notrace at ffffffff8115f385
 #4 [ffff880cbaf7d7d8] call_usermodehelper_setup at ffffffff81089e5d
 #5 [ffff880cbaf7d828] libcfs_run_upcall at ffffffffa04ee9c0 [libcfs]
 #6 [ffff880cbaf7d8a8] libcfs_run_lbug_upcall at ffffffffa04eed5d [libcfs]
 #7 [ffff880cbaf7d928] lbug_with_loc at ffffffffa04eee38 [libcfs]
 #8 [ffff880cbaf7d948] ldlm_export_flock_put at ffffffffa07e137a [ptlrpc]
 #9 [ffff880cbaf7d968] cfs_hash_bd_del_locked at ffffffffa04ffab1 [libcfs]
#10 [ffff880cbaf7d998] cfs_hash_del at ffffffffa0502811 [libcfs]
#11 [ffff880cbaf7d9e8] ldlm_flock_blocking_unlink at ffffffffa07e1b82 [ptlrpc]
#12 [ffff880cbaf7d9f8] ldlm_process_flock_lock at ffffffffa07e25a2 [ptlrpc]
#13 [ffff880cbaf7daf8] ldlm_reprocess_queue at ffffffffa07b6132 [ptlrpc]
#14 [ffff880cbaf7db48] ldlm_process_flock_lock at ffffffffa07e265f [ptlrpc]
#15 [ffff880cbaf7dc48] ldlm_lock_enqueue at ffffffffa07b7533 [ptlrpc]
#16 [ffff880cbaf7dca8] ldlm_handle_enqueue0 at ffffffffa07df0ef [ptlrpc]
#17 [ffff880cbaf7dd18] mdt_enqueue at ffffffffa0d18a16 [mdt]
#18 [ffff880cbaf7dd38] mdt_handle_common at ffffffffa0d0bffa [mdt]
#19 [ffff880cbaf7dd88] mdt_regular_handle at ffffffffa0d0ceb5 [mdt]
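
The failure mode in this trace, a call that may sleep being issued while a spinlock is held, can be illustrated with a toy user-space model. This is only a sketch: `toy_spin_lock`, `toy_spin_unlock`, and `toy_alloc_may_sleep` are hypothetical names, not kernel or Lustre APIs, and the counter merely stands in for the kernel's notion of atomic context.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for "number of spinlocks held by this thread". */
static int toy_atomic_depth;

void toy_spin_lock(void)
{
    toy_atomic_depth++;
}

void toy_spin_unlock(void)
{
    toy_atomic_depth--;
}

/*
 * Hypothetical allocation that is allowed to sleep.
 * In this model it refuses to run (returns NULL) while a toy spinlock
 * is held, because sleeping there would leave every other CPU spinning
 * on the lock, as in the trace above.
 */
void *toy_alloc_may_sleep(size_t size)
{
    if (toy_atomic_depth > 0)
        return NULL;
    return malloc(size);
}
```

In the trace, the equivalent of `toy_alloc_may_sleep` is the allocation under `call_usermodehelper_setup`, reached from `lbug_with_loc` below `cfs_hash_bd_del_locked`.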


 Comments   
Comment by Alexander Zarochentsev [ 20/Jul/16 ]

The proposed patch adds the ability to disable the lnet upcall. Currently there is no check for whether the lnet upcall is set to an empty string or to a non-existent executable file. I think the upcall is not used in most Lustre setups, but the attempt to call it reduces the chances of getting a crash dump.
I will upload the patch soon.
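
The kind of guard proposed here might look roughly like the following user-space sketch. It is not the code from the patch: `upcall_is_configured` and `run_upcall_if_configured` are hypothetical names, and the callback parameter only stands in for whatever actually launches the helper.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: true only if an upcall path has been configured.
 * A NULL pointer or an empty string is treated as "no upcall".
 */
bool upcall_is_configured(const char *path)
{
    return path != NULL && path[0] != '\0';
}

/*
 * Hypothetical wrapper: skips the upcall entirely when none is
 * configured, so the caller can proceed straight to the panic path.
 * Returns 0 when skipped, otherwise whatever the launcher returns.
 */
int run_upcall_if_configured(const char *path, int (*run)(const char *))
{
    if (!upcall_is_configured(path))
        return 0;
    return run(path);
}
```

With an empty path the launcher is never invoked, so no allocation or scheduling happens on the LBUG path.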

Comment by Gerrit Updater [ 20/Jul/16 ]

Alexander Zarochentsev (alexander.zarochentsev@seagate.com) uploaded a new patch: http://review.whamcloud.com/21440
Subject: LU-8418 libcfs: do not call empty lnet upcall
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: affe720351475a860925be30511ebe4242651cd2

Comment by Gerrit Updater [ 19/Nov/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21440/
Subject: LU-8418 libcfs: remove lnet upcall code
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4bc9bae59fea315bdd36e8170e8388d5fce2a397

Comment by Peter Jones [ 19/Nov/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:17:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.