[LU-17242] Clean up and Improve Lustre Debugging Created: 30/Oct/23  Updated: 04/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Tim Day Assignee: Tim Day
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17075 replay-dual test_16 ZFS MDS crash: o... Resolved
is related to LU-16625 improved Lustre thread debugging Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Some parts of Lustre debugging have been superseded by newer kernel features (such as CFS_CHECK_STACK) and should be removed.

Certain subsystems stack custom macros on-top of Lustre's already custom debugging. Those should be simplified if possible.

Other subsystems use low-level debugging functions such as libcfs_debug_msg. These should use the higher level macros. That way, the underlying debugging implementation can more easily be swapped out.



 Comments   
Comment by Gerrit Updater [ 30/Oct/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52883
Subject: LU-17242 debug: remove CFS_CHECK_STACK
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0d01390b5c371e76b315d38ff5213078057317da

Comment by Gerrit Updater [ 31/Oct/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52897
Subject: LU-17242 ptlrpc: refactor DEBUG_REQ to use CDEBUG
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 16368c2dbe8d3b095f43701237eed490e5a93c6d

Comment by Gerrit Updater [ 02/Nov/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52946
Subject: LU-17242 debug: CDEBUG performance testing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 00918c7a9402599e1ffd56b613ddf5ec67cc421d

Comment by Andreas Dilger [ 05/Nov/23 ]

What would be incredibly useful for debugging is if there was some way to get some additional information printed with the kernel stack trace, such as which MDT/OST target a thread was working on, maybe if it is holding any DLM locks, etc.

One option would be to have some reserved fields in the thread-local storage or lu_env that holds pointers to the OBD device (or just the name), pointers to the DLM lock(s), etc. and then the LASSERT() or lbug_with_loc() looks up this information and prints it before triggering panic() or going to sleep. The fields in the thread-local area would need to be "well defined" so that they do not depend on the thread context, and they should always contain valid pointers (e.g. set when a DLM lock is acquired, NULL when the lock is released, or NULL when a server thread stops processing an RPC or when a client thread exits OSC/LOV/MDC/LMV).

It would also be useful on the server to print in the stack trace when the thread has a journal transaction open, and potentially this could also be submitted to the upstream kernel to print current->journal_info as part of the stack trace? For now, this could at least be printed by libcfs_call_trace().

It might be too messy to set/clear a field whenever a mutex/semaphore is held,

Thoughts?

Comment by Tim Day [ 06/Nov/23 ]

Seems useful. I think we could register a custom panic handler. I see upstream drivers (like drivers/net/ipa/ipa_smp2p.c) doing something like that. We could avoid extending custom Lustre debugging and it should work on every panic. Adding `current->journal_info` to the handler would be easy. Getting the Lustre specific info might be tougher, but I saw some ideas upstream we could probably copy. The `ipa` just embedded the `notifier_block` in a larger struct and used `container_of` to get everything else.

Comment by Gerrit Updater [ 08/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52883/
Subject: LU-17242 debug: remove CFS_CHECK_STACK
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: edb968d04f3a3c8054c12daee1ba557f855055ce

Comment by Gerrit Updater [ 09/Jan/24 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53625
Subject: LU-17242 debug: use dump_stack() where possible
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1debf098028fec7c27310d0985934eeee4e0a67a

Comment by Gerrit Updater [ 04/Feb/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53625/
Subject: LU-17242 debug: use dump_stack() where possible
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ecac0c175d934fd5624c9ad8db8f45dbc33fb56c

Generated at Sat Feb 10 03:33:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.