[LU-17242] Clean up and Improve Lustre Debugging Created: 30/Oct/23 Updated: 04/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Tim Day | Assignee: | Tim Day |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Some parts of Lustre debugging have been superseded by newer kernel features (such as CFS_CHECK_STACK) and should be removed. Certain subsystems stack custom macros on-top of Lustre's already custom debugging. Those should be simplified if possible. Other subsystems use low-level debugging functions such as libcfs_debug_msg. These should use the higher level macros. That way, the underlying debugging implementation can more easily be swapped out. |
| Comments |
| Comment by Gerrit Updater [ 30/Oct/23 ] |
|
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52883 |
| Comment by Gerrit Updater [ 31/Oct/23 ] |
|
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52897 |
| Comment by Gerrit Updater [ 02/Nov/23 ] |
|
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52946 |
| Comment by Andreas Dilger [ 05/Nov/23 ] |
|
What would be incredibly useful for debugging is if there was some way to get some additional information printed with the kernel stack trace, such as which MDT/OST target a thread was working on, maybe if it is holding any DLM locks, etc. One option would be to have some reserved fields in the thread-local storage or lu_env that holds pointers to the OBD device (or just the name), pointers to the DLM lock(s), etc. and then the LASSERT() or lbug_with_loc() looks up this information and prints it before triggering panic() or going to sleep. The fields in the thread-local area would need to be "well defined" so that they do not depend on the thread context, and they should always contain valid pointers (e.g. set when a DLM lock is acquired, NULL when the lock is released, or NULL when a server thread stops processing an RPC or when a client thread exits OSC/LOV/MDC/LMV). It would also be useful on the server to print in the stack trace when the thread has a journal transaction open, and potentially this could also be submitted to the upstream kernel to print current->journal_info as part of the stack trace? For now, this could at least be printed by libcfs_call_trace(). It might be too messy to set/clear a field whenever a mutex/semaphore is held, Thoughts? |
| Comment by Tim Day [ 06/Nov/23 ] |
|
Seems useful. I think we could register a custom panic handler. I see upstream drivers (like drivers/net/ipa/ipa_smp2p.c) doing something like that. We could avoid extending custom Lustre debugging and it should work on every panic. Adding `current->journal_info` to the handler would be easy. Getting the Lustre specific info might be tougher, but I saw some ideas upstream we could probably copy. The `ipa` just embedded the `notifier_block` in a larger struct and used `container_of` to get everything else. |
| Comment by Gerrit Updater [ 08/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52883/ |
| Comment by Gerrit Updater [ 09/Jan/24 ] |
|
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53625 |
| Comment by Gerrit Updater [ 04/Feb/24 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53625/ |