It would be very useful for debugging if some additional information could be printed with the kernel stack trace, such as which MDT/OST target the thread was working on, whether it is holding any DLM locks, etc.
One option would be to reserve some fields in the thread-local storage or lu_env that hold pointers to the OBD device (or just its name), pointers to the DLM lock(s), etc., and then have LASSERT() or lbug_with_loc() look up this information and print it before triggering panic() or going to sleep. The fields in the thread-local area would need to be "well defined" so that they do not depend on the thread context, and they should always contain valid pointers (e.g. set when a DLM lock is acquired, NULL when the lock is released, or NULL when a server thread stops processing an RPC or when a client thread exits OSC/LOV/MDC/LMV).
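To make the idea concrete, here is a rough userspace sketch of such a per-thread debug context. It is only an illustration under stated assumptions: struct dbg_ctx, the dbg_ctx_*() helpers, DBG_MAX_LOCKS and DBG_ASSERT() are all hypothetical names that do not exist in Lustre, C11 _Thread_local stands in for whatever per-thread storage (or lu_env key) would really be used, and abort() stands in for panic().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define DBG_MAX_LOCKS 4	/* arbitrary limit chosen for this sketch */

/* Hypothetical per-thread debug context; not an existing Lustre structure. */
struct dbg_ctx {
	const char *dc_obd_name;		/* target in use, NULL when idle */
	const void *dc_locks[DBG_MAX_LOCKS];	/* held locks, NULL = free slot */
};

/* Zero-initialized, so every field starts out as NULL. */
static _Thread_local struct dbg_ctx dbg_ctx;

/* Set when the thread starts working on a target; pass NULL when done. */
void dbg_ctx_set_obd(const char *name)
{
	dbg_ctx.dc_obd_name = name;
}

/* Record a lock once it is acquired; returns -1 if all slots are in use. */
int dbg_ctx_lock_add(const void *lock)
{
	for (int i = 0; i < DBG_MAX_LOCKS; i++) {
		if (dbg_ctx.dc_locks[i] == NULL) {
			dbg_ctx.dc_locks[i] = lock;
			return 0;
		}
	}
	return -1;
}

/* Clear the slot when the lock is released so no stale pointer remains. */
void dbg_ctx_lock_del(const void *lock)
{
	for (int i = 0; i < DBG_MAX_LOCKS; i++) {
		if (dbg_ctx.dc_locks[i] == lock) {
			dbg_ctx.dc_locks[i] = NULL;
			return;
		}
	}
}

/* Number of locks currently recorded for the calling thread. */
int dbg_ctx_locks_held(void)
{
	int count = 0;

	for (int i = 0; i < DBG_MAX_LOCKS; i++)
		if (dbg_ctx.dc_locks[i] != NULL)
			count++;
	return count;
}

/* Print the calling thread's context; meant to run just before a panic. */
void dbg_ctx_dump(FILE *out)
{
	assert(out != NULL);
	fprintf(out, "obd=%s locks_held=%d\n",
		dbg_ctx.dc_obd_name ? dbg_ctx.dc_obd_name : "(none)",
		dbg_ctx_locks_held());
	for (int i = 0; i < DBG_MAX_LOCKS; i++)
		if (dbg_ctx.dc_locks[i] != NULL)
			fprintf(out, "  lock=%p\n", (void *)dbg_ctx.dc_locks[i]);
}

/* Stand-in for LASSERT(): dump the context first, then abort(). */
#define DBG_ASSERT(cond)						\
do {									\
	if (!(cond)) {							\
		fprintf(stderr, "ASSERTION( %s ) failed\n", #cond);	\
		dbg_ctx_dump(stderr);					\
		abort();						\
	}								\
} while (0)
```

In this sketch a thread would call dbg_ctx_set_obd() with the target name when it starts an RPC and with NULL when it finishes, and dbg_ctx_lock_add()/dbg_ctx_lock_del() around lock acquire and release, so a failing DBG_ASSERT() only ever sees pointers that are either valid or NULL. The fixed-size slot array is just one possible choice; it keeps the dump path free of allocation and locking.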
On the server, it would also be useful for the stack trace to show whether the thread has a journal transaction open. Potentially this could also be submitted to the upstream kernel, so that current->journal_info is printed as part of the stack trace? For now, this could at least be printed by libcfs_call_trace().
It might be too messy to set/clear a field whenever a mutex/semaphore is held.
Thoughts?
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/58346
Subject: LU-17242 libcfs: deduplicate macros with ENUM2STR
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: de80f068969f03df28df2b3b6c61739613f0cab0