What would be incredibly useful for debugging is a way to print some additional information with the kernel stack trace, such as which MDT/OST target a thread was working on, whether it is holding any DLM locks, etc.
One option would be to reserve some fields in the thread-local storage or lu_env that hold pointers to the OBD device (or just its name), pointers to the DLM lock(s), etc., and then have LASSERT() or lbug_with_loc() look up this information and print it before triggering panic() or going to sleep. The fields in the thread-local area would need to be "well defined" so that they do not depend on the thread context, and they should always contain either a valid pointer or NULL (e.g. set when a DLM lock is acquired, NULL when the lock is released, or NULL when a server thread stops processing an RPC or when a client thread exits OSC/LOV/MDC/LMV).
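As a rough userspace sketch of the idea above: every name here (dbg_thread_ctx, dbg_ctx_enter(), etc.) is hypothetical and not an existing Lustre symbol. A real version would presumably keep these fields in lu_env or other per-thread kernel data and print them from the LASSERT()/lbug_with_loc() path rather than formatting into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DBG_MAX_LOCKS 4	/* arbitrary limit chosen for this sketch */

/* Hypothetical per-thread debug context: each field is a valid pointer or NULL. */
struct dbg_thread_ctx {
	const char *dtc_obd_name;		/* target name, NULL when idle */
	const void *dtc_locks[DBG_MAX_LOCKS];	/* held locks, NULL = free slot */
};

static _Thread_local struct dbg_thread_ctx dbg_ctx;

/* Called when a thread starts working on a target (e.g. begins an RPC). */
static void dbg_ctx_enter(const char *obd_name)
{
	dbg_ctx.dtc_obd_name = obd_name;
}

/* Called when the thread is done; resets every field to NULL. */
static void dbg_ctx_exit(void)
{
	dbg_ctx = (struct dbg_thread_ctx){ 0 };
}

/* Record a lock at acquire time; returns -1 if there is no free slot. */
static int dbg_ctx_lock_add(const void *lock)
{
	int i;

	for (i = 0; i < DBG_MAX_LOCKS; i++) {
		if (dbg_ctx.dtc_locks[i] == NULL) {
			dbg_ctx.dtc_locks[i] = lock;
			return 0;
		}
	}
	return -1;
}

/* Clear the slot at release time so a stale pointer is never printed. */
static void dbg_ctx_lock_del(const void *lock)
{
	int i;

	assert(lock != NULL);
	for (i = 0; i < DBG_MAX_LOCKS; i++)
		if (dbg_ctx.dtc_locks[i] == lock)
			dbg_ctx.dtc_locks[i] = NULL;
}

/* What an assertion handler might emit just before dumping the stack. */
static int dbg_ctx_format(char *buf, size_t len)
{
	int i, held = 0;

	for (i = 0; i < DBG_MAX_LOCKS; i++)
		if (dbg_ctx.dtc_locks[i] != NULL)
			held++;

	return snprintf(buf, len, "target=%s locks=%d",
			dbg_ctx.dtc_obd_name ? dbg_ctx.dtc_obd_name : "(none)",
			held);
}
```

The point of the fixed layout is that the failure path only reads fields that are either NULL or valid, so it never has to guess what kind of thread it is running in.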
It would also be useful on the server to print in the stack trace whether the thread has a journal transaction open. Potentially this could also be submitted to the upstream kernel, so that current->journal_info is printed as part of the stack trace. For now, this could at least be printed by libcfs_call_trace().
It might be too messy to set/clear a field whenever a mutex/semaphore is held.
Thoughts?
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53625/
Subject: LU-17242 debug: use dump_stack() where possible
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ecac0c175d934fd5624c9ad8db8f45dbc33fb56c