Details
Description
I was thinking about how we might improve the debugging of Lustre threads that are busy (e.g. threads stuck in ldlm_cli_enqueue_local() or ldlm_completion_ast() or possibly on a mutex/spinlock).
One thing that would help, especially for post-facto debugging where we only have watchdog stack traces dumped to dmesg/messages, would be to print the FID/resource of locks that the thread is holding and/or blocked on as part of the watchdog stack trace. Doing this in a generic way would be difficult, so one option would be to create a thread-local data structure (similar to, or part of, "env") containing "well-known" slots for e.g. parent/child FIDs, locked LDLM resources, and the next LDLM resource to lock. These could be kept in ASCII format so that the whole chunk can be printed as-is without much interpretation (maybe walking through NUL-terminated 32-char slots and only printing those that are in use).
It would also be useful to print the JobID and XID of the RPC that is being processed.
Potentially this structure could be looked up by PID during the watchdog stack dump, but it would typically only be accessed by the local thread.
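To make the slot idea concrete, here is a rough userspace sketch of what such a structure might look like. It is only an illustration under assumptions: the names (struct thread_debug_info, tdi_set(), tdi_clear(), tdi_dump(), the slot enum) are all hypothetical and not existing Lustre code, the C11 _Thread_local storage stands in for whatever per-thread storage the kernel code would actually use (e.g. something hanging off "env"), and lookup by PID from the watchdog is not shown.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define TDI_SLOT_SIZE 32	/* NUL-terminated 32-char slots, as proposed above */

/* Hypothetical "well-known" slots; the names are illustrative only. */
enum tdi_slot {
	TDI_PARENT_FID,
	TDI_CHILD_FID,
	TDI_LOCKED_RES,
	TDI_NEXT_RES,
	TDI_JOBID,
	TDI_XID,
	TDI_SLOT_COUNT
};

struct thread_debug_info {
	char tdi_slots[TDI_SLOT_COUNT][TDI_SLOT_SIZE];
};

/* One zero-initialized instance per thread (assumption: C11 thread-local). */
static _Thread_local struct thread_debug_info tdi_local;

struct thread_debug_info *tdi_current(void)
{
	return &tdi_local;
}

/* Format a value into a slot as ASCII; longer strings are truncated. */
void tdi_set(struct thread_debug_info *tdi, enum tdi_slot slot,
	     const char *fmt, ...)
{
	va_list ap;

	assert(slot < TDI_SLOT_COUNT);
	va_start(ap, fmt);
	vsnprintf(tdi->tdi_slots[slot], TDI_SLOT_SIZE, fmt, ap);
	va_end(ap);
}

/* Mark a slot unused, e.g. once the lock has been released. */
void tdi_clear(struct thread_debug_info *tdi, enum tdi_slot slot)
{
	assert(slot < TDI_SLOT_COUNT);
	tdi->tdi_slots[slot][0] = '\0';
}

/*
 * Walk the slots and append "name=value " for each one in use.
 * Returns the number of slots written; stops early if buf is full.
 */
int tdi_dump(const struct thread_debug_info *tdi, char *buf, size_t len)
{
	static const char *const names[TDI_SLOT_COUNT] = {
		"parent", "child", "locked", "next", "jobid", "xid"
	};
	size_t off = 0;
	int used = 0;
	int i;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (i = 0; i < TDI_SLOT_COUNT; i++) {
		int n;

		if (tdi->tdi_slots[i][0] == '\0')
			continue;	/* unused slot, print nothing */
		n = snprintf(buf + off, len - off, "%s=%s ",
			     names[i], tdi->tdi_slots[i]);
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0';	/* drop the partial entry */
			break;
		}
		off += (size_t)n;
		used++;
	}
	return used;
}
```

In this sketch a service thread would call tdi_set() just before enqueueing a lock and tdi_clear() after dropping it, and the watchdog would print the tdi_dump() output alongside the stack trace. Keeping the values as preformatted ASCII means the dump path only copies strings and does not need to interpret FIDs or resource names.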
The main benefit here would be that instead of just seeing the stack traces being dumped, we could also see which resources the thread is (or was) holding, which would make it much easier to analyze stuck-thread issues after the fact.