[LU-17242] Clean up and Improve Lustre Debugging

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Some parts of Lustre debugging (such as CFS_CHECK_STACK) have been superseded by newer kernel features and should be removed.

      Certain subsystems stack custom macros on top of Lustre's already custom debugging macros. Those should be simplified where possible.

      Other subsystems call low-level debugging functions such as libcfs_debug_msg directly. These should use the higher-level macros instead, so that the underlying debugging implementation can be swapped out more easily.
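
To illustrate why callers are better off going through a macro, here is a minimal userspace sketch. Every name in it (`DEMO_CDEBUG`, `demo_debug_msg`, `demo_backend`, the `DEMO_D_*` bits) is hypothetical and stands in for the general pattern only; it is not Lustre's actual CDEBUG or libcfs_debug_msg implementation.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical mask bits; these are not Lustre's real D_* values. */
#define DEMO_D_INFO  0x1u
#define DEMO_D_ERROR 0x2u

/* Signature shared by every backend that can sit under the macro. */
typedef void (*demo_debug_backend_t)(unsigned int mask, const char *file,
				     int line, const char *fmt, va_list args);

/* Backend 1: keep the most recent message in a buffer. */
static char demo_last_msg[256];

static void demo_buffer_backend(unsigned int mask, const char *file, int line,
				const char *fmt, va_list args)
{
	(void)mask;
	(void)file;
	(void)line;
	vsnprintf(demo_last_msg, sizeof(demo_last_msg), fmt, args);
}

/* Backend 2: print the message to stderr with its call site. */
static void demo_stderr_backend(unsigned int mask, const char *file, int line,
				const char *fmt, va_list args)
{
	fprintf(stderr, "[%#x] %s:%d: ", mask, file, line);
	vfprintf(stderr, fmt, args);
}

static demo_debug_backend_t demo_backend = demo_buffer_backend;
static unsigned int demo_debug_mask = DEMO_D_ERROR;

/* Low-level entry point; only the macro below is meant to call it. */
static void demo_debug_msg(unsigned int mask, const char *file, int line,
			   const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	demo_backend(mask, file, line, fmt, args);
	va_end(args);
}

/* High-level macro: filters on the mask and records the call site. */
#define DEMO_CDEBUG(mask, ...)						\
	do {								\
		if ((mask) & demo_debug_mask)				\
			demo_debug_msg((mask), __FILE__, __LINE__,	\
				       __VA_ARGS__);			\
	} while (0)
```

With this shape, a caller writes `DEMO_CDEBUG(DEMO_D_ERROR, "rc = %d", rc);` and never names the backend. Assigning `demo_backend = demo_stderr_backend;` reroutes every call site without touching any of them, which is the property that direct calls to the low-level function give up.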

          Activity

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/58346
            Subject: LU-17242 libcfs: deduplicate macros with ENUM2STR
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: de80f068969f03df28df2b3b6c61739613f0cab0

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57665
            Subject: LU-17242 libcfs: implement LUSTRE_TRACE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 332c509f7805d3a06461627622933511ecf37e11

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57356
            Subject: LU-17242 libcfs: use sched_show_task() for thread dumping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a309b96ebfd50ed7e34404839b7164965f40876b

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53625/
            Subject: LU-17242 debug: use dump_stack() where possible
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ecac0c175d934fd5624c9ad8db8f45dbc33fb56c

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53625
            Subject: LU-17242 debug: use dump_stack() where possible
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1debf098028fec7c27310d0985934eeee4e0a67a

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52883/
            Subject: LU-17242 debug: remove CFS_CHECK_STACK
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: edb968d04f3a3c8054c12daee1ba557f855055ce

            timday Tim Day added a comment - - edited

            Seems useful. I think we could register a custom panic handler. I see upstream drivers (like drivers/net/ipa/ipa_smp2p.c) doing something like that. That way we avoid extending the custom Lustre debugging, and it should work on every panic. Adding `current->journal_info` to the handler would be easy. Getting the Lustre-specific info might be tougher, but I saw some ideas upstream we could probably copy. The `ipa` driver just embeds the `notifier_block` in a larger struct and uses `container_of` to get everything else.
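
The embedding trick mentioned above can be sketched in plain userspace C. This is only an illustration of the pattern: `demo_notifier_block`, `demo_container_of`, `demo_target` and `demo_fire_panic` are hypothetical stand-ins, not the kernel's or Lustre's definitions. In a real module the block would be registered on the kernel's panic notifier list rather than invoked by hand.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified userspace stand-in for the kernel's struct notifier_block. */
struct demo_notifier_block {
	int (*notifier_call)(struct demo_notifier_block *nb,
			     unsigned long event, void *data);
};

/* Same idea as the kernel's container_of(), built on offsetof(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical per-target state with the notifier embedded in it. */
struct demo_target {
	const char *name;
	void *journal_info;	/* stand-in for current->journal_info */
	struct demo_notifier_block panic_nb;
	char report[128];
};

/* The handler only receives the notifier block, yet it can recover the
 * enclosing structure and report target-specific state. */
static int demo_panic_handler(struct demo_notifier_block *nb,
			      unsigned long event, void *data)
{
	struct demo_target *tgt =
		demo_container_of(nb, struct demo_target, panic_nb);

	(void)event;
	(void)data;
	snprintf(tgt->report, sizeof(tgt->report), "target=%s journal=%s",
		 tgt->name, tgt->journal_info ? "open" : "none");
	return 0;
}

/* Stand-in for the kernel walking the panic notifier chain. */
static int demo_fire_panic(struct demo_notifier_block *nb)
{
	return nb->notifier_call(nb, 0, NULL);
}
```

Because the handler is handed nothing but the embedded block, no global lookup table is needed to find the per-target data; the address arithmetic in `demo_container_of` does that work.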

            What would be incredibly useful for debugging is if there was some way to get some additional information printed with the kernel stack trace, such as which MDT/OST target a thread was working on, maybe if it is holding any DLM locks, etc.

            One option would be to have some reserved fields in the thread-local storage or lu_env that hold pointers to the OBD device (or just the name), pointers to the DLM lock(s), etc. LASSERT() or lbug_with_loc() would then look up this information and print it before triggering panic() or going to sleep. The fields in the thread-local area would need to be "well defined" so that they do not depend on the thread context, and they should always contain either a valid pointer or NULL (e.g. set when a DLM lock is acquired, NULL when the lock is released, or NULL when a server thread stops processing an RPC or when a client thread exits OSC/LOV/MDC/LMV).

            It would also be useful on the server to print in the stack trace whether the thread has a journal transaction open. Potentially this could also be submitted to the upstream kernel, to print current->journal_info as part of the stack trace? For now, this could at least be printed by libcfs_call_trace().

            It might be too messy to set/clear a field whenever a mutex/semaphore is held,

            Thoughts?

            adilger Andreas Dilger added the comment above.
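
As a rough illustration of the reserved-fields idea, the following userspace sketch keeps a small per-thread context that is set and cleared at well-defined points and dumped on the failure path. All names (`demo_thread_ctx`, `demo_ctx_enter`, `demo_ctx_dump`, and so on) and the fixed slot count are assumptions made for the example; they are not existing Lustre or lu_env interfaces, and C11 `_Thread_local` stands in for whatever per-thread storage the kernel code would use.

```c
#include <stddef.h>
#include <stdio.h>

#define DEMO_MAX_LOCKS 4	/* arbitrary slot count for the sketch */

/* Hypothetical "well defined" per-thread fields: each is either a valid
 * pointer or NULL, independent of what the thread is otherwise doing. */
struct demo_thread_ctx {
	const char *obd_name;			/* target in use, or NULL */
	const void *locks[DEMO_MAX_LOCKS];	/* held locks, NULL if free */
};

static _Thread_local struct demo_thread_ctx demo_ctx;

/* Set when a thread starts working on a target, cleared when it stops. */
static void demo_ctx_enter(const char *obd_name)
{
	demo_ctx.obd_name = obd_name;
}

static void demo_ctx_exit(void)
{
	demo_ctx.obd_name = NULL;
}

/* Record a lock in the first free slot; returns the slot, or -1 if full. */
static int demo_ctx_lock_acquired(const void *lock)
{
	for (int i = 0; i < DEMO_MAX_LOCKS; i++) {
		if (demo_ctx.locks[i] == NULL) {
			demo_ctx.locks[i] = lock;
			return i;
		}
	}
	return -1;
}

/* Clear the slot again when the lock is released. */
static void demo_ctx_lock_released(const void *lock)
{
	for (int i = 0; i < DEMO_MAX_LOCKS; i++) {
		if (demo_ctx.locks[i] == lock) {
			demo_ctx.locks[i] = NULL;
			return;
		}
	}
}

/* What an assertion path could print before panicking.
 * Returns the number of locks currently recorded. */
static int demo_ctx_dump(char *buf, size_t len)
{
	int held = 0;

	for (int i = 0; i < DEMO_MAX_LOCKS; i++)
		if (demo_ctx.locks[i] != NULL)
			held++;
	snprintf(buf, len, "obd=%s locks_held=%d",
		 demo_ctx.obd_name ? demo_ctx.obd_name : "(none)", held);
	return held;
}
```

The dump routine only reads fields that are always either valid or NULL, so it stays safe to call from an assertion handler regardless of where the thread was when it failed.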

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52946
            Subject: LU-17242 debug: CDEBUG performance testing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 00918c7a9402599e1ffd56b613ddf5ec67cc421d

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52897
            Subject: LU-17242 ptlrpc: refactor DEBUG_REQ to use CDEBUG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 16368c2dbe8d3b095f43701237eed490e5a93c6d

            People

              Assignee: Tim Day (timday)
              Reporter: Tim Day (timday)
              Votes: 0
              Watchers: 7
