[LU-17584] track all taken semaphores/mutexes in Lustre to simplify debugging

Details

    • New Feature
    • Resolution: Unresolved
    • Trivial

    Description

      Probably not for landing, but it can still be useful in some cases.

          Activity

            bzzz Alex Zhuravlev added a comment - Neil suggested using the in-kernel lockdep feature, which makes a lot of sense (because then we would cover all in-kernel mutexes as well), but the problem is that lockdep turns itself off upon the first problem it detects, and Lustre hits one during every regular mount. I recall we discussed this a few times in the past, but it was said the problem is not serious and we can live with that warning, especially given that lockdep is disabled in production.

            adilger Andreas Dilger added a comment - This functionality would align closely with what I want in LU-16625 - some way for stack traces to dump LDLM lock resources, FIDs of files being modified, RPC XID and client NID, etc. so that when we see a stack trace we can actually have some idea of the file/client that is involved with the problem.
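
One way to picture the idea in the comment above is a small per-thread context record that a stack-dump path could print next to the trace. The following is only a user-space sketch under stated assumptions: `struct dbg_ctx`, `dbg_ctx_set()` and `dbg_ctx_format()` are hypothetical names, not Lustre APIs, the NID is kept as a plain string, and LDLM lock resources are left out for brevity.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-thread debugging context (not a Lustre structure). */
struct dbg_ctx {
	uint64_t fid_seq;        /* FID sequence */
	uint32_t fid_oid;        /* FID object id */
	uint32_t fid_ver;        /* FID version */
	uint64_t rpc_xid;        /* XID of the RPC being handled */
	char     client_nid[64]; /* client NID kept as text in this sketch */
};

/* One record per thread, so a dump of a thread sees its own context. */
static _Thread_local struct dbg_ctx cur_ctx;

/* Record what the current thread is working on. */
static void dbg_ctx_set(uint64_t seq, uint32_t oid, uint32_t ver,
			uint64_t xid, const char *nid)
{
	cur_ctx.fid_seq = seq;
	cur_ctx.fid_oid = oid;
	cur_ctx.fid_ver = ver;
	cur_ctx.rpc_xid = xid;
	/* snprintf truncates safely if the NID string is too long */
	snprintf(cur_ctx.client_nid, sizeof(cur_ctx.client_nid), "%s", nid);
}

/* Format the context into buf; returns what snprintf returns. */
static int dbg_ctx_format(char *buf, size_t len)
{
	return snprintf(buf, len, "fid=[%#llx:0x%x:0x%x] xid=%llu nid=%s",
			(unsigned long long)cur_ctx.fid_seq,
			cur_ctx.fid_oid, cur_ctx.fid_ver,
			(unsigned long long)cur_ctx.rpc_xid,
			cur_ctx.client_nid);
}
```

A request handler would call `dbg_ctx_set()` on entry, and whatever prints the stack trace would append the line produced by `dbg_ctx_format()`.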

            bzzz Alex Zhuravlev added a comment - example of output:

            [root@tmp ~]# cat /sys/kernel/debug/lnet/locks
            4 locks:
             10239 r_cache.c:378
             10239 /vvp_io.c:1354
             9996 /vvp_io.c:1354
             9610 te/file.c:5274
            bzzz Alex Zhuravlev added a comment - https://review.whamcloud.com/c/fs/lustre-release/+/53885
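
To illustrate how a listing like the `locks` output above could be produced, here is a minimal user-space sketch using pthreads. It is not the patch linked above. All names (`tracked_lock`, `tracked_unlock`, `track_dump`, `MAX_TRACKED`) are hypothetical, and it assumes the first column of the listing is a task identifier, which the ticket does not state.

```c
#include <pthread.h>
#include <stdio.h>

#define MAX_TRACKED 64 /* arbitrary table size for this sketch */

/* One slot per currently held lock. */
struct held_lock {
	const void   *lock;  /* address of the tracked mutex */
	unsigned long owner; /* caller-supplied task id (assumed meaning) */
	const char   *file;  /* call site that took the lock */
	int           line;
	int           used;
};

static struct held_lock held[MAX_TRACKED];
static pthread_mutex_t held_guard = PTHREAD_MUTEX_INITIALIZER;

/* Record a freshly taken lock in the first free slot. */
static void track_add(const void *lock, unsigned long owner,
		      const char *file, int line)
{
	pthread_mutex_lock(&held_guard);
	for (int i = 0; i < MAX_TRACKED; i++) {
		if (!held[i].used) {
			held[i] = (struct held_lock){ lock, owner, file, line, 1 };
			break;
		}
	}
	pthread_mutex_unlock(&held_guard);
}

/* Forget a lock that is about to be released. */
static void track_del(const void *lock)
{
	pthread_mutex_lock(&held_guard);
	for (int i = 0; i < MAX_TRACKED; i++) {
		if (held[i].used && held[i].lock == lock) {
			held[i].used = 0;
			break;
		}
	}
	pthread_mutex_unlock(&held_guard);
}

/* Write "N locks:" plus one "owner file:line" row per held lock.
 * Returns the number of held locks; output is truncated if buf is small. */
static int track_dump(char *buf, size_t len)
{
	int n = 0;
	size_t off = 0;

	pthread_mutex_lock(&held_guard);
	for (int i = 0; i < MAX_TRACKED; i++)
		n += held[i].used;
	if (len > 0) {
		int w = snprintf(buf, len, "%d locks:\n", n);

		if (w > 0 && (size_t)w < len)
			off = (size_t)w;
		for (int i = 0; off > 0 && i < MAX_TRACKED; i++) {
			if (!held[i].used)
				continue;
			w = snprintf(buf + off, len - off, " %lu %s:%d\n",
				     held[i].owner, held[i].file, held[i].line);
			if (w < 0 || (size_t)w >= len - off) {
				buf[off] = '\0'; /* drop the partial row */
				break;
			}
			off += (size_t)w;
		}
	}
	pthread_mutex_unlock(&held_guard);
	return n;
}

/* Wrappers capture the call site via __FILE__ and __LINE__. */
#define tracked_lock(m, owner) do {				\
	pthread_mutex_lock(m);					\
	track_add((m), (owner), __FILE__, __LINE__);		\
} while (0)

#define tracked_unlock(m) do {					\
	track_del(m);						\
	pthread_mutex_unlock(m);				\
} while (0)
```

Code would call `tracked_lock(&m, id)` and `tracked_unlock(&m)` in place of the plain pthread calls, and a debug hook would print the buffer filled by `track_dump()`. A fixed table behind a single guard mutex keeps the sketch short; a real tracker would need something cheaper on hot paths.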

            People

              wc-triage WC Triage
              bzzz Alex Zhuravlev