a mechanism to inform other nodes to dump debug log

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, code can call libcfs_debug_dumplog() to dump the debug log of the local node to help with debugging. Often, however, we also want the debug logs of other nodes, because a symptom seen on this node may be caused by a bug on another node.

      Having a mechanism to trigger debug log dumping on remote nodes would greatly simplify cross-node debugging. A user should be able to run something like "lctl dk --client[=NID[,NID]]" to trigger a debug log dump locally and/or on the specified remote NIDs. Similarly, "lctl dk --mds[=IDX[,IDX]]" should dump logs on all or some MDS nodes, and "lctl dk --oss[=IDX[,IDX]]" should do the same on all or some OSS nodes. This would provide a powerful debugging feature to help isolate issues on multiple remote nodes, especially if they are not directly accessible from the server or client, and SSH into the server is not allowed from the client.
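
To make the proposed option syntax concrete, here is a minimal Python sketch of how the three target options could be parsed. It is only an illustration of the syntax described above: the function name parse_targets and the "all" sentinel are hypothetical, and nothing here reflects how lctl is actually implemented.

```python
import argparse

# Hypothetical sentinel meaning "the option was given without a list".
ALL = "all"

ROLES = ("client", "mds", "oss")


def parse_targets(argv):
    """Parse --client[=NID[,NID]], --mds[=IDX[,IDX]] and --oss[=IDX[,IDX]].

    Returns a dict mapping each role that was requested to either ALL or
    a list of NID strings (clients) or integer indices (MDS/OSS).
    """
    parser = argparse.ArgumentParser(prog="lctl dk")
    for role in ROLES:
        # nargs="?" makes the value optional; const is used when it is absent.
        parser.add_argument(f"--{role}", nargs="?", const=ALL, default=None)
    args = parser.parse_args(argv)

    targets = {}
    for role in ROLES:
        value = getattr(args, role)
        if value is None:
            continue  # option not given: this role is not targeted
        if value == ALL:
            targets[role] = ALL
        elif role == "client":
            targets[role] = value.split(",")  # NIDs stay as strings
        else:
            targets[role] = [int(idx) for idx in value.split(",")]
    return targets


if __name__ == "__main__":
    print(parse_targets(["--client=10.0.0.1@tcp,10.0.0.2@tcp", "--mds"]))
```

An option given without a value selects every node of that role, while an omitted option selects none, which matches the "all or some" wording above.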

      There should be a parameter to control log dumping on the server, such as debug_enable_remote, to prevent malicious users from dumping the server logs and filling up the local storage. For the same reason, there should be a mechanism to avoid multiple log dumps for a single request: a unique ID in the RPC would be useful, so that clients can cache it for a minute and drop any logdump RPCs carrying the same ID.
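
The two safeguards above (an enable switch and a one-minute cache of request IDs) could be combined roughly as follows. This is a sketch under stated assumptions, not the patch's implementation: the class name DumpRequestFilter, the method should_dump, and the injectable clock are all hypothetical; only the debug_enable_remote idea and the one-minute window come from the description.

```python
import time


class DumpRequestFilter:
    """Decide whether an incoming log-dump request should be honoured."""

    def __init__(self, enable_remote=False, ttl=60.0, clock=time.monotonic):
        self.enable_remote = enable_remote  # stands in for debug_enable_remote
        self.ttl = ttl                      # how long a request ID is remembered
        self.clock = clock                  # injectable for testing
        self._seen = {}                     # request ID -> expiry timestamp

    def should_dump(self, request_id):
        if not self.enable_remote:
            return False  # remote-triggered dumps are disabled on this node

        now = self.clock()
        # Forget IDs whose remembering window has passed.
        self._seen = {rid: exp for rid, exp in self._seen.items() if exp > now}

        if request_id in self._seen:
            return False  # duplicate within the window: drop it

        self._seen[request_id] = now + self.ttl
        return True


if __name__ == "__main__":
    f = DumpRequestFilter(enable_remote=True)
    print(f.should_dump("req-1"), f.should_dump("req-1"))
```

Defaulting enable_remote to off is a choice made for this sketch; the description does not say what the default should be.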

      There should also be a tunable parameter that allows non-root users to trigger the debug log dump, though, for security reasons, not to access the log file itself. This allows log dumping to be triggered directly by the application or job scheduler in case of an application-level error, without the need to run as root or wait for an admin to become available. One option would be something like debug_gid=0 by default for root-only log dumping, debug_gid=GID for an administrative group, or debug_gid=-1 to allow all users to do this.
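
The debug_gid semantics suggested above can be written down as a small permission check. This is an illustrative sketch only: the function name may_trigger_dump is hypothetical, and it assumes that root may always trigger a dump and that debug_gid=0 means "root user only" rather than "members of group 0".

```python
ROOT_UID = 0


def may_trigger_dump(uid, gids, debug_gid=0):
    """Return True if a user may trigger (not read) a debug log dump.

    uid       -- user ID of the caller
    gids      -- iterable of group IDs the caller belongs to
    debug_gid -- 0: root only (assumed default), -1: everyone,
                 any other value: members of that group
    """
    if uid == ROOT_UID:
        return True   # assumption: root is always allowed
    if debug_gid == -1:
        return True   # all users may trigger a dump
    if debug_gid == 0:
        return False  # root-only mode, and the caller is not root
    return debug_gid in gids


if __name__ == "__main__":
    print(may_trigger_dump(1000, [1000, 500], debug_gid=500))
```

The check only governs who may trigger a dump; access to the resulting log file would still be restricted separately, as the description requires.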

          Activity

            [LU-16357] a mechanism to inform other nodes to dump debug log

            Gerrit Updater added a comment:

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56903
            Subject: LU-16357 general: collective trace dump
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 45f5150d9e12fefc98cce0705071c484ea785871

            Gerrit Updater added a comment (edited):

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49290
            Subject: LU-16357 obdclass: dump debug log on remote node
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0403a216c840c0d01b718a1b9b2cb878c4970d73


            People

              laisiyao Lai Siyao
              laisiyao Lai Siyao
              Votes: 0
              Watchers: 5
