Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, the request history exposed via lctl get_param ost.OSS.ost_io.req_buffer_history and lctl get_param mds.MDS.mdt.req_history only provides client NIDs. It would be much more helpful if the client UIDs and GIDs were included as well.
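
      For illustration, a minimal Python sketch of what the request history gives today. It assumes only the lctl get_param interface named above and that each history line carries the client NID in addr@net form; the rest of the line layout is version-dependent, so nothing else is parsed here:

      import re
      import subprocess

      # NIDs look like <addr>@<net>, e.g. 192.168.1.10@o2ib or 10.0.0.1@tcp.
      NID_RE = re.compile(r"\S+@(?:tcp|o2ib)\d*")

      def client_nids(param="mds.MDS.mdt.req_history"):
          # Read the request history through the documented lctl interface.
          out = subprocess.run(["lctl", "get_param", "-n", param],
                               capture_output=True, text=True, check=True).stdout
          for line in out.splitlines():
              match = NID_RE.search(line)
              if match:
                  yield match.group(0)

      if __name__ == "__main__":
          # Today the client NID is the only identity information available here;
          # the request is to also expose the UID/GID of the requesting process.
          for nid in client_nids():
              print(nid)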

          Activity

            [LU-18180] UID in req_buffer_history

            adilger Andreas Dilger added a comment:

            I've flip-flopped and don't think LU-18179 is the right place for my comments either. I'll post them here, and you can tell me if this is related to what you want the UID/GID for...

            Rather than tuning a large (and continually changing) number of e.g. UID rules, it would be best to set the default TBF rules to automatically throttle jobs (by UID or JobID or NID or PROJID) that are using too much of the server resources when there is contention on the server. The TBF rules are already processing every RPC that arrives at the server, so IMHO this is the right place to detect RPC overload and throttle the offenders rather than adding an extra layer to process the RPCs again in userspace.

            It should be possible to specify a default TBF rule like "change default rate=1000" as described in LU-14501 to cap individual UIDs at 1000 RPCs/sec, but if the server cannot process the RPCs at the required rate across all UIDs then it will try to evenly balance the available processing rate across the UIDs submitting RPCs. For example, say the OST can handle 2000 IOPS in total. If there are only 2 UIDs running IOPS-intensive jobs, each one should be able to use up to their full 1000 IOPS limit. If there is one UID with an IOPS-intensive job (1000 IOPS+), but 9 other UIDs trying to run "normal" jobs (150 IOPS) at the same time, then each UID would initially get 2000/10 = 200 IOPS. The 9 "normal" jobs would have all of their RPCs processed every second, total 9 x 150 = 1350 IOPS, and the one IOPS-intensive job could use the remaining 650 IOPS without affecting the other jobs. There should be some "memory/credit" for UIDs that don't use all of their IOPS in the last few slices.
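
            A toy Python sketch of the fair-sharing arithmetic described above, using the numbers from the example; this only models the intended behaviour and is not the actual TBF algorithm:

            # Cap each UID at its per-UID TBF rate, then let UIDs that need less than an
            # equal share of the server's capacity donate the surplus to the remaining UIDs.
            def fair_share(total_iops, demands, per_uid_cap):
                """demands: {uid: requested IOPS}; returns {uid: granted IOPS}."""
                grant = {}
                remaining = total_iops
                pending = dict(demands)
                while pending:
                    share = remaining / len(pending)
                    # First satisfy every UID whose (capped) demand fits within an equal share.
                    satisfied = {u: min(d, per_uid_cap) for u, d in pending.items()
                                 if min(d, per_uid_cap) <= share}
                    if not satisfied:
                        # Everyone left wants more than the equal share: split it evenly.
                        return {**grant, **{u: share for u in pending}}
                    for u, g in satisfied.items():
                        grant[u] = g
                        remaining -= g
                        del pending[u]
                return grant

            # Two IOPS-hungry UIDs on a 2000-IOPS OST: each reaches its full 1000-IOPS cap.
            print(fair_share(2000, {1001: 5000, 1002: 5000}, per_uid_cap=1000))

            # One hungry UID plus nine "normal" 150-IOPS UIDs: the normal jobs get the full
            # 9 x 150 = 1350 IOPS they asked for, and the hungry UID gets the remaining 650.
            normal = {uid: 150 for uid in range(2001, 2010)}
            print(fair_share(2000, {1001: 5000, **normal}, per_uid_cap=1000))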


            adilger Andreas Dilger added a comment:

            Since LU-16077 landed, there are UID and GID fields in the ptlrpc_body for the process that is triggering the RPC, not necessarily for the inode/object that the RPC was operating on. These could be added to req_buffer_history relatively easily.

            However, it would also be good to know how you are planning to use this information. From our recent discussion it sounds like you are parsing the req_buffer_history in real time to continually tune TBF rules to slow down "bad" user jobs? IMHO this is the wrong way to use TBF, since it basically duplicates the functionality of TBF in userspace, adds needless overhead by processing every request (in ASCII) again in userspace, and requires adding a large number of TBF rules targeting specific users.

            IMHO, it would be better to specify "generic" TBF rules that would automatically throttle jobs (by UID or JobID or NID) that are using too much of the server resources when there is contention on the server. The TBF rules are already processing every RPC that arrives at the server, so IMHO this is the right place to detect RPC overload and throttle the offenders rather than adding an extra layer to process the RPCs again in userspace. See my comments in LU-18179 for details, to keep the TBF discussion in one place.
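
            For illustration, a Python sketch of the two approaches being contrasted here. The nrs_tbf_rule strings are assumptions about the rule syntax, which varies by Lustre version, and the "change default rate=..." form is the one proposed in LU-14501 rather than something every release necessarily accepts:

            import subprocess

            def lctl_set(param, value):
                # lctl set_param takes a single "param=value" argument.
                subprocess.run(["lctl", "set_param", f"{param}={value}"], check=True)

            OST_IO = "ost.OSS.ost_io"

            # (a) The approach argued against: a userspace daemon re-parses
            #     req_buffer_history and churns out one TBF rule per "bad" UID it spots.
            def throttle_specific_uids(bad_uids, rate=100):
                for uid in bad_uids:
                    lctl_set(f"{OST_IO}.nrs_tbf_rule",
                             f"start slow_uid_{uid} uid={{{uid}}} rate={rate}")

            # (b) The approach argued for: one generic default rule, set once, so the TBF
            #     engine (which already sees every RPC) does the throttling in the kernel.
            def set_generic_default(rate=1000):
                lctl_set(f"{OST_IO}.nrs_tbf_rule", f"change default rate={rate}")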


            People

              Assignee: Michael Aguilar (mjaguil)
              Reporter: Michael Aguilar (mjaguil)
              Votes: 0
              Watchers: 2
