Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Lustre has a server-side QoS mechanism based on the NRS TBF policy (LU-3558). The NRS TBF policy can enforce rate limits based on both NID rules and JobID rules. However, when JobID-based TBF rules are used and multiple jobs run on the same client, the RPC rates of those jobs affect each other. More precisely, a job with a high RPC rate limit may actually get a low RPC rate, because a job with a lower RPC rate limit can exhaust the client's max-in-flight-RPC-number limit or its max-cached-pages limit.

      In order to prevent this from happening, a client-side mechanism needs to be added to make RPC sending at least fairer across jobs.
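
      The concrete changes are in the patches referenced in the activity below. As a rough sketch of the idea only (the structure and function names here are hypothetical, not the patch code), the client could keep a per-JobID account and cap each active job's share of the in-flight RPC budget:

      /* Illustrative sketch only: hypothetical names, not the LU-7982 patch code.
       * Idea: split the client's max-RPCs-in-flight budget evenly among the
       * job IDs that currently have I/O queued, so a heavily throttled job
       * cannot hold every RPC slot and starve the faster jobs. */
      #include <stdbool.h>

      struct qos_job_account {
              char    qja_jobid[32];          /* job identifier, e.g. "dd.0" */
              int     qja_rpcs_in_flight;     /* RPCs this job has in flight */
      };

      struct qos_client_state {
              int     qcs_max_rpcs_in_flight; /* global per-OSC limit */
              int     qcs_nr_active_jobs;     /* jobs with queued I/O */
      };

      /* May this job send one more RPC without exceeding its fair share? */
      static bool qos_job_may_send(const struct qos_client_state *state,
                                   const struct qos_job_account *job)
      {
              int fair_share;

              if (state->qcs_nr_active_jobs <= 1)
                      return true;

              /* Even split, rounded up so no slot is left permanently unused. */
              fair_share = (state->qcs_max_rpcs_in_flight +
                            state->qcs_nr_active_jobs - 1) /
                           state->qcs_nr_active_jobs;

              return job->qja_rpcs_in_flight < fair_share;
      }

      The same even split would apply to the max-cached-pages budget.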

        Activity

          [LU-7982] Client side QoS based on jobid

          lixi Li Xi (Inactive) added a comment:

          I am wondering whether we could add cgroup support for JobID. For example, if obd_jobid_var is "cgroup_path", we could use the path of the cgroup as the JobID. This prevents code duplication and also enables us to implement cgroup support for QoS on the server side in the future.

          lixi Li Xi (Inactive) added a comment:

          Does anybody have some time to review these two patches?

          http://review.whamcloud.com/#/c/19729/
          http://review.whamcloud.com/#/c/19896/

          We are also going to work on cgroup support. The current QoS is based on JobID. Since the cgroup path of a task can be obtained via task_cgroup_path(), we should be able to add cgroup support for QoS easily, like what we did with the NRS TBF jobid/NID rules.
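
          As a minimal sketch of what a cgroup-derived JobID could look like on the client (a hypothetical helper, not the eventual patch; the return-value convention of task_cgroup_path() should be checked against the target kernel):

          #include <linux/cgroup.h>
          #include <linux/sched.h>
          #include <linux/types.h>

          /* Hypothetical helper: when obd_jobid_var is set to "cgroup_path",
           * derive the JobID from the current task's cgroup path instead of a
           * job scheduler environment variable. Assumes task_cgroup_path()
           * fills @jobid with the path and returns a negative errno on
           * failure (verify against the target kernel version). */
          static int jobid_from_cgroup(char *jobid, size_t jobid_len)
          {
                  int rc;

                  rc = task_cgroup_path(current, jobid, jobid_len);
                  if (rc < 0)
                          return rc;

                  /* Server-side NRS TBF jobid rules could then match on the
                   * cgroup path string, just as they match scheduler JobIDs. */
                  return 0;
          }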

          lixi Li Xi (Inactive) added a comment:

          We've got encouraging results showing that the QoS patches finally work well. In order
          to check the function, we need to use the NRS TBF policy on the OSS.

          The following are the results without the QoS patches:

          1. Run dd alone (NRS policy: fifo)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 13.479 s, 79.7 MB/s
          
          2. Run dd/mydd at the same time on the same client (NRS policy: fifo)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 24.6039 s, 43.6 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 25.3535 s, 42.4 MB/s
          
          3. Run dd/mydd/thdd at the same time on the same client (NRS policy: fifo)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 31.3823 s, 34.2 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 32.4403 s, 33.1 MB/s
          [root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 34.4943 s, 31.1 MB/s
          
          4. Change the NRS policy to TBF jobid
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
          ost.OSS.ost_io.nrs_policies=tbf jobid
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20" 
          ost.OSS.ost_io.nrs_tbf_rule=start A {dd.0} 20
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"
          ost.OSS.ost_io.nrs_tbf_rule=start B {mydd.0} 10
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start C {thdd.0} 5"
          ost.OSS.ost_io.nrs_tbf_rule=start C {thdd.0} 5
          
          5. Run dd/mydd/thdd alone (NRS policy: TBF jobid)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 59.2141 s, 18.1 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 103.855 s, 10.3 MB/s
          [root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 199.384 s, 5.4 MB/s
          
          6. Run dd/mydd at the same time on the same client (NRS policy: TBF jobid)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 118.265 s, 9.1 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 120.273 s, 8.9 MB/s
          
          7. Run dd/mydd/thdd at the same time on the same client (NRS policy: TBF jobid)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 198.492 s, 5.4 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 204.857 s, 5.2 MB/s
          [root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 210.522 s, 5.1 MB/s
          

          As we can see from the results, the jobs with higher RPC rate limits are slowed
          down by the job with the lower RPC rate limit.
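
          (In step 4 each rule has the form start <rule name> {<jobid>} <RPC rate limit>, so dd.0, mydd.0
          and thdd.0 are limited to 20, 10 and 5 RPCs per second respectively. Assuming roughly 1 MiB per
          bulk write RPC, which the alone-run numbers in step 5 are consistent with, that corresponds to
          about 20, 10 and 5 MB/s; in the concurrent run of step 7, however, all three jobs drop to about
          5 MB/s, the ceiling of the slowest rule.)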

          The following are the results with the QoS patches (19729 + 19896):

          1. Run dd alone (NRS policy: fifo)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 13.572 s, 79.1 MB/s
          
          2. Run dd/mydd at the same time on the same client (NRS policy: fifo)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 24.0809 s, 44.6 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 24.0959 s, 44.6 MB/s
          
          3. Change the NRS policy to TBF jobid
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
          ost.OSS.ost_io.nrs_policies=tbf jobid
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20" 
          ost.OSS.ost_io.nrs_tbf_rule=start A {dd.0} 20
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"
          ost.OSS.ost_io.nrs_tbf_rule=start B {mydd.0} 10
          [root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start C {thdd.0} 5"
          ost.OSS.ost_io.nrs_tbf_rule=start C {thdd.0} 5
          
          4. Run dd/mydd/thdd alone (NRS policy: TBF jobid)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 58.6623 s, 18.3 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 103.291 s, 10.4 MB/s
          [root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 198.988 s, 5.4 MB/s
          
          5. Run dd/mydd at the same time on the same client (NRS policy: TBF jobid)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 64.5303 s, 16.6 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 114.617 s, 9.4 MB/s
          
          6. Run dd/mydd/thdd at the same time on the same client (NRS policy: TBF jobid)
          [root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 68.6446 s, 15.6 MB/s
          [root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 123.731 s, 8.7 MB/s
          [root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
          1024+0 records in
          1024+0 records out
          1073741824 bytes (1.1 GB) copied, 219.711 s, 4.9 MB/s
          

          As we can see from the results, the jobs with different RPC rate limits got their
          expected RPC rates, and they didn't affect each other.
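
          (In the concurrent run of step 6, for example, dd, mydd and thdd achieve 15.6, 8.7 and
          4.9 MB/s, each close to its own TBF ceiling of roughly 20, 10 and 5 MB/s rather than
          to the slowest job's ceiling.)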

          In addition, when we run dd/dd2/dd3 on the same client, we can monitor the page
          usage as well as the in-flight RPCs used by each JobID. The following is the
          result. As we can see, the cached pages and in-flight RPCs are balanced between
          the job IDs.

          [root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88010b996400/osc_cache_class 
          total: 8192, assigned: 8192, current time: 7514054940465(ns), reclaim time: 0, reclaim interval: 1000000000, in flight write RPC: 10, in flight read RPC: 0
          job_id: "dd.0", used: 2731, max: 2731, reclaim time: 0(ns), in flight write RPC: 3, in flight read RPC: 0
          job_id: "dd2.0", used: 2731, max: 2731, reclaim time: 0(ns), in flight write RPC: 3, in flight read RPC: 0
          job_id: "dd3.0", used: 2730, max: 2730, reclaim time: 0(ns), in flight write RPC: 4, in flight read RPC: 0
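
          As the output shows, the 8192 assigned cache pages are split almost evenly across the
          three active job IDs (2731/2731/2730), as are the ten in-flight write RPCs (3/3/4).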
          

          Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19896
          Subject: LU-7982 osc: qos support for in flight RPC slot usage
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: f67a81043213250ee818e7e6bfb920f8eaaba004


          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19317/
          Subject: LU-7982 libcfs: memory allocation without CPT for binheap
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: cbe5b45a1d157c7345bd1352c257bee22ad8d085


          lixi Li Xi (Inactive) added a comment:

          With the updated version (patch set 2) of 19729, all busy job IDs balance their page usage much more
          quickly than before. That makes me more confident in this design.

          People

            Assignee: lixi_wc Li Xi
            Reporter: lixi Li Xi (Inactive)
            Votes: 0
            Watchers: 11
