[LU-7982] Client side QoS based on jobid - Whamcloud Community JIRA

Details

Type: New Feature
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
- cea
- patch

Description

Lustre has a server-side QoS mechanism based on the NRS TBF policy (LU-3558). The NRS TBF policy can enforce rate limits based on both NID rules and JobID rules. However, when JobID-based TBF rules are used and multiple jobs run on the same client, the RPC rates of those jobs affect each other. More precisely, a job with a high RPC rate limit may actually get a low RPC rate. The reason is that a job with a lower RPC rate limit can exhaust the max-in-flight-RPC limit or the max-cache-pages limit.

To prevent this from happening, a client-side mechanism needs to be added to make the RPC sending mechanism at least fairer for all jobs.

Activity

Li Xi made changes - 18/Sep/18 3:29 AM

Assignee

Original: Emoly Liu [ emoly.liu ]

New: Li Xi [ lixi_wc ]

Jean-Baptiste Riaux (Inactive) made changes - 24/Jun/16 2:42 PM

Labels

Original: patch

New: cea patch

Peter Jones made changes - 07/Jun/16 3:38 PM

End date		New: 16/May/16
Start date		New: 04/Apr/16

Li Xi (Inactive) added a comment - 16/May/16 3:15 AM

I am wondering whether we could add cgroup support for JobID. For example, if obd_jobid_var is "cgroup_path", we would use the cgroup path as the JobID. This avoids duplicated code and would also enable us to implement cgroup support for QoS on the server side in the future.
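
To make the idea concrete, the sketch below shows what a cgroup path looks like from user space. It assumes the standard `/proc/<pid>/cgroup` line format `hierarchy-ID:controller-list:cgroup-path`. The "cgroup_path" value for obd_jobid_var is only the proposal above, and `cgroup_path_from_line` is a hypothetical helper, not part of Lustre.

```shell
# Extract the cgroup path from one line of /proc/<pid>/cgroup.
# Line format: hierarchy-ID:controller-list:cgroup-path
# cgroup_path_from_line is a hypothetical helper, not part of Lustre.
cgroup_path_from_line() {
    # Take everything from the third ':'-separated field onward, so a
    # path that itself contains ':' is preserved.
    printf '%s\n' "$1" | cut -d: -f3-
}

# Example: print the cgroup path(s) of the current shell.
# while IFS= read -r line; do cgroup_path_from_line "$line"; done < /proc/self/cgroup
```

A path such as `/batch/job42` would then play the role that `dd.0` plays in the JobID-based TBF rules.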


Li Xi (Inactive) added a comment - 16/May/16 3:00 AM

Does anybody have time to review these two patches?

http://review.whamcloud.com/#/c/19729/
http://review.whamcloud.com/#/c/19896/

We are also going to work on cgroup support. The current QoS is based on JobID. Since the cgroup path of a task can be obtained with task_cgroup_path(), we should be able to add cgroup support for QoS easily, as we did with NRS TBF JobID/NID.


Li Xi (Inactive) added a comment - 03/May/16 9:01 AM

We've got encouraging results showing that the QoS patches finally work well. To
check the functionality, we need to use the NRS TBF policy on the OSS.

The following are the results without the QoS patches:

1. Run dd alone (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.479 s, 79.7 MB/s

2. Run dd/mydd at the same time on the same client (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 24.6039 s, 43.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 25.3535 s, 42.4 MB/s

3. Run dd/mydd/thdd at the same time on the same client (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.3823 s, 34.2 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 32.4403 s, 33.1 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 34.4943 s, 31.1 MB/s

4. Change the NRS policy to TBF jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
ost.OSS.ost_io.nrs_policies=tbf jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20" 
ost.OSS.ost_io.nrs_tbf_rule=start A {dd.0} 20
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"
ost.OSS.ost_io.nrs_tbf_rule=start B {mydd.0} 10
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start C {thdd.0} 5"
ost.OSS.ost_io.nrs_tbf_rule=start C {thdd.0} 5
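
The three rules above limit dd.0, mydd.0 and thdd.0 to 20, 10 and 5 RPCs per second. As a rough sanity check for the runs that follow, the sketch below converts an RPC rate into an approximate steady-state bandwidth. It assumes each bulk write RPC carries 1 MiB, which is not stated in this ticket, and `tbf_rate_to_mbps` is a hypothetical helper. The figures reported by dd can differ slightly, for example because dd returns while some dirty pages are still in the client cache.

```shell
# Approximate steady-state throughput (in MB/s as dd reports it,
# 1 MB = 10^6 bytes) for a TBF rule limited to $1 RPCs per second.
# Assumption: every RPC carries $2 MiB of data (default 1).
tbf_rate_to_mbps() {
    awk -v rate="$1" -v mib="${2:-1}" \
        'BEGIN { printf "%.1f\n", rate * mib * 1048576 / 1000000 }'
}

for rate in 20 10 5; do
    echo "rate $rate RPC/s ~ $(tbf_rate_to_mbps "$rate") MB/s"
done
```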

5. Run dd/mydd/thdd alone (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 59.2141 s, 18.1 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 103.855 s, 10.3 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 199.384 s, 5.4 MB/s

6. Run dd/mydd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 118.265 s, 9.1 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 120.273 s, 8.9 MB/s

7. Run dd/mydd/thdd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 198.492 s, 5.4 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 204.857 s, 5.2 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 210.522 s, 5.1 MB/s

As we can see from the results, the job IDs with higher RPC rate limits are
slowed down by the job ID with the lowest RPC rate limit.

The following are the results with the QoS patches (19729 + 19896):

1. Run dd alone (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.572 s, 79.1 MB/s

2. Run dd/mydd at the same time on the same client (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 24.0809 s, 44.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 24.0959 s, 44.6 MB/s

3. Change the NRS policy to TBF jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
ost.OSS.ost_io.nrs_policies=tbf jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20" 
ost.OSS.ost_io.nrs_tbf_rule=start A {dd.0} 20
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"
ost.OSS.ost_io.nrs_tbf_rule=start B {mydd.0} 10
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start C {thdd.0} 5"
ost.OSS.ost_io.nrs_tbf_rule=start C {thdd.0} 5

4. Run dd/mydd/thdd alone (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 58.6623 s, 18.3 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 103.291 s, 10.4 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 198.988 s, 5.4 MB/s

5. Run dd/mydd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 64.5303 s, 16.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 114.617 s, 9.4 MB/s

6. Run dd/mydd/thdd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 68.6446 s, 15.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 123.731 s, 8.7 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 219.711 s, 4.9 MB/s

As we can see from the results, the job IDs with different RPC rate limits got
the expected RPC rates, and they didn't affect each other.
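
One way to quantify the difference is to express a job's throughput in the three-way concurrent run as a percentage of what it reached alone under the same TBF rule. The figures below are copied from the dd output in this comment; `pct_of_solo` is a hypothetical helper.

```shell
# Print concurrent throughput as a percentage of standalone throughput.
pct_of_solo() {
    awk -v solo="$1" -v conc="$2" 'BEGIN { printf "%.1f\n", 100 * conc / solo }'
}

# dd (rule A, 20 RPC/s): standalone vs. three-way concurrent run
echo "without patches: $(pct_of_solo 18.1 5.4)%"
echo "with patches:    $(pct_of_solo 18.3 15.6)%"
```

For dd this gives about 30% of its standalone rate without the patches and about 85% with them.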

Also, when we run dd/dd2/dd3 on the same client, we can monitor the page
usage as well as the in-flight RPCs used by each job ID. The result follows.
As we can see, the page cache and in-flight RPCs are balanced between
job IDs.

[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88010b996400/osc_cache_class 
total: 8192, assigned: 8192, current time: 7514054940465(ns), reclaim time: 0, reclaim interval: 1000000000, in flight write RPC: 10, in flight read RPC: 0
job_id: "dd.0", used: 2731, max: 2731, reclaim time: 0(ns), in flight write RPC: 3, in flight read RPC: 0
job_id: "dd2.0", used: 2731, max: 2731, reclaim time: 0(ns), in flight write RPC: 3, in flight read RPC: 0
job_id: "dd3.0", used: 2730, max: 2730, reclaim time: 0(ns), in flight write RPC: 4, in flight read RPC: 0
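
A quick way to check the balance is to sum the per-job counters and compare them with the totals on the first line. The sketch below assumes the `osc_cache_class` output format shown above, which comes from the patches under review and may change. `sum_cache_class` is a hypothetical helper that reads the file contents on stdin.

```shell
# Sum per-job "used" pages and in-flight write RPCs from osc_cache_class
# output read on stdin. Prints: <jobs> <used pages> <in-flight write RPCs>
sum_cache_class() {
    awk -F', ' '/^job_id:/ {
        jobs++
        # Every field after the job ID has the form "key: value".
        for (i = 2; i <= NF; i++) {
            split($i, kv, ": ")
            if (kv[1] == "used") used += kv[2]
            if (kv[1] == "in flight write RPC") wr += kv[2]
        }
    }
    END { print jobs + 0, used + 0, wr + 0 }'
}

# Usage (the OSC directory name differs per mount):
# sum_cache_class < /proc/fs/lustre/osc/OSC_NAME/osc_cache_class
```

For the output above, the three job IDs together account for all 8192 pages and all 10 in-flight write RPCs.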


Gerrit Updater added a comment - 01/May/16 2:52 PM

Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19896
Subject: LU-7982 osc: qos support for in flight RPC slot usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f67a81043213250ee818e7e6bfb920f8eaaba004


Gerrit Updater added a comment - 28/Apr/16 4:23 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19317/
Subject: LU-7982 libcfs: memory allocation without CPT for binheap
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cbe5b45a1d157c7345bd1352c257bee22ad8d085


Li Xi (Inactive) added a comment - 22/Apr/16 3:10 PM

With the updated version (patch set 2) of 19729, all busy job IDs balance their page usage much more quickly than
before. That makes me more confident in this design.


Li Xi (Inactive) added a comment - 22/Apr/16 7:51 AM

Patch 19729 tries to solve the same problem as 19700 in a different way. It
has a much more complex design, maybe too complex. However, it is able to balance
page cache usage between job IDs.

If the page cache usage is balanced from the very beginning, it remains balanced
while all of those job IDs have active I/O:

[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297097
job_id: "dd.0", used: 2731, max: 2731, idle time: 0
job_id: "dd3.0", used: 2731, max: 2731, idle time: 0
job_id: "dd2.0", used: 2730, max: 2730, idle time: 0
[root@server9-Centos6-vm01 qos]# cat parallel.sh 
#!/bin/bash
THREADS=1
rm /mnt/lustre/* -f
for THREAD in `seq $THREADS`; do
        FILE1=/mnt/lustre/file1_$THREAD
        FILE2=/mnt/lustre/file2_$THREAD
        FILE3=/mnt/lustre/file3_$THREAD
        dd if=/dev/zero of=$FILE1 bs=1048576 count=10000 &
        dd2 if=/dev/zero of=$FILE2 bs=1048576 count=10000 &
        dd3 if=/dev/zero of=$FILE3 bs=1048576 count=10000 &
done
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297229
job_id: "dd.0", used: 0, max: 2731, idle time: 4297155
job_id: "dd3.0", used: 0, max: 2731, idle time: 4297155
job_id: "dd2.0", used: 0, max: 2730, idle time: 4297155
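
The 2731/2731/2730 split above is what an even division of the 8192 pages among three job IDs gives. The sketch below only reproduces that arithmetic. It is not taken from patch 19729, and `even_split` is a hypothetical helper.

```shell
# Divide $1 pages as evenly as possible among $2 jobs; the first
# (total mod jobs) jobs get one extra page.
even_split() {
    total=$1 jobs=$2 out="" i=0
    base=$((total / jobs)) extra=$((total % jobs))
    while [ "$i" -lt "$jobs" ]; do
        share=$base
        [ "$i" -lt "$extra" ] && share=$((base + 1))
        out="${out:+$out }$share"
        i=$((i + 1))
    done
    echo "$out"
}

even_split 8192 3    # prints: 2731 2731 2730
```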

Then, if only one job ID is active, it reclaims all of the page cache for itself:

[root@server9-Centos6-vm01 qos]# dd if=/dev/zero of=/mnt/lustre/file1 bs=1048576 count=10000
^C241+0 records in
241+0 records out
252706816 bytes (253 MB) copied, 6.35447 s, 39.8 MB/s

[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8185, current time: 4297294
job_id: "dd.0", used: 0, max: 2746, idle time: 4297292
job_id: "dd3.0", used: 0, max: 2716, idle time: 4297292
job_id: "dd2.0", used: 0, max: 2723, idle time: 4297290
[root@server9-Centos6-vm01 qos]# dd if=/dev/zero of=/mnt/lustre/file1 bs=1048576 count=10000&
[1] 2282
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297303
job_id: "dd.0", used: 2777, max: 2777, idle time: 0
job_id: "dd3.0", used: 0, max: 2700, idle time: 4297302
job_id: "dd2.0", used: 0, max: 2715, idle time: 4297302
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297304
job_id: "dd.0", used: 2777, max: 2777, idle time: 0
job_id: "dd3.0", used: 0, max: 2700, idle time: 4297302
job_id: "dd2.0", used: 0, max: 2715, idle time: 4297302
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297305
job_id: "dd.0", used: 2825, max: 2825, idle time: 0
job_id: "dd3.0", used: 0, max: 2668, idle time: 4297304
job_id: "dd2.0", used: 0, max: 2699, idle time: 4297304
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297307
job_id: "dd.0", used: 2921, max: 2921, idle time: 0
job_id: "dd3.0", used: 0, max: 2604, idle time: 4297306
job_id: "dd2.0", used: 0, max: 2667, idle time: 4297306
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297309
job_id: "dd.0", used: 3113, max: 3113, idle time: 0
job_id: "dd3.0", used: 0, max: 2476, idle time: 4297308
job_id: "dd2.0", used: 0, max: 2603, idle time: 4297308
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297311
job_id: "dd.0", used: 3497, max: 3497, idle time: 0
job_id: "dd3.0", used: 0, max: 2220, idle time: 4297310
job_id: "dd2.0", used: 0, max: 2475, idle time: 4297310
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297312
job_id: "dd.0", used: 4265, max: 4265, idle time: 0
job_id: "dd3.0", used: 0, max: 1708, idle time: 4297312
job_id: "dd2.0", used: 0, max: 2219, idle time: 4297312
[root@server9-Centos6-vm01 qos]# 
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297315
job_id: "dd.0", used: 5801, max: 5801, idle time: 0
job_id: "dd3.0", used: 0, max: 684, idle time: 4297314
job_id: "dd2.0", used: 0, max: 1707, idle time: 4297314
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297316
job_id: "dd.0", used: 5801, max: 5801, idle time: 0
job_id: "dd3.0", used: 0, max: 684, idle time: 4297314
job_id: "dd2.0", used: 0, max: 1707, idle time: 4297314
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297317
job_id: "dd.0", used: 7509, max: 7509, idle time: 0
job_id: "dd3.0", used: 0, max: 0, idle time: 0
job_id: "dd2.0", used: 0, max: 683, idle time: 4297316
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297319
job_id: "dd.0", used: 8192, max: 8192, idle time: 0
job_id: "dd3.0", used: 0, max: 0, idle time: 0
job_id: "dd2.0", used: 0, max: 0, idle time: 0
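
In the samples above, the max of each idle job ID shrinks by an amount that doubles from one step to the next (dd3.0 loses 16, 32, 64 and so on up to 1024 pages, then whatever is left). The sketch below replays that pattern. The doubling rule is inferred from these samples only, not from the patch itself, and `reclaim_trace` is a hypothetical helper.

```shell
# Replay the reclaim of an idle job's page quota: start at $1 pages and
# take away $2 pages first, doubling the amount at every step until the
# quota reaches 0. Prints the quota after each step.
reclaim_trace() {
    max=$1 step=$2 out=""
    while [ "$max" -gt 0 ]; do
        if [ "$step" -ge "$max" ]; then
            max=0                      # last step takes whatever is left
        else
            max=$((max - step))
        fi
        out="${out:+$out }$max"
        step=$((step * 2))
    done
    echo "$out"
}

reclaim_trace 2716 16    # dd3.0: prints 2700 2668 2604 2476 2220 1708 684 0
```

Starting from 2723 with a first step of 8 reproduces the dd2.0 column in the same way.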

Then, if all job IDs start I/O again, the page cache is slowly balanced again:

[root@server9-Centos6-vm01 qos]# sh parallel.sh 
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297447
job_id: "dd.0", used: 8063, max: 8063, idle time: 0
job_id: "dd3.0", used: 65, max: 65, idle time: 0
job_id: "dd2.0", used: 64, max: 64, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297454
job_id: "dd.0", used: 7791, max: 7791, idle time: 0
job_id: "dd3.0", used: 201, max: 201, idle time: 0
job_id: "dd2.0", used: 200, max: 200, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297455
job_id: "dd.0", used: 6400, max: 7728, idle time: 4297455
job_id: "dd3.0", used: 232, max: 232, idle time: 0
job_id: "dd2.0", used: 232, max: 232, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297474
job_id: "dd.0", used: 7161, max: 7161, idle time: 0
job_id: "dd3.0", used: 516, max: 516, idle time: 0
job_id: "dd2.0", used: 515, max: 515, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297511
job_id: "dd.0", used: 0, max: 6446, idle time: 4297503
job_id: "dd3.0", used: 0, max: 872, idle time: 4297503
job_id: "dd2.0", used: 0, max: 874, idle time: 4297504
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297566
job_id: "dd.0", used: 5694, max: 5694, idle time: 0
job_id: "dd3.0", used: 1249, max: 1249, idle time: 0
job_id: "dd2.0", used: 1249, max: 1249, idle time: 0
[root@server9-Centos6-vm01 qos]# sh parallel.sh 
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297601
job_id: "dd.0", used: 5306, max: 5306, idle time: 0
job_id: "dd3.0", used: 1442, max: 1442, idle time: 0
job_id: "dd2.0", used: 1444, max: 1444, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297674
job_id: "dd.0", used: 0, max: 4570, idle time: 4297652
job_id: "dd3.0", used: 0, max: 1809, idle time: 4297653
job_id: "dd2.0", used: 0, max: 1813, idle time: 4297653
[root@server9-Centos6-vm01 qos]# sh parallel.sh 
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297771
job_id: "dd.0", used: 0, max: 3751, idle time: 4297753
job_id: "dd3.0", used: 0, max: 2221, idle time: 4297753
job_id: "dd2.0", used: 0, max: 2220, idle time: 4297753
[root@server9-Centos6-vm01 qos]# sh parallel.sh 
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297798
job_id: "dd.0", used: 3719, max: 3719, idle time: 0
job_id: "dd3.0", used: 2237, max: 2237, idle time: 0
job_id: "dd2.0", used: 2236, max: 2236, idle time: 0
[root@server9-Centos6-vm01 qos]# sh parallel.sh 
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297870
job_id: "dd.0", used: 2886, max: 2886, idle time: 0
job_id: "dd3.0", used: 2653, max: 2653, idle time: 0
job_id: "dd2.0", used: 2653, max: 2653, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297885
job_id: "dd.0", used: 2731, max: 2731, idle time: 0
job_id: "dd3.0", used: 2731, max: 2731, idle time: 0
job_id: "dd2.0", used: 2730, max: 2730, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class 
total: 8192, assigned: 8192, current time: 4297893
job_id: "dd.0", used: 2730, max: 2730, idle time: 0
job_id: "dd3.0", used: 2731, max: 2731, idle time: 0
job_id: "dd2.0", used: 2731, max: 2731, idle time: 0

As you can see, the balancing process is very slow, because the busy job ID spares only one of its pages when an RPC finishes.
This could be optimized in the future to speed up the balancing process.

"dd.0", used: 5306, max: 5306, idle time: 0 job_id: "dd3.0", used: 1442, max: 1442, idle time: 0 job_id: "dd2.0", used: 1444, max: 1444, idle time: 0 [root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class total: 8192, assigned: 8192, current time: 4297674 job_id: "dd.0", used: 0, max: 4570, idle time: 4297652 job_id: "dd3.0", used: 0, max: 1809, idle time: 4297653 job_id: "dd2.0", used: 0, max: 1813, idle time: 4297653 [root@server9-Centos6-vm01 qos]# sh parallel.sh [root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class total: 8192, assigned: 8192, current time: 4297771 job_id: "dd.0", used: 0, max: 3751, idle time: 4297753 job_id: "dd3.0", used: 0, max: 2221, idle time: 4297753 job_id: "dd2.0", used: 0, max: 2220, idle time: 4297753 [root@server9-Centos6-vm01 qos]# sh parallel.sh [root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class total: 8192, assigned: 8192, current time: 4297798 job_id: "dd.0", used: 3719, max: 3719, idle time: 0 job_id: "dd3.0", used: 2237, max: 2237, idle time: 0 job_id: "dd2.0", used: 2236, max: 2236, idle time: 0 [root@server9-Centos6-vm01 qos]# sh parallel.sh [root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class total: 8192, assigned: 8192, current time: 4297870 job_id: "dd.0", used: 2886, max: 2886, idle time: 0 job_id: "dd3.0", used: 2653, max: 2653, idle time: 0 job_id: "dd2.0", used: 2653, max: 2653, idle time: 0 [root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class total: 8192, assigned: 8192, current time: 4297885 job_id: "dd.0", used: 2731, max: 2731, idle time: 0 job_id: "dd3.0", used: 2731, max: 2731, idle time: 0 job_id: "dd2.0", used: 2730, max: 2730, idle time: 0 [root@server9-Centos6-vm01 qos]# cat 
/proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class total: 8192, assigned: 8192, current time: 4297893 job_id: "dd.0", used: 2730, max: 2730, idle time: 0 job_id: "dd3.0", used: 2731, max: 2731, idle time: 0 job_id: "dd2.0", used: 2731, max: 2731, idle time: 0 As you can see, the balance process is very slow, because the busy Job ID will space one of its pages when one RPC finishes. This could be optimized in the future to speed up the balance process though.
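To follow the balancing from a script rather than by eye, the osc_cache_class output can be parsed. The following is a minimal Python sketch, not part of the patch: the function names `parse_snapshot` and `pages_over_even_share` are hypothetical, the line formats are copied from the output shown in this comment, and the even split of `total` across Job IDs is an assumption based on the final snapshots (2730/2731/2731). If the busy Job ID really gives up one page per finished RPC, the surplus it reports is a rough estimate of how many RPC completions are still needed.

```python
import re

# Line formats copied from the osc_cache_class output in this comment.
HEADER_RE = re.compile(r'total: (\d+), assigned: (\d+), current time: (\d+)')
JOB_RE = re.compile(r'job_id: "([^"]+)", used: (\d+), max: (\d+), idle time: (\d+)')


def parse_snapshot(text):
    """Parse ONE osc_cache_class snapshot (hypothetical helper)."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("snapshot header not found")
    total, assigned, now = (int(v) for v in header.groups())
    jobs = {
        job_id: {"used": int(used), "max": int(limit), "idle_time": int(idle)}
        for job_id, used, limit, idle in JOB_RE.findall(text)
    }
    return {"total": total, "assigned": assigned, "time": now, "jobs": jobs}


def pages_over_even_share(snapshot):
    """Pages by which each job's max exceeds an even split of total.

    Assumption: the balancer aims for an even split between Job IDs.
    """
    jobs = snapshot["jobs"]
    if not jobs:
        return {}
    even = snapshot["total"] / len(jobs)
    return {job_id: max(0, round(job["max"] - even)) for job_id, job in jobs.items()}


# First snapshot taken after "sh parallel.sh" in the log above.
sample = '''total: 8192, assigned: 8192, current time: 4297447
job_id: "dd.0", used: 8063, max: 8063, idle time: 0
job_id: "dd3.0", used: 65, max: 65, idle time: 0
job_id: "dd2.0", used: 64, max: 64, idle time: 0'''

snap = parse_snapshot(sample)
print(pages_over_even_share(snap))
```

For that snapshot, "dd.0" is still several thousand pages above an even share, which is consistent with the rebalancing taking hundreds of time units in the log.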

People

Assignee: Li Xi

Reporter: Li Xi (Inactive)

Votes: 0

Watchers: 11

Dates

Created: 04/Apr/16 7:30 PM

Updated: 18/Sep/18 3:29 AM