[LU-7982] Client side QoS based on jobid Created: 04/Apr/16 Updated: 18/Sep/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Li Xi (Inactive) | Assignee: | Li Xi |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | cea, patch | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Lustre has a server-side QoS mechanism based on the NRS TBF policy. However, when multiple jobs run on the same client, they share that client's in-flight RPC slots and page cache, so one job can still interfere with the others and defeat the rates configured on the server. In order to prevent this from happening, a client-side mechanism needs to be added to make the RPC sending mechanism at least more fair for all jobs. |
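For reference, the server-side TBF policy mentioned above is enabled and configured per job ID roughly as follows (these are the same commands used in the test results further down; the rule names and rates are only examples):
lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20"
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"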
| Comments |
| Comment by Gerrit Updater [ 04/Apr/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19317 |
| Comment by Gerrit Updater [ 04/Apr/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19319 |
| Comment by Peter Jones [ 05/Apr/16 ] |
|
Emoly, could you please review these patches? Thanks, Peter |
| Comment by Gerrit Updater [ 21/Apr/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19700 |
| Comment by Li Xi (Inactive) [ 21/Apr/16 ] |
|
The patch 19319 is trying to solve the problem of the in-flight-RPC limitation. Let's assume two processes with two different job IDs (Job1 and Job2) are running on the same client, and the server-side TBF rules give them RPC rates of R1 and R2 (R1 > R2). The expected result is that Job1 gets a rate of R1 and Job2 gets a rate of R2. However, the actual result is that both processes have an RPC rate of R2, because the in-flight RPC slots on the client are shared by all jobs, so the slower job blocks the RPCs of the faster one. With that patch, since the in-flight RPCs are balanced between jobs, the behavior of the RPC rates gets better, but a new problem shows up in the page cache shared by the jobs on the OSC. The patch 19700 is trying to solve this new problem. If both Job1 and Job2 are writing through the same OSC, they compete for the same pool of cached pages. Let's assume all the pages are occupied. Most of the time, the processes of the faster job then have to wait for pages to be freed by the slower job, so the jobs are again dragged down towards the slower rate. |
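For background, both limits discussed above are per OSC on the client and shared by every job running there; they can be inspected with the standard client tunables (parameter names only, actual values depend on the setup):
lctl get_param osc.lustre-OST0000-osc-*.max_rpcs_in_flight
lctl get_param osc.lustre-OST0000-osc-*.max_dirty_mb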
| Comment by Gerrit Updater [ 22/Apr/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19729 |
| Comment by Li Xi (Inactive) [ 22/Apr/16 ] |
|
The patch 19729 tries to solve the same problem as 19700 in a different way. It classifies the client page cache usage per job ID (see the osc_cache_class output below). If the page cache usage is balanced from the very beginning, it will remain balanced:
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297097
job_id: "dd.0", used: 2731, max: 2731, idle time: 0
job_id: "dd3.0", used: 2731, max: 2731, idle time: 0
job_id: "dd2.0", used: 2730, max: 2730, idle time: 0
[root@server9-Centos6-vm01 qos]# cat parallel.sh
#!/bin/bash
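# dd2 and dd3 are assumed to be copies of the dd binary renamed so that each
# stream gets its own JobID (dd.0, dd2.0, dd3.0 in the osc_cache_class output).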
THREADS=1
rm /mnt/lustre/* -f
for THREAD in `seq $THREADS`; do
FILE1=/mnt/lustre/file1_$THREAD
FILE2=/mnt/lustre/file2_$THREAD
FILE3=/mnt/lustre/file3_$THREAD
dd if=/dev/zero of=$FILE1 bs=1048576 count=10000 &
dd2 if=/dev/zero of=$FILE2 bs=1048576 count=10000 &
dd3 if=/dev/zero of=$FILE3 bs=1048576 count=10000 &
done
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297229
job_id: "dd.0", used: 0, max: 2731, idle time: 4297155
job_id: "dd3.0", used: 0, max: 2731, idle time: 4297155
job_id: "dd2.0", used: 0, max: 2730, idle time: 4297155
And then, if only one job ID is active, it will reclaim all the page caches to itself:
[root@server9-Centos6-vm01 qos]# dd if=/dev/zero of=/mnt/lustre/file1 bs=1048576 count=10000
^C241+0 records in
241+0 records out
252706816 bytes (253 MB) copied, 6.35447 s, 39.8 MB/s
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8185, current time: 4297294
job_id: "dd.0", used: 0, max: 2746, idle time: 4297292
job_id: "dd3.0", used: 0, max: 2716, idle time: 4297292
job_id: "dd2.0", used: 0, max: 2723, idle time: 4297290
[root@server9-Centos6-vm01 qos]# dd if=/dev/zero of=/mnt/lustre/file1 bs=1048576 count=10000&
[1] 2282
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297303
job_id: "dd.0", used: 2777, max: 2777, idle time: 0
job_id: "dd3.0", used: 0, max: 2700, idle time: 4297302
job_id: "dd2.0", used: 0, max: 2715, idle time: 4297302
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297304
job_id: "dd.0", used: 2777, max: 2777, idle time: 0
job_id: "dd3.0", used: 0, max: 2700, idle time: 4297302
job_id: "dd2.0", used: 0, max: 2715, idle time: 4297302
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297305
job_id: "dd.0", used: 2825, max: 2825, idle time: 0
job_id: "dd3.0", used: 0, max: 2668, idle time: 4297304
job_id: "dd2.0", used: 0, max: 2699, idle time: 4297304
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297307
job_id: "dd.0", used: 2921, max: 2921, idle time: 0
job_id: "dd3.0", used: 0, max: 2604, idle time: 4297306
job_id: "dd2.0", used: 0, max: 2667, idle time: 4297306
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297309
job_id: "dd.0", used: 3113, max: 3113, idle time: 0
job_id: "dd3.0", used: 0, max: 2476, idle time: 4297308
job_id: "dd2.0", used: 0, max: 2603, idle time: 4297308
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297311
job_id: "dd.0", used: 3497, max: 3497, idle time: 0
job_id: "dd3.0", used: 0, max: 2220, idle time: 4297310
job_id: "dd2.0", used: 0, max: 2475, idle time: 4297310
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297312
job_id: "dd.0", used: 4265, max: 4265, idle time: 0
job_id: "dd3.0", used: 0, max: 1708, idle time: 4297312
job_id: "dd2.0", used: 0, max: 2219, idle time: 4297312
[root@server9-Centos6-vm01 qos]#
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297315
job_id: "dd.0", used: 5801, max: 5801, idle time: 0
job_id: "dd3.0", used: 0, max: 684, idle time: 4297314
job_id: "dd2.0", used: 0, max: 1707, idle time: 4297314
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297316
job_id: "dd.0", used: 5801, max: 5801, idle time: 0
job_id: "dd3.0", used: 0, max: 684, idle time: 4297314
job_id: "dd2.0", used: 0, max: 1707, idle time: 4297314
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297317
job_id: "dd.0", used: 7509, max: 7509, idle time: 0
job_id: "dd3.0", used: 0, max: 0, idle time: 0
job_id: "dd2.0", used: 0, max: 683, idle time: 4297316
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297319
job_id: "dd.0", used: 8192, max: 8192, idle time: 0
job_id: "dd3.0", used: 0, max: 0, idle time: 0
job_id: "dd2.0", used: 0, max: 0, idle time: 0
And then, if all job IDs start I/O again, the page cache will slowly become balanced again:
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297447
job_id: "dd.0", used: 8063, max: 8063, idle time: 0
job_id: "dd3.0", used: 65, max: 65, idle time: 0
job_id: "dd2.0", used: 64, max: 64, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297454
job_id: "dd.0", used: 7791, max: 7791, idle time: 0
job_id: "dd3.0", used: 201, max: 201, idle time: 0
job_id: "dd2.0", used: 200, max: 200, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297455
job_id: "dd.0", used: 6400, max: 7728, idle time: 4297455
job_id: "dd3.0", used: 232, max: 232, idle time: 0
job_id: "dd2.0", used: 232, max: 232, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297474
job_id: "dd.0", used: 7161, max: 7161, idle time: 0
job_id: "dd3.0", used: 516, max: 516, idle time: 0
job_id: "dd2.0", used: 515, max: 515, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297511
job_id: "dd.0", used: 0, max: 6446, idle time: 4297503
job_id: "dd3.0", used: 0, max: 872, idle time: 4297503
job_id: "dd2.0", used: 0, max: 874, idle time: 4297504
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297566
job_id: "dd.0", used: 5694, max: 5694, idle time: 0
job_id: "dd3.0", used: 1249, max: 1249, idle time: 0
job_id: "dd2.0", used: 1249, max: 1249, idle time: 0
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297601
job_id: "dd.0", used: 5306, max: 5306, idle time: 0
job_id: "dd3.0", used: 1442, max: 1442, idle time: 0
job_id: "dd2.0", used: 1444, max: 1444, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297674
job_id: "dd.0", used: 0, max: 4570, idle time: 4297652
job_id: "dd3.0", used: 0, max: 1809, idle time: 4297653
job_id: "dd2.0", used: 0, max: 1813, idle time: 4297653
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297771
job_id: "dd.0", used: 0, max: 3751, idle time: 4297753
job_id: "dd3.0", used: 0, max: 2221, idle time: 4297753
job_id: "dd2.0", used: 0, max: 2220, idle time: 4297753
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297798
job_id: "dd.0", used: 3719, max: 3719, idle time: 0
job_id: "dd3.0", used: 2237, max: 2237, idle time: 0
job_id: "dd2.0", used: 2236, max: 2236, idle time: 0
[root@server9-Centos6-vm01 qos]# sh parallel.sh
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297870
job_id: "dd.0", used: 2886, max: 2886, idle time: 0
job_id: "dd3.0", used: 2653, max: 2653, idle time: 0
job_id: "dd2.0", used: 2653, max: 2653, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297885
job_id: "dd.0", used: 2731, max: 2731, idle time: 0
job_id: "dd3.0", used: 2731, max: 2731, idle time: 0
job_id: "dd2.0", used: 2730, max: 2730, idle time: 0
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff880109d20c00/osc_cache_class
total: 8192, assigned: 8192, current time: 4297893
job_id: "dd.0", used: 2730, max: 2730, idle time: 0
job_id: "dd3.0", used: 2731, max: 2731, idle time: 0
job_id: "dd2.0", used: 2731, max: 2731, idle time: 0
As you can see, the balancing process is very slow, because a busy job ID will only spare one of its pages when one of its RPCs finishes. |
| Comment by Li Xi (Inactive) [ 22/Apr/16 ] |
|
With the updated version (patch set 2) of 19729, all busy job IDs will balance their page usages much more quickly than before. |
| Comment by Gerrit Updater [ 28/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19317/ |
| Comment by Gerrit Updater [ 01/May/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/19896 |
| Comment by Li Xi (Inactive) [ 03/May/16 ] |
|
We've got encouraging results showing that the QoS patches finally work well. In order to verify this, we ran dd, mydd and thdd (which get different job IDs) against TBF rules with different rates, both without and with the QoS patches on the client. The following are the results without the QoS patches:
1. Run dd alone (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.479 s, 79.7 MB/s
2. Run dd/mydd at the same time on the same client (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 24.6039 s, 43.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 25.3535 s, 42.4 MB/s
3. Run dd/mydd/thdd at the same time on the same client (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.3823 s, 34.2 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 32.4403 s, 33.1 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 34.4943 s, 31.1 MB/s
4. Change the NRS policy to TBF jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
ost.OSS.ost_io.nrs_policies=tbf jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20"
ost.OSS.ost_io.nrs_tbf_rule=start A {dd.0} 20
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"
ost.OSS.ost_io.nrs_tbf_rule=start B {mydd.0} 10
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start C {thdd.0} 5"
ost.OSS.ost_io.nrs_tbf_rule=start C {thdd.0} 5
5. Run dd/mydd/thdd alone (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 59.2141 s, 18.1 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 103.855 s, 10.3 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 199.384 s, 5.4 MB/s
6. Run dd/mydd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 118.265 s, 9.1 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 120.273 s, 8.9 MB/s
7. Run dd/mydd/thdd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 198.492 s, 5.4 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 204.857 s, 5.2 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 210.522 s, 5.1 MB/s
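For reference, assuming each bulk write RPC carries roughly 1 MiB of data on this setup (an assumption, not stated above), TBF rates of 20/10/5 RPCs per second correspond to roughly 20/10/5 MB/s, which matches the 18.1/10.3/5.4 MB/s measured in step 5.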
As we can see from the results, the job IDs with higher RPC rates are affected badly when multiple jobs run at the same time on the same client: instead of keeping the rates defined by their TBF rules, they are dragged down close to the rate of the slowest job. The following are the results with the QoS patches (19729 + 19896):
1. Run dd alone (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.572 s, 79.1 MB/s
2. Run dd/mydd at the same time on the same client (NRS policy: fifo)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 24.0809 s, 44.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 24.0959 s, 44.6 MB/s
3. Change the NRS policy to TBF jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
ost.OSS.ost_io.nrs_policies=tbf jobid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start A {dd.0} 20"
ost.OSS.ost_io.nrs_tbf_rule=start A {dd.0} 20
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start B {mydd.0} 10"
ost.OSS.ost_io.nrs_tbf_rule=start B {mydd.0} 10
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start C {thdd.0} 5"
ost.OSS.ost_io.nrs_tbf_rule=start C {thdd.0} 5
4. Run dd/mydd/thdd alone (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 58.6623 s, 18.3 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 103.291 s, 10.4 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 198.988 s, 5.4 MB/s
5. Run dd/mydd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 64.5303 s, 16.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 114.617 s, 9.4 MB/s
6. Run dd/mydd/thdd at the same time on the same client (NRS policy: TBF jobid)
[root@QYJ home]# dd if=/dev/zero of=/mnt/lustre/t1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 68.6446 s, 15.6 MB/s
[root@QYJ Desktop]# mydd if=/dev/zero of=/mnt/lustre/t2 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 123.731 s, 8.7 MB/s
[root@QYJ Desktop]# thdd if=/dev/zero of=/mnt/lustre/t3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 219.711 s, 4.9 MB/s
As we can see from the results, the job IDs with different RPC rates got roughly the performance defined by their TBF rules, even when running at the same time on the same client. And also, when we run dd/dd2/dd3 on the same client, we can monitor the page cache usage and in-flight RPCs of each job ID:
[root@server9-Centos6-vm01 qos]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88010b996400/osc_cache_class
total: 8192, assigned: 8192, current time: 7514054940465(ns), reclaim time: 0, reclaim interval: 1000000000, in flight write RPC: 10, in flight read RPC: 0
job_id: "dd.0", used: 2731, max: 2731, reclaim time: 0(ns), in flight write RPC: 3, in flight read RPC: 0
job_id: "dd2.0", used: 2731, max: 2731, reclaim time: 0(ns), in flight write RPC: 3, in flight read RPC: 0
job_id: "dd3.0", used: 2730, max: 2730, reclaim time: 0(ns), in flight write RPC: 4, in flight read RPC: 0 |
| Comment by Li Xi (Inactive) [ 16/May/16 ] |
|
Does anybody have some time to review these two patches? http://review.whamcloud.com/#/c/19729/ Also, we are going to work on cgroup support. The current QoS is based on JobID. Since the cgroup path of a task can be obtained via task_cgroup_path(), we should be able to add cgroup support for QoS easily, like what we did with NRS TBF jobID/NID. |
| Comment by Li Xi (Inactive) [ 16/May/16 ] |
|
I am wondering whether we could add cgroup support for JobID. For example, if obd_jobid_var is "cgroup_path", we use the path of the cgroup as the JobID. This prevents duplicated code, and also enables us to implement cgroup support for QoS on the server side in the future. |
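As a rough illustration of the idea (the cgroup_path value is hypothetical and not implemented; procname_uid is the existing setting that produces the dd.0-style job IDs seen above):
lctl set_param jobid_var=procname_uid   # current: JobID derived from process name + UID
lctl set_param jobid_var=cgroup_path    # proposed (hypothetical): JobID derived from the task's cgroup path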