[LU-3277] LU-2139 may cause a performance regression Created: 05/May/13 Updated: 11/Sep/13 Resolved: 11/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Prakash Surya (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: | RHEL6.3 and current master |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 8114 |
| Description |
|
There is a performance regression on the current master (c864582b5d4541c7830d628457e55cd859aee005) if we have multiple IOR threads per client. As far as I can tell from testing:

client: commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c

# mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Sun May 5 12:24:09 2013
Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /lustre/ior.out
FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
Participating tasks: 8
Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
task 0 on s08
task 1 on s08
task 2 on s08
task 3 on s08
task 4 on s08
task 5 on s08
task 6 on s08
task 7 on s08

Summary:
        api                = POSIX
        test filename      = /lustre/ior.out/file
        access             = file-per-process
        pattern            = segmented (1 segment)
        ordering in a file = sequential offsets
        ordering inter file= constant task offsets = 1
        clients            = 8 (8 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 8 GiB
        aggregate filesize = 64 GiB

Using Time Stamp 1367781849 (0x5186b1d9) for Data Signature
Commencing write performance test.
Sun May 5 12:24:09 2013

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     3228.38    8388608    1024.00    0.001871   20.30      1.34       20.30      0    XXCEL

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev  Mean (s)  Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------  --------
write      3228.38    3228.38    3228.38     0.00     3228.38    3228.38    3228.38     0.00     20.29996  8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL

Max Write: 3228.38 MiB/sec (3385.20 MB/sec)

Run finished: Sun May 5 12:24:30 2013

client: commit 5661651b2cc6414686e7da581589c2ea0e1f1969

# mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Sun May 5 12:16:35 2013
Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /lustre/ior.out
FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
Participating tasks: 8
Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
task 0 on s08
task 1 on s08
task 2 on s08
task 3 on s08
task 4 on s08
task 5 on s08
task 6 on s08
task 7 on s08

Summary:
        api                = POSIX
        test filename      = /lustre/ior.out/file
        access             = file-per-process
        pattern            = segmented (1 segment)
        ordering in a file = sequential offsets
        ordering inter file= constant task offsets = 1
        clients            = 8 (8 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 8 GiB
        aggregate filesize = 64 GiB

Using Time Stamp 1367781395 (0x5186b013) for Data Signature
Commencing write performance test.
Sun May 5 12:16:35 2013

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     550.28     8388608    1024.00    0.001730   119.10     2.76       119.10     0    XXCEL

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev  Mean (s)  Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------  --------
write      550.28     550.28     550.28      0.00     550.28     550.28     550.28      0.00     119.09623 8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL

Max Write: 550.28 MiB/sec (577.01 MB/sec)

Run finished: Sun May 5 12:18:34 2013

In both tests, the servers were running the current master (c864582b5d4541c7830d628457e55cd859aee005). |
| Comments |
| Comment by Shuichi Ihara (Inactive) [ 05/May/13 ] |
|
I reverted commit 5661651b2cc6414686e7da581589c2ea0e1f1969 from the current master (c864582b5d4541c7830d628457e55cd859aee005), ran IOR again, and saw results similar to commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c.

IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Sun May 5 12:41:11 2013
Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /lustre/ior.out
FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
Participating tasks: 8
Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
task 0 on s08
task 1 on s08
task 2 on s08
task 3 on s08
task 4 on s08
task 5 on s08
task 6 on s08
task 7 on s08

Summary:
        api                = POSIX
        test filename      = /lustre/ior.out/file
        access             = file-per-process
        pattern            = segmented (1 segment)
        ordering in a file = sequential offsets
        ordering inter file= constant task offsets = 1
        clients            = 8 (8 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 8 GiB
        aggregate filesize = 64 GiB

Using Time Stamp 1367782871 (0x5186b5d7) for Data Signature
Commencing write performance test.
Sun May 5 12:41:11 2013

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     3143.64    8388608    1024.00    0.001566   20.85      1.97       20.85      0    XXCEL

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev  Mean (s)  Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------  --------
write      3143.64    3143.64    3143.64     0.00     3143.64    3143.64    3143.64     0.00     20.84715  8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL

Max Write: 3143.64 MiB/sec (3296.35 MB/sec)

Run finished: Sun May 5 12:41:32 2013 |
| Comment by Niu Yawei (Inactive) [ 06/May/13 ] |
|
The commit 5661651b2cc6414686e7da581589c2ea0e1f1969 is from

@@ -1463,7 +1465,8 @@ static int osc_enter_cache_try(struct client_obd *cli,
                return 0;

        if (cli->cl_dirty + CFS_PAGE_SIZE <= cli->cl_dirty_max &&
-           cfs_atomic_read(&obd_dirty_pages) + 1 <= obd_max_dirty_pages) {
+           cfs_atomic_read(&obd_unstable_pages) + 1 +
+           cfs_atomic_read(&obd_dirty_pages) <= obd_max_dirty_pages) {
                osc_consume_write_grant(cli, &oap->oap_brw_page);
                if (transient) {
                        cli->cl_dirty_transit += CFS_PAGE_SIZE;
@@ -1576,9 +1579,9 @@ void osc_wake_cache_waiters(struct client_obd *cli)
                        ocw->ocw_rc = -EDQUOT;
                        /* we can't dirty more */
-                       if ((cli->cl_dirty + CFS_PAGE_SIZE > cli->cl_dirty_max) ||
-                           (cfs_atomic_read(&obd_dirty_pages) + 1 >
-                            obd_max_dirty_pages)) {
+                       if (cli->cl_dirty + CFS_PAGE_SIZE > cli->cl_dirty_max ||
+                           cfs_atomic_read(&obd_unstable_pages) + 1 +
+                           cfs_atomic_read(&obd_dirty_pages) > obd_max_dirty_pages) {
                                CDEBUG(D_CACHE, "no dirty room: dirty: %ld "
                                       "osc max %ld, sys max %d\n", cli->cl_dirty,
                                       cli->cl_dirty_max, obd_max_dirty_pages);

I think the title of this ticket should be changed as " |
| Comment by Jodi Levi (Inactive) [ 06/May/13 ] |
|
Oleg will revert the patch in |
| Comment by Peter Jones [ 06/May/13 ] |
|
The patch has now been reverted. Ihara, can you confirm whether the problem has now disappeared? |
| Comment by Prakash Surya (Inactive) [ 06/May/13 ] |
|
How much memory is there on the client? And what is the commit frequency on the servers? I would expect the performance to be worse with the patch Niu points to if the client has sufficient bandwidth (combined with async server commits) to fill its available dirty page space with unstable pages. So this performance regression might be "working as intended", but that all depends on how many unstable pages the client is accumulating during the test.

Can you sample lctl get_param 'llite.*.unstable_stats' and grep NFS_Unstable /proc/meminfo a few times while the test is running, to give me an idea of how many unstable pages are being consumed? If this value is anywhere near the limit set in /proc/sys/lustre/max_dirty_mb, then maybe we need to rethink the default value of max_dirty_mb and set it to something larger.
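Something along these lines would do for the sampling (a minimal sketch; the 5-second interval and ~5-minute duration are arbitrary, adjust them to cover the IOR run):

# Sample Lustre's unstable page stats and the kernel's NFS_Unstable counter
# every 5 seconds, 60 times (~5 minutes), while IOR is running.
for i in $(seq 1 60); do
    date
    lctl get_param 'llite.*.unstable_stats'
    grep NFS_Unstable /proc/meminfo
    sleep 5
done
|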
| Comment by Shuichi Ihara (Inactive) [ 08/May/13 ] |
|
Peter, yes, the performance is back with the latest commit (2.3.65). However, I'm hitting another issue with IOR; I will investigate, and if it turns out to be a different problem I will open a new ticket. |
| Comment by Shuichi Ihara (Inactive) [ 08/May/13 ] |
|
Prakash, attached are the llite.*.unstable_stats and grep NFS_Unstable /proc/meminfo output collected during IOR. I also collected a collectl log and found that the client is not writing the whole time; it keeps cycling between writing and idling (writing, idle, writing, idle, ...). Our client has 64GB of memory, and I also tested a larger max_dirty_mb (more than 3/4 of memory, up from the default of 1/2), but that didn't help either.
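For reference, raising the cap and watching the cycling can be done along these lines (a sketch only; 49152 MB is roughly 3/4 of a 64GB client's memory, and the grep simply tracks the meminfo counters relevant to the write/idle pattern):

# Raise the client-wide dirty/unstable cap to ~3/4 of RAM on a 64GB node.
echo 49152 > /proc/sys/lustre/max_dirty_mb

# Watch the writeback-related counters while IOR runs.
while true; do grep -E 'Dirty|Writeback|NFS_Unstable' /proc/meminfo; sleep 2; done
|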
| Comment by Prakash Surya (Inactive) [ 08/May/13 ] |
|
Thanks, I'll give these a look.

OK, that sounds like exactly what I saw during my testing when I was hitting the kernel's dirty page limits. Allow me to explain.

With that patch in place, we're now properly informing the kernel of our unstable pages by incrementing and decrementing the NFS_Unstable zone page counter. You can see this by watching the NFS_Unstable field in /proc/meminfo (before the patch this will always be zero; after the patch it will fluctuate with Lustre IO). So that's all well and good, but how does it relate to the idle time seen during the IO? What I think is happening is that the newly accounted-for unstable pages are being factored in when the kernel calls balance_dirty_pages; balance_dirty_pages then determines that the system is out of dirty page headroom and sleeps, waiting for writeback to flush the dirty pages to disk. You can verify this theory by dumping the stacks of all processes when the IO is stalled and checking whether any of the write threads are stuck sleeping in balance_dirty_pages.

What do these files show on your system: /proc/sys/vm/dirty_background_bytes, /proc/sys/vm/dirty_background_ratio, /proc/sys/vm/dirty_bytes, /proc/sys/vm/dirty_ratio? I think the dirty limit on the system is calculated as dirty_ratio * available_memory when dirty_bytes is 0. So in your case the limit is about 12.8 GB (assuming a dirty_ratio of 20, which is the default on my system, and dirty_bytes of 0). Seeing as the value of NFS_Unstable in your logs hovers around 11556864 kB, it's plausible that dirty + unstable is hitting the limit.

If my above hypothesis is correct, the behavior you were seeing was expected and working as designed. The same problem would occur if you could push NFS at the same rates. If you had the full

If you can, try setting dirty_ratio and max_dirty_mb to a large fraction of memory and rerun the test.
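For what it's worth, a quick way to check those knobs and approximate the resulting limit is something like the following (a rough sketch; it assumes dirty_bytes is 0 so the ratio applies, and it ignores the kernel's adjustments to "available memory", so treat the number as a ballpark):

# Show the writeback tunables.
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_bytes \
    /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_background_bytes

# Approximate global dirty limit in MB: MemTotal * dirty_ratio / 100 (valid only when dirty_bytes == 0).
awk -v ratio="$(cat /proc/sys/vm/dirty_ratio)" \
    '/MemTotal/ {printf "approx dirty limit: %.1f MB\n", $2 * ratio / 100 / 1024}' /proc/meminfo

# Lustre's own client-side cap on dirty (and now unstable) pages, in MB.
cat /proc/sys/lustre/max_dirty_mb

On a 64GB client with the default dirty_ratio of 20, that works out to roughly 12.8 GB, which lines up with the NFS_Unstable plateau in the attached logs.
|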
| Comment by Jodi Levi (Inactive) [ 08/May/13 ] |
|
Lowering priority as |
| Comment by Prakash Surya (Inactive) [ 08/May/13 ] |
|
In case it proves useful, here's an example stack for a thread waiting in the state I described in my previous comment:

fio           S 00000fffae72633c     0 59338  59283 0x00000000
Call Trace:
[c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
[c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
[c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
[c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
[c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
[c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
[c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
[c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
[c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
[c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
[c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
[c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
[c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
[c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
[c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
[c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
[c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
[c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c
|
| Comment by Prakash Surya (Inactive) [ 11/Sep/13 ] |
|
Since I've been assigned to this, I'm marking it resolved. The "bad patch" was reverted, and there have been no reports of this since, which leads me to believe it is no longer an issue. Feel free to reopen if there is a compelling case to do so. General discussion of the |