[LU-3277] LU-2139 may cause the performance regression Created: 05/May/13  Updated: 11/Sep/13  Resolved: 11/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Major
Reporter: Shuichi Ihara (Inactive) Assignee: Prakash Surya (Inactive)
Resolution: Fixed Votes: 0
Labels: HB
Environment:

RHEL6.3 and current master


Attachments: Text File collectl.log     Text File stat.log    
Issue Links:
Blocker
is blocking LU-2139 Tracking unstable pages Resolved
Related
is related to LU-2576 Hangs in osc_enter_cache due to dirty... Resolved
is related to LU-2139 Tracking unstable pages Resolved
Severity: 2
Rank (Obsolete): 8114

 Description   

There is a performance regression on current master (c864582b5d4541c7830d628457e55cd859aee005) if we have multiple IOR threads per client. As far as I can tell from testing, LU-2576 might be causing this performance regression. Here are quick test results for each commit.

client : commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c

# mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Sun May  5 12:24:09 2013
Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /lustre/ior.out
FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
Participating tasks: 8
Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
task 0 on s08
task 1 on s08
task 2 on s08
task 3 on s08
task 4 on s08
task 5 on s08
task 6 on s08
task 7 on s08

Summary:
	api                = POSIX
	test filename      = /lustre/ior.out/file
	access             = file-per-process
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file=constant task offsets = 1
	clients            = 8 (8 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 8 GiB
	aggregate filesize = 64 GiB

Using Time Stamp 1367781849 (0x5186b1d9) for Data Signature
Commencing write performance test.
Sun May  5 12:24:09 2013

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
------    ---------  ---------- ---------  --------   --------   --------  --------   ----
write     3228.38    8388608    1024.00    0.001871   20.30      1.34       20.30      0    XXCEL
Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write        3228.38    3228.38     3228.38      0.00    3228.38    3228.38     3228.38      0.00  20.29996   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL

Max Write: 3228.38 MiB/sec (3385.20 MB/sec)

Run finished: Sun May  5 12:24:30 2013

client : commit 5661651b2cc6414686e7da581589c2ea0e1f1969

# mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Sun May  5 12:16:35 2013
Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /lustre/ior.out
FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
Participating tasks: 8
Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
task 0 on s08
task 1 on s08
task 2 on s08
task 3 on s08
task 4 on s08
task 5 on s08
task 6 on s08
task 7 on s08

Summary:
	api                = POSIX
	test filename      = /lustre/ior.out/file
	access             = file-per-process
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file=constant task offsets = 1
	clients            = 8 (8 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 8 GiB
	aggregate filesize = 64 GiB

Using Time Stamp 1367781395 (0x5186b013) for Data Signature
Commencing write performance test.
Sun May  5 12:16:35 2013

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
------    ---------  ---------- ---------  --------   --------   --------  --------   ----
write     550.28     8388608    1024.00    0.001730   119.10     2.76       119.10     0    XXCEL
Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write         550.28     550.28      550.28      0.00     550.28     550.28      550.28      0.00 119.09623   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL

Max Write: 550.28 MiB/sec (577.01 MB/sec)

Run finished: Sun May  5 12:18:34 2013

In both tests, the servers were running current master (c864582b5d4541c7830d628457e55cd859aee005).



 Comments   
Comment by Shuichi Ihara (Inactive) [ 05/May/13 ]

I reverted commit 5661651b2cc6414686e7da581589c2ea0e1f1969 from current master (c864582b5d4541c7830d628457e55cd859aee005), ran IOR again, and saw results similar to those with commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c.

IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Sun May  5 12:41:11 2013
Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /lustre/ior.out
FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
Participating tasks: 8
Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
task 0 on s08
task 1 on s08
task 2 on s08
task 3 on s08
task 4 on s08
task 5 on s08
task 6 on s08
task 7 on s08

Summary:
	api                = POSIX
	test filename      = /lustre/ior.out/file
	access             = file-per-process
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file=constant task offsets = 1
	clients            = 8 (8 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 8 GiB
	aggregate filesize = 64 GiB

Using Time Stamp 1367782871 (0x5186b5d7) for Data Signature
Commencing write performance test.
Sun May  5 12:41:11 2013

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
------    ---------  ---------- ---------  --------   --------   --------  --------   ----
write     3143.64    8388608    1024.00    0.001566   20.85      1.97       20.85      0    XXCEL
Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write        3143.64    3143.64     3143.64      0.00    3143.64    3143.64     3143.64      0.00  20.84715   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL

Max Write: 3143.64 MiB/sec (3296.35 MB/sec)

Run finished: Sun May  5 12:41:32 2013
Comment by Niu Yawei (Inactive) [ 06/May/13 ]

Commit 5661651b2cc6414686e7da581589c2ea0e1f1969 is from LU-2139, which added unstable page accounting to Lustre; the following code changes could cause many more synchronous writes.

@@ -1463,7 +1465,8 @@ static int osc_enter_cache_try(struct client_obd *cli,
                return 0;

        if (cli->cl_dirty + CFS_PAGE_SIZE <= cli->cl_dirty_max &&
-           cfs_atomic_read(&obd_dirty_pages) + 1 <= obd_max_dirty_pages) {
+           cfs_atomic_read(&obd_unstable_pages) + 1 +
+           cfs_atomic_read(&obd_dirty_pages) <= obd_max_dirty_pages) {
                osc_consume_write_grant(cli, &oap->oap_brw_page);
                if (transient) {
                        cli->cl_dirty_transit += CFS_PAGE_SIZE;
@@ -1576,9 +1579,9 @@ void osc_wake_cache_waiters(struct client_obd *cli)

                ocw->ocw_rc = -EDQUOT;
                /* we can't dirty more */
-               if ((cli->cl_dirty + CFS_PAGE_SIZE > cli->cl_dirty_max) ||
-                   (cfs_atomic_read(&obd_dirty_pages) + 1 >
-                    obd_max_dirty_pages)) {
+               if (cli->cl_dirty + CFS_PAGE_SIZE > cli->cl_dirty_max ||
+                   cfs_atomic_read(&obd_unstable_pages) + 1 +
+                   cfs_atomic_read(&obd_dirty_pages) > obd_max_dirty_pages) {
                        CDEBUG(D_CACHE, "no dirty room: dirty: %ld "
                               "osc max %ld, sys max %d\n", cli->cl_dirty,
                               cli->cl_dirty_max, obd_max_dirty_pages);
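
To illustrate the effect of counting unstable pages against obd_max_dirty_pages, here is a rough back-of-the-envelope calculation in shell. The numbers (a 64 GB client, a cap of roughly half of RAM, and about 11 GB of unstable pages in flight) are assumptions for illustration only, not measured values or Lustre defaults.

    # Illustration only: hypothetical numbers, not measured Lustre defaults.
    PAGE_KB=4                                  # 4 KiB pages
    RAM_KB=$((64 * 1024 * 1024))               # assume a 64 GB client
    MAX_DIRTY_PAGES=$((RAM_KB / PAGE_KB / 2))  # assume the cap is ~1/2 of RAM
    UNSTABLE_KB=$((11 * 1024 * 1024))          # assume ~11 GB of unstable pages in flight
    UNSTABLE_PAGES=$((UNSTABLE_KB / PAGE_KB))

    # Before the patch: the whole cap is available for dirty pages.
    echo "dirty headroom without unstable accounting: $MAX_DIRTY_PAGES pages"

    # After the patch: unstable pages count against the same cap, so
    # osc_enter_cache_try() fails the check much earlier and writers are
    # pushed toward synchronous writes.
    echo "dirty headroom with unstable accounting:    $((MAX_DIRTY_PAGES - UNSTABLE_PAGES)) pages"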

I think the title of this ticket should be changed to "LU-2139 may cause the performance regression".

Comment by Jodi Levi (Inactive) [ 06/May/13 ]

Oleg will revert the patch in LU-2139 that caused this regression.

Comment by Peter Jones [ 06/May/13 ]

The patch has now been reverted. Ihara, can you confirm whether the problem has now disappeared?

Comment by Prakash Surya (Inactive) [ 06/May/13 ]

How much memory is there on the client? And what is the commit frequency on the servers? I would expect the performance to be worse with the patch Niu points to if the client has sufficient bandwidth (combined with asynchronous server commits) to fill its available dirty page space with unstable pages. So this performance regression might be "working as intended", but that all depends on how many unstable pages the client is accumulating during the test.

Can you sample lctl get_param 'llite.*.unstable_stats' and grep NFS_Unstable /proc/meminfo a few times while the test is running to give me an idea of how many unstable pages are being consumed? If this value is anywhere near the limit set in /proc/sys/lustre/max_dirty_mb, then maybe we need to rethink the default value of max_dirty_mb and set it to something larger.
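
A minimal sampling loop along these lines could be run on the client while IOR is in progress (the 5-second interval and 30-sample count are arbitrary choices, not part of the request above):

    # Sample unstable-page counters on the client during the IOR run.
    for i in $(seq 1 30); do
        date
        lctl get_param 'llite.*.unstable_stats'
        grep NFS_Unstable /proc/meminfo
        cat /proc/sys/lustre/max_dirty_mb
        sleep 5
    done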

Comment by Shuichi Ihara (Inactive) [ 08/May/13 ]

Peter, yes, the performance is back with the latest commit (2.3.65). However, I'm hitting another issue with IOR; I will investigate, and if it's a different problem I will open a new ticket.

Comment by Shuichi Ihara (Inactive) [ 08/May/13 ]

Prakash, attached is the output of lctl get_param 'llite.*.unstable_stats' and grep NFS_Unstable /proc/meminfo captured during IOR. I also collected a collectl log and found that the client was not writing continuously; it alternates between writing and idle (writing; idle; writing; idle; ...).

Our client has 64 GB of memory, and I tested a larger max_dirty_mb (more than 3/4 of memory, up from the default 1/2), but that didn't help either.

Comment by Prakash Surya (Inactive) [ 08/May/13 ]

Prakash, attached is the output of lctl get_param 'llite.*.unstable_stats' and grep NFS_Unstable /proc/meminfo captured during IOR.

Thanks, I'll give these a look.

it alternates between writing and idle (writing; idle; writing; idle; ...)

OK, that sounds exactly like what I saw during my testing when I was hitting the kernel's dirty page limits. Allow me to explain.

So, with that patch in place, we're now properly informing the kernel of our unstable pages by incrementing and decrementing the NFS_Unstable zone page counter. You can see this by watching the NFS_Unstable field in /proc/meminfo (before the patch it will always be zero; after the patch it will fluctuate with Lustre I/O). So that's all well and good, but how does that relate to the idle time seen during I/O? What I think is happening is that the newly accounted-for unstable pages are being factored in when the kernel calls balance_dirty_pages; balance_dirty_pages then determines the system is out of dirty pages and sleeps waiting for writeback to flush the dirty pages to disk.

You can verify this theory by dumping the stacks of all processes while the I/O is stalled and checking whether any of the write threads are stuck sleeping in balance_dirty_pages. What do these files show on your system: /proc/sys/vm/dirty_background_bytes, /proc/sys/vm/dirty_background_ratio, /proc/sys/vm/dirty_bytes, /proc/sys/vm/dirty_ratio? I think the dirty limit on the system is calculated as dirty_ratio * available_memory when dirty_bytes is 0. So in your case the limit is about 12.8 GB (assuming a dirty_ratio of 20, which is the default on my system, and dirty_bytes of 0). Seeing as the value of NFS_Unstable in your logs hovers around 11556864 kB, it's plausible that dirty + unstable is hitting the limit.
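
A quick way to check both parts of this is sketched below (it assumes dirty_bytes is 0 so the ratio-based limit applies, and that sysrq is enabled and you have root for the stack dump):

    # 1. Current writeback tunables.
    grep . /proc/sys/vm/dirty_background_bytes /proc/sys/vm/dirty_background_ratio \
           /proc/sys/vm/dirty_bytes /proc/sys/vm/dirty_ratio

    # 2. Rough dirty limit, approximating available_memory with MemTotal.
    ratio=$(cat /proc/sys/vm/dirty_ratio)
    mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    echo "approx dirty limit: $((mem_kb * ratio / 100)) kB"

    # 3. Dump all task stacks to the kernel log while the I/O is stalled and
    #    look for writer threads sleeping in balance_dirty_pages.
    echo t > /proc/sysrq-trigger
    dmesg | grep -B 15 balance_dirty_pages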

If my above hypothesis is correct, the behavior you were seeing was expected and working as designed. The same problem would occur if you could push NFS at the same rates. If you had the full LU-2139 patch stack applied to the client and servers (http://review.whamcloud.com/4245, http://review.whamcloud.com/4374, http://review.whamcloud.com/4375, http://review.whamcloud.com/5935), I'd expect this effect to go away.

If you can, try setting dirty_ratio and max_dirty_mb to a large fraction of memory and rerun the test.
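
For reference, a sketch of that tuning; the values are illustrative only, and /proc/sys/lustre/max_dirty_mb is the client-wide cap referenced earlier in this ticket:

    # Illustrative values: allow dirty pages up to ~60% of RAM and raise
    # Lustre's client-wide dirty cap to ~48 GB on a 64 GB client.
    sysctl -w vm.dirty_ratio=60
    echo 49152 > /proc/sys/lustre/max_dirty_mb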

Comment by Jodi Levi (Inactive) [ 08/May/13 ]

Lowering priority as LU-2139 was reverted.

Comment by Prakash Surya (Inactive) [ 08/May/13 ]

In case it proves useful, here's an example stack for a thread waiting in the state I described in my previous comment:

    fio           S 00000fffae72633c     0 59338  59283 0x00000000
    Call Trace:
    [c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
    [c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
    [c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
    [c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
    [c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
    [c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
    [c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
    [c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
    [c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
    [c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
    [c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
    [c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
    [c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
    [c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
    [c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
    [c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
    [c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
    [c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c
Comment by Prakash Surya (Inactive) [ 11/Sep/13 ]

Since I've been assigned to this, I'm marking it resolved. The "bad patch" was reverted, and there have been no reports of this since, which leads me to believe it is no longer an issue. Feel free to reopen if there is a compelling case to do so.

General discussion of the LU-2139 issue and its pending patch stack is better suited to the LU-2139 ticket.
