Lustre / LU-3277

LU-2139 may cause a performance regression

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: RHEL6.3 and current master
    • Severity: 2
    • 8114

    Description

      There is a performance regression on the current master (c864582b5d4541c7830d628457e55cd859aee005) when running multiple IOR threads per client. As far as I can tell from testing, LU-2576 might be the cause of this regression. Here are quick test results for each commit.

      client : commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c

      # mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      IOR-2.10.3: MPI Coordinated Test of Parallel I/O
      
      Run began: Sun May  5 12:24:09 2013
      Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
      Using synchronized MPI timer
      Start time skew across all tasks: 0.00 sec
      Path: /lustre/ior.out
      FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
      Participating tasks: 8
      Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
      task 0 on s08
      task 1 on s08
      task 2 on s08
      task 3 on s08
      task 4 on s08
      task 5 on s08
      task 6 on s08
      task 7 on s08
      
      Summary:
      	api                = POSIX
      	test filename      = /lustre/ior.out/file
      	access             = file-per-process
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 8 (8 per node)
      	repetitions        = 1
      	xfersize           = 1 MiB
      	blocksize          = 8 GiB
      	aggregate filesize = 64 GiB
      
      Using Time Stamp 1367781849 (0x5186b1d9) for Data Signature
      Commencing write performance test.
      Sun May  5 12:24:09 2013
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
      ------    ---------  ---------- ---------  --------   --------   --------  --------   ----
      write     3228.38    8388608    1024.00    0.001871   20.30      1.34       20.30      0    XXCEL
      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
      
      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
      write        3228.38    3228.38     3228.38      0.00    3228.38    3228.38     3228.38      0.00  20.29996   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL
      
      Max Write: 3228.38 MiB/sec (3385.20 MB/sec)
      
      Run finished: Sun May  5 12:24:30 2013
      

      client : commit 5661651b2cc6414686e7da581589c2ea0e1f1969

      # mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      IOR-2.10.3: MPI Coordinated Test of Parallel I/O
      
      Run began: Sun May  5 12:16:35 2013
      Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
      Using synchronized MPI timer
      Start time skew across all tasks: 0.00 sec
      Path: /lustre/ior.out
      FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
      Participating tasks: 8
      Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
      task 0 on s08
      task 1 on s08
      task 2 on s08
      task 3 on s08
      task 4 on s08
      task 5 on s08
      task 6 on s08
      task 7 on s08
      
      Summary:
      	api                = POSIX
      	test filename      = /lustre/ior.out/file
      	access             = file-per-process
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 8 (8 per node)
      	repetitions        = 1
      	xfersize           = 1 MiB
      	blocksize          = 8 GiB
      	aggregate filesize = 64 GiB
      
      Using Time Stamp 1367781395 (0x5186b013) for Data Signature
      Commencing write performance test.
      Sun May  5 12:16:35 2013
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
      ------    ---------  ---------- ---------  --------   --------   --------  --------   ----
      write     550.28     8388608    1024.00    0.001730   119.10     2.76       119.10     0    XXCEL
      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
      
      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
      write         550.28     550.28      550.28      0.00     550.28     550.28      550.28      0.00 119.09623   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL
      
      Max Write: 550.28 MiB/sec (577.01 MB/sec)
      
      Run finished: Sun May  5 12:18:34 2013
      

      In both tests, the servers were running the current master (c864582b5d4541c7830d628457e55cd859aee005).
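
      For reference, the two client builds compared above map to these commits; a minimal sketch of reproducing the comparison (checkout only, since build and install steps are site specific and omitted here):

      # faster client, ~3228 MiB/s write in the run above
      git checkout ac37e7b4d101761bbff401ed12fcf671d6b68f9c
      # slower client, ~550 MiB/s write in the run above
      git checkout 5661651b2cc6414686e7da581589c2ea0e1f1969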

      Attachments

        1. collectl.log
          13 kB
        2. stat.log
          16 kB

        Issue Links

          Activity

            [LU-3277] LU-2139 may cause a performance regression
            prakash Prakash Surya (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            prakash Prakash Surya (Inactive) added a comment -

            Since I've been assigned to this, I'm marking it resolved. The "bad patch" was reverted, and there have been no reports of this since, which leads me to believe it is no longer an issue. Feel free to reopen if there is a compelling case to do so.

            General discussion of the LU-2139 issue and the pending patch stack is better suited to the LU-2139 ticket.
            adilger Andreas Dilger made changes -
            Assignee Original: Niu Yawei [ niu ] New: Prakash Surya [ prakash ]
            adilger Andreas Dilger made changes -
            Link New: This issue is blocking LU-2139 [ LU-2139 ]
            adilger Andreas Dilger made changes -
            Fix Version/s New: Lustre 2.6.0 [ 10595 ]

            prakash Prakash Surya (Inactive) added a comment -

            In case it proves useful, here's an example stack for a thread waiting in the state I described in my previous comment:

                fio           S 00000fffae72633c     0 59338  59283 0x00000000
                Call Trace:
                [c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
                [c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
                [c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
                [c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
                [c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
                [c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
                [c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
                [c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
                [c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
                [c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
                [c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
                [c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
                [c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
                [c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
                [c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
                [c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
                [c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
                [c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c
            
            jlevi Jodi Levi (Inactive) made changes -
            Priority Original: Blocker [ 1 ] New: Major [ 3 ]

            jlevi Jodi Levi (Inactive) added a comment -

            Lowering priority as LU-2139 was reverted.
            prakash Prakash Surya (Inactive) added a comment - - edited

            > Prakash, attached are llite.*.unstable_stats and grep NFS_Unstable /proc/meminfo output during IOR.

            Thanks, I'll give these a look.

            > it's doing like this: writing; idle; writing; idle; ...

            OK, that sounds exactly like what I saw during my testing when I was hitting the kernel's dirty page limits. Allow me to explain.

            So, with that patch in place, we're now properly informing the kernel of our unstable pages by incrementing and decrementing the NFS_Unstable zone page counter. You can see this by watching the NFS_Unstable field in /proc/meminfo (before the patch it is always zero; after the patch it fluctuates with Lustre IO). So that's all well and good, but how does it relate to the idle time seen during IO? What I think is happening is that the newly accounted-for unstable pages are being factored in when the kernel calls balance_dirty_pages; balance_dirty_pages then determines that the system has hit its dirty page limit and sleeps, waiting for writeback to flush dirty pages to disk.
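
            As a rough illustration (not from the original comment, and the llite.*.unstable_stats file only exists with the LU-2139 client patch applied), a small watch loop makes the fluctuation easy to see while IOR runs:

                # poll the page accounting once a second during the IOR job
                while true; do
                    grep -E '^Dirty:|^Writeback:|^NFS_Unstable:' /proc/meminfo
                    lctl get_param llite.*.unstable_stats
                    sleep 1
                done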

            You can verify this theory by dumping the stacks of all processes while the IO is stalled and checking whether any of the write threads are stuck sleeping in balance_dirty_pages. What do these files show on your system: /proc/sys/vm/dirty_background_bytes, /proc/sys/vm/dirty_background_ratio, /proc/sys/vm/dirty_bytes, /proc/sys/vm/dirty_ratio? I believe the dirty limit on the system is calculated as dirty_ratio * available_memory when dirty_bytes is 0. So in your case the limit is about 12.8 GB (assuming a dirty_ratio of 20, which is the default on my system, and dirty_bytes of 0). Seeing as the value of NFS_Unstable in your logs hovers around 11556864 kB, it's plausible that dirty + unstable is hitting the limit.
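
            For example, one generic way to run that check (not from the original comment; it assumes the magic sysrq key is enabled on the client):

                # dump every task's stack to the kernel log, then look for write
                # threads parked in balance_dirty_pages while the IO is stalled
                echo t > /proc/sysrq-trigger
                dmesg | grep balance_dirty_pages

                # the knobs the limit is derived from; with dirty_bytes = 0 and
                # dirty_ratio = 20, ~64 GB of dirtyable memory gives the ~12.8 GB
                # limit mentioned above
                cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_bytes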

            If my above hypothesis is correct, the behavior you were seeing was expected and working as designed. The same problem would occur if you could push NFS at the same rates. If you had the full LU-2139 patch stack applied to the clients and servers (http://review.whamcloud.com/4245, http://review.whamcloud.com/4374, http://review.whamcloud.com/4375, http://review.whamcloud.com/5935), I'd expect this effect to go away.

            If you can, try setting dirty_ratio and max_dirty_mb to a large fraction of memory and rerun the test.
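
            For instance (illustrative values only, and assuming the usual osc.*.max_dirty_mb tunable is the max_dirty_mb being referred to):

                # raise the kernel's dirty page ceiling
                sysctl -w vm.dirty_ratio=50
                sysctl -w vm.dirty_background_ratio=25
                # raise the Lustre client's per-OSC dirty cache cap as well
                lctl set_param osc.*.max_dirty_mb=1024
                # then rerun the IOR job and watch NFS_Unstable in /proc/meminfo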

            ihara Shuichi Ihara (Inactive) made changes -
            Attachment New: collectl.log [ 12639 ]
            Attachment New: stat.log [ 12640 ]

            People

              Assignee: prakash Prakash Surya (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: