[LU-3416] Hangs on write in osc_enter_cache() Created: 30/May/13 Updated: 20/Oct/22 Resolved: 29/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mq313 |
| Environment: | Lustre 2.4.0-RC1_3chaos (https://github.com/chaos/lustre/tree/2.4.0-RC1_3chaos), ZFS servers |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 8471 |
| Description |
|
We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:

2013-05-29 16:15:36.363547 sysiod S 00000fffa85363bc 0 4111 3070 0x00000000
2013-05-29 16:15:36.363582 Call Trace:
2013-05-29 16:15:36.363617 [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
2013-05-29 16:15:36.363653 [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
2013-05-29 16:15:36.363688 [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
2013-05-29 16:15:36.363723 [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
2013-05-29 16:15:36.363758 [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
2013-05-29 16:15:36.363793 [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
2013-05-29 16:15:36.363828 [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
2013-05-29 16:15:36.363863 [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
2013-05-29 16:15:36.363898 [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
2013-05-29 16:15:36.363934 [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
2013-05-29 16:15:36.363968 [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
2013-05-29 16:15:36.364003 [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
2013-05-29 16:15:36.364038 [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
2013-05-29 16:15:36.364074 [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
2013-05-29 16:15:36.364109 [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
2013-05-29 16:15:36.364143 [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
2013-05-29 16:15:36.364178 [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
2013-05-29 16:15:36.364213 [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
2013-05-29 16:15:36.364248 [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
2013-05-29 16:15:36.364283 [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
2013-05-29 16:15:36.364317 [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
2013-05-29 16:15:36.364352 [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
2013-05-29 16:15:36.364387 [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
2013-05-29 16:15:36.364422 [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
2013-05-29 16:15:36.364458 [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
2013-05-29 16:15:36.364492 [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c

This was with Lustre 2.4.0-RC1_3chaos. |
| Comments |
| Comment by Christopher Morrone [ 30/May/13 ] |
|
Just as previously reported, echo 3 > /proc/sys/vm/drop_caches gets the stuck writes moving again. |
| Comment by Niu Yawei (Inactive) [ 30/May/13 ] |
|
Hi, Chris
Thanks. |
| Comment by Christopher Morrone [ 30/May/13 ] |
I pointed you to the exact tag that we are running. Our patches in the branch are clearly based on 2.4.0-RC1. So yes, of course we have that fix applied. You guys applied it.
Again, see the tag that I included in the bug report. This branch does not include any experimental work to improve that code path. I ran `git diff 2.4.0-RC1 2.4.0-RC1_3chaos | grep lli_write_mutex` and it did not report any results, so no direct changes that I am aware of.
max_dirty_mb is 32 for all OSCs. I didn't get cur_dirty_bytes. I did capture dump_page_cache; see attached. Keep in mind that BG/Q is ppc64, and pages are 64K. Looks like there were 7129 dirty pages. At 64K each, that is 456,256 KiB. There are 16 OSTs, so on average there were 28,516 KiB dirty per OST. It seems pretty likely that some of the OSTs were at the max_dirty_mb limit.
3000 |
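As a side note on the arithmetic in the comment above, here is a minimal sanity check in plain Python (not part of Lustre); the 64 KiB page size, 16 OSTs, and max_dirty_mb=32 are taken from this comment, everything else is just illustration:

```python
# Back-of-the-envelope check of the numbers quoted above: 64 KiB pages
# (ppc64), 16 OSTs, and a 32 MB max_dirty_mb limit per OSC.
PAGE_KIB = 64
dirty_pages = 7129
num_osts = 16
max_dirty_mb = 32

total_kib = dirty_pages * PAGE_KIB        # 456256 KiB of dirty data in total
per_ost_kib = total_kib / num_osts        # 28516 KiB per OST on average
limit_kib = max_dirty_mb * 1024           # 32768 KiB allowed per OSC

print(total_kib, per_ost_kib, limit_kib)
print(f"average is {100 * per_ost_kib / limit_kib:.0f}% of the limit")
```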
| Comment by Christopher Morrone [ 30/May/13 ] |
|
And yes, the threads are stuck for much, much longer than the 30 second dirty_expire_centisecs value. As I said, they appear to be stuck indefinitely. My guess is that they will never make progress without external intervention. They appeared to be stuck for at least 2 hours before I intervened by triggering drop_caches. |
| Comment by Christopher Morrone [ 31/May/13 ] |
|
A "sync" command on the command line also successfully clears the outstanding dirty data. Here is cur_dirty_bytes for one of the clients that is hanging on write: osc.lsrzb-OST0001-osc-c0000003c6ed3400.cur_dirty_bytes=27852800 osc.lsrzb-OST0002-osc-c0000003c6ed3400.cur_dirty_bytes=25755648 osc.lsrzb-OST0003-osc-c0000003c6ed3400.cur_dirty_bytes=28246016 osc.lsrzb-OST0004-osc-c0000003c6ed3400.cur_dirty_bytes=26935296 osc.lsrzb-OST0005-osc-c0000003c6ed3400.cur_dirty_bytes=26411008 osc.lsrzb-OST0006-osc-c0000003c6ed3400.cur_dirty_bytes=29229056 osc.lsrzb-OST0007-osc-c0000003c6ed3400.cur_dirty_bytes=26083328 osc.lsrzb-OST0008-osc-c0000003c6ed3400.cur_dirty_bytes=27459584 osc.lsrzb-OST0009-osc-c0000003c6ed3400.cur_dirty_bytes=27787264 osc.lsrzb-OST000a-osc-c0000003c6ed3400.cur_dirty_bytes=29163520 osc.lsrzb-OST000b-osc-c0000003c6ed3400.cur_dirty_bytes=27656192 osc.lsrzb-OST000c-osc-c0000003c6ed3400.cur_dirty_bytes=27131904 osc.lsrzb-OST000d-osc-c0000003c6ed3400.cur_dirty_bytes=25362432 osc.lsrzb-OST000e-osc-c0000003c6ed3400.cur_dirty_bytes=27525120 osc.lsrzb-OST000f-osc-c0000003c6ed3400.cur_dirty_bytes=27197440 osc.lsrzb-OST0010-osc-c0000003c6ed3400.cur_dirty_bytes=27262976 The dirty data is actually pretty evenly spread. There were 512 files being written through this client at the time of the hang. I also note that the flush-lustre-1 thread is waking up regularly, but does not seem to result in any of the dirty data being written. I enabled full lustre debugging watched flush-lustre-1 run for a bit in top, and then dumped the lustre log. See attached file rzuseqio14_lustre_log.txt I haven't stepped through the code and logs long enough yet to understand why it is doing so much but failing to trigger the sending of the dirty data. |
| Comment by Niu Yawei (Inactive) [ 31/May/13 ] |
|
The log (rzuseqio14_lustre_log.txt) shows that ll_writepages() didn't trigger any dirty flush at all:

...
00000008:00000001:21.0:1369956379.455717:3872:10295:0:(osc_io.c:611:osc_io_fsync_start()) Process entered
00000008:00000001:21.0:1369956379.455724:4368:10295:0:(osc_cache.c:2914:osc_cache_writeback_range()) Process entered
00000008:00000020:21.0:1369956379.455732:4368:10295:0:(osc_cache.c:3010:osc_cache_writeback_range()) obj c0000002eacf46e0 ready 0|-|- wr 15|-|- rd 0|- cache page out.
00000008:00000001:21.0:1369956379.455744:4496:10295:0:(osc_cache.c:3011:osc_cache_writeback_range()) Process leaving (rc=0 : 0 : 0)
...

I think these dirty pages are supposed to be on the oo_urgent_exts list, but it seems they were not on either oo_hp_exts or oo_urgent_exts. Jinshan, any idea how this could happen? |
| Comment by Jinshan Xiong (Inactive) [ 31/May/13 ] |
|
Obviously there existed an active extent, so osc_io_fsync_start() didn't move it into the urgent list. Hi Chris, just to clarify: do both `echo 3 > drop_caches' and sync get it out of the hang state? |
| Comment by Jinshan Xiong (Inactive) [ 31/May/13 ] |
|
It looks like this issue can only be seen on ppc clients, so let's drop the priority to not block 2.4 release while we're working on it. |
| Comment by Christopher Morrone [ 31/May/13 ] |
Yes, that is correct.
I disagree. Just because we discovered it on ppc clients doesn't mean it is a ppc-only problem. |
| Comment by Jinshan Xiong (Inactive) [ 31/May/13 ] |
|
Hi Chris, if you see this issue again, try whether `echo 1 > drop_caches' can also release the hang state. It would also be helpful to check how much grant there is on the corresponding OSC. |
| Comment by Jinshan Xiong (Inactive) [ 31/May/13 ] |
|
Here are two more global dirty page parameters: obd_dirty_pages and obd_max_dirty_pages. Can you please check them as well? |
| Comment by Andreas Dilger [ 31/May/13 ] |
|
Chris, how frequent are the hangs, and how many systems does this affect? We've had several days of SWL runs on recent Lustre tags without seeing anything similar on x86_64 clients. While I don't want to minimize the importance of this issue, the main question is whether this problem is serious and widespread enough that it needs to block 2.4.0 from being released to any users, or whether it is possible to fix this after the 2.4.0 release? LLNL will be running a patched version of Lustre regardless of whether this bug is fixed before 2.4.0 or afterward, so if this is more prevalent with PPC clients or ZFS servers it will not affect a majority of Lustre users. Also, by making Lustre 2.4.0 available to a wider user base the 2.4.0 code will get more testing and fixes, which helps LLNL as well. This bug is the only one holding the 2.4.0 release at this point, so we'd really prefer to drop it from the blockers list and release 2.4.0. |
| Comment by Peter Jones [ 31/May/13 ] |
|
Andreas, I just gave Chris a call and he agrees that, while this may not necessarily be a PPC issue, the amount of testing elsewhere demonstrates that this is a rare issue to hit and one that we can defer to 2.4.1. Peter |
| Comment by Christopher Morrone [ 31/May/13 ] |
|
Where would I find those? Here's what I found in proc:

rzuseqio16-ib0@root:lctl get_param 'osc.*.cur_dirty_bytes'
osc.lsrzb-OST0001-osc-c0000003c737a400.cur_dirty_bytes=27656192
osc.lsrzb-OST0002-osc-c0000003c737a400.cur_dirty_bytes=27983872
osc.lsrzb-OST0003-osc-c0000003c737a400.cur_dirty_bytes=30146560
osc.lsrzb-OST0004-osc-c0000003c737a400.cur_dirty_bytes=28508160
osc.lsrzb-OST0005-osc-c0000003c737a400.cur_dirty_bytes=28901376
osc.lsrzb-OST0006-osc-c0000003c737a400.cur_dirty_bytes=30277632
osc.lsrzb-OST0007-osc-c0000003c737a400.cur_dirty_bytes=29294592
osc.lsrzb-OST0008-osc-c0000003c737a400.cur_dirty_bytes=27394048
osc.lsrzb-OST0009-osc-c0000003c737a400.cur_dirty_bytes=29884416
osc.lsrzb-OST000a-osc-c0000003c737a400.cur_dirty_bytes=28835840
osc.lsrzb-OST000b-osc-c0000003c737a400.cur_dirty_bytes=29753344
osc.lsrzb-OST000c-osc-c0000003c737a400.cur_dirty_bytes=27262976
osc.lsrzb-OST000d-osc-c0000003c737a400.cur_dirty_bytes=29818880
osc.lsrzb-OST000e-osc-c0000003c737a400.cur_dirty_bytes=27590656
osc.lsrzb-OST000f-osc-c0000003c737a400.cur_dirty_bytes=28049408
osc.lsrzb-OST0010-osc-c0000003c737a400.cur_dirty_bytes=27983872

rzuseqio16-ib0@root:lctl get_param 'osc.*.cur_grant_bytes'
osc.lsrzb-OST0001-osc-c0000003c737a400.cur_grant_bytes=33554432
osc.lsrzb-OST0002-osc-c0000003c737a400.cur_grant_bytes=51511296
osc.lsrzb-OST0003-osc-c0000003c737a400.cur_grant_bytes=41746432
osc.lsrzb-OST0004-osc-c0000003c737a400.cur_grant_bytes=36831232
osc.lsrzb-OST0005-osc-c0000003c737a400.cur_grant_bytes=38010880
osc.lsrzb-OST0006-osc-c0000003c737a400.cur_grant_bytes=35848192
osc.lsrzb-OST0007-osc-c0000003c737a400.cur_grant_bytes=33488896
osc.lsrzb-OST0008-osc-c0000003c737a400.cur_grant_bytes=37093376
osc.lsrzb-OST0009-osc-c0000003c737a400.cur_grant_bytes=63897600
osc.lsrzb-OST000a-osc-c0000003c737a400.cur_grant_bytes=35454976
osc.lsrzb-OST000b-osc-c0000003c737a400.cur_grant_bytes=37683200
osc.lsrzb-OST000c-osc-c0000003c737a400.cur_grant_bytes=33488896
osc.lsrzb-OST000d-osc-c0000003c737a400.cur_grant_bytes=34275328
osc.lsrzb-OST000e-osc-c0000003c737a400.cur_grant_bytes=33751040
osc.lsrzb-OST000f-osc-c0000003c737a400.cur_grant_bytes=35454976
osc.lsrzb-OST0010-osc-c0000003c737a400.cur_grant_bytes=40828928

rzuseqio16-ib0@root:lctl get_param 'osc.*.cur_lost_grant_bytes'
osc.lsrzb-OST0001-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST0002-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST0003-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST0004-osc-c0000003c737a400.cur_lost_grant_bytes=1073803264
osc.lsrzb-OST0005-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST0006-osc-c0000003c737a400.cur_lost_grant_bytes=1073803264
osc.lsrzb-OST0007-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST0008-osc-c0000003c737a400.cur_lost_grant_bytes=1073803264
osc.lsrzb-OST0009-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST000a-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST000b-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST000c-osc-c0000003c737a400.cur_lost_grant_bytes=1073803264
osc.lsrzb-OST000d-osc-c0000003c737a400.cur_lost_grant_bytes=1073803264
osc.lsrzb-OST000e-osc-c0000003c737a400.cur_lost_grant_bytes=2147606528
osc.lsrzb-OST000f-osc-c0000003c737a400.cur_lost_grant_bytes=0
osc.lsrzb-OST0010-osc-c0000003c737a400.cur_lost_grant_bytes=0

rzuseqio16-ib0@root:lctl get_param 'osc.*.max_dirty_mb'
osc.lsrzb-OST0001-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0002-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0003-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0004-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0005-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0006-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0007-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0008-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0009-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST000a-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST000b-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST000c-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST000d-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST000e-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST000f-osc-c0000003c737a400.max_dirty_mb=32
osc.lsrzb-OST0010-osc-c0000003c737a400.max_dirty_mb=32 |
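As an aside, a small helper like the following (hypothetical, not shipped with Lustre) can digest these `lctl get_param` dumps and show how close each OSC is to its max_dirty_mb ceiling; the parameter names are the ones queried above, everything else is only a sketch:

```python
import re
import subprocess

# Hypothetical helper: summarize per-OSC dirty state from the same lctl
# parameters dumped above. Run on a Lustre client as root.
def get_param(pattern):
    out = subprocess.run(["lctl", "get_param", pattern],
                         capture_output=True, text=True, check=True).stdout
    values = {}
    for line in out.splitlines():
        m = re.match(r"osc\.(\S+?)\.\w+=(\d+)$", line.strip())
        if m:
            values[m.group(1)] = int(m.group(2))
    return values

dirty = get_param("osc.*.cur_dirty_bytes")
grant = get_param("osc.*.cur_grant_bytes")
limit = get_param("osc.*.max_dirty_mb")          # reported in MB

for osc in sorted(dirty):
    limit_bytes = limit.get(osc, 0) * 1024 * 1024
    pct = 100.0 * dirty[osc] / limit_bytes if limit_bytes else 0.0
    print(f"{osc}: dirty {dirty[osc]:>10} of {limit_bytes} ({pct:3.0f}%), "
          f"grant {grant.get(osc, 0)}")
```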
| Comment by Christopher Morrone [ 01/Jun/13 ] |
Like Peter said, I don't particularly care whether this problem gets fixed in 2.4.0. But it is a very frequent problem. In fact, it may never have been fixed like we thought a few months ago. I've learned that a sysadmin was running "echo 3 > /proc/sys/vm/drop_caches" to get around the problem, but I was not aware that it was still a problem. It affects all of our production systems that are running pre-2.4. Of course, we're only doing that on BG/Q systems with ZFS. |
| Comment by Andreas Dilger [ 01/Jun/13 ] |
|
Chris, I'm just on my phone so can't verify, but I believe the parameters Jinshan is referencing are global parameters in llite or at the top level. |
| Comment by Christopher Morrone [ 03/Jun/13 ] |
|
This is from the same rzuseqio16 node as last Friday. Still hung in the same place after the full weekend. The obd_dirty_pages number exactly matches the sum of the 16 cur_dirty_bytes numbers (keeping in mind that pages are 64K on this platform): 459,341,824 bytes.

crash> print obd_dirty_pages
$8 = {
  counter = 7009
}
crash> print obd_max_dirty_pages
$9 = 131071
|
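A quick check of the page arithmetic in the comment above (illustration only; 64 KiB pages as stated):

```python
# obd_dirty_pages counts 64 KiB pages on this ppc64 client, so converting it
# to bytes should reproduce the sum of the per-OSC cur_dirty_bytes values.
PAGE_SIZE = 64 * 1024
obd_dirty_pages = 7009
print(obd_dirty_pages * PAGE_SIZE)   # 459341824 bytes, matching the quoted sum
```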
| Comment by Christopher Morrone [ 04/Jun/13 ] |
|
Any ideas? |
| Comment by Niu Yawei (Inactive) [ 05/Jun/13 ] |
|
Chris, I have been looking at the related code these past few days, but haven't found any clue so far. By the way, were there many truncate operations? |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/13 ] |
|
I can take a look at this issue if that's okay with you guys. |
| Comment by Peter Jones [ 05/Jun/13 ] |
|
Yes please Jinshan! |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/13 ] |
|
Please apply patch http://review.whamcloud.com/6554 to your tree to print accurate information when this problem happens. I tend to suspect a wakeup is missed somewhere, but I need to verify. |
| Comment by Christopher Morrone [ 05/Jun/13 ] |
|
Ok, I'll apply that and get the info. But there certainly seems to be an additional problem here. The Lustre client seems to be rather poorly integrated with the kernel's way of doing things. The kernel is clearly trying to write out these dirty pages, and Lustre completely ignores the kernel. That is Bad. I would like to see that problem fixed as well. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/13 ] |
|
Hi Chris, I'm not sure I understand. Do you mean the dirty pages won't be written back if EDQUOT is returned? Actually, in vvp_io_commit_write(), if EDQUOT is seen, Lustre will try to write the page in sync mode. |
| Comment by Christopher Morrone [ 05/Jun/13 ] |
|
I don't know what EDQUOT is, so no. I am referring to the kernel's normal BDI mechanism for regularly pushing dirty pages to disk. The calls are being made into Lustre, but Lustre decides to ignore them. I don't understand the whole call stack yet (by a long shot), but clearly the kernel's normal mechanisms get the Lustre code into osc_cache_writeback_range(), where it notes that dirty pages exist. However, Lustre chooses to do absolutely nothing with that information, rather than obeying the kernel's request to write out the dirty pages. That has got to have all kinds of bad implications for low-memory situations on nodes. And not too surprisingly, we have all kinds of memory problems with the new Lustre client. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/13 ] |
|
Do you mean the call path of ->writepage()? In that case, it will mark the dirty page as urgent and then wait for it to be written in ll_writepage(). Yes, I agree that the current client has memory and performance problems, and I will give you an explanation soon. |
| Comment by Christopher Morrone [ 06/Jun/13 ] |
|
Jinshan, your patch does hit ten minutes after the hang. Here are the 6 lines that popped out on the console from one of the clients:

2013-06-06 15:05:48.739336 {DefaultControlEventListener} [mmcs]{22}.1.1: LustreError: 4820:0:(osc_cache.c:1559:osc_enter_cache()) lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 2147606528 avail: 27459584, reserved: 0, flight: 0 } try to reserve 65536.
2013-06-06 15:05:48.740174 {DefaultControlEventListener} [mmcs]{22}.4.0: LustreError: 4726:0:(osc_cache.c:1559:osc_enter_cache()) lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 2147606528 avail: 27459584, reserved: 0, flight: 0 } try to reserve 65536.
2013-06-06 15:05:48.740565 {DefaultControlEventListener} [mmcs]{22}.2.3: LustreError: 4751:0:(osc_cache.c:1559:osc_enter_cache()) lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 0 avail: 27459584, reserved: 0, flight: 0 } try to reserve 65536.
2013-06-06 15:05:48.740946 {DefaultControlEventListener} [mmcs]{22}.7.2: LustreError: 4733:0:(osc_cache.c:1559:osc_enter_cache()) lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 0 avail: 27459584, reserved: 0, flight: 0 } try to reserve 65536.
2013-06-06 15:05:48.741355 {DefaultControlEventListener} [mmcs]{22}.13.0: LustreError: 4830:0:(osc_cache.c:1559:osc_enter_cache()) lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 0 avail: 27459584, reserved: 0, flight: 0 } try to reserve 65536.
2013-06-06 15:05:48.741649 {DefaultControlEventListener} [mmcs]{22}.10.0: LustreError: 4724:0:(osc_cache.c:1559:osc_enter_cache()) lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 0 avail: 27459584, reserved: 0, flight: 1 } try to reserve 65536.

Your patch seems to keep data flowing, but with 10 minute pauses here and there, of course. Do you understand how to fix this, then? Also, be aware that we are using the following settings on our BG/Q clients:

sysctl vm.min_free_kbytes=1048576
/usr/sbin/lctl set_param "llite.*.max_cached_mb=4096"
/usr/sbin/lctl set_param "llite.*.max_read_ahead_mb=10" |
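One way to read these osc_enter_cache() messages is sketched below (hypothetical parsing code; the interpretation of the fields is inferred from the values shown, not taken from the debug patch itself):

```python
import re

# Hypothetical reading of the osc_enter_cache() lines above: the writer asks
# to reserve 64 KiB, but current dirty already equals the per-OSC maximum and
# nothing is in flight, so nothing will free space without intervention.
line = ("lsrzb-OST000d-osc-c0000003c7540c00: { dirty: 33554432/33554432 "
        "dirty_pages: 6441/131071 unstable_pages: 0/131071 dropped: 0 "
        "avail: 27459584, reserved: 0, flight: 0 } try to reserve 65536.")

m = re.search(r"dirty: (\d+)/(\d+).*flight: (\d+).*reserve (\d+)", line)
dirty, dirty_max, flight, want = map(int, m.groups())

if dirty + want > dirty_max and flight == 0:
    print("OSC is at its dirty limit with no RPCs in flight: the writer "
          "cannot proceed until something else flushes or deducts dirty bytes")
```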
| Comment by Jinshan Xiong (Inactive) [ 06/Jun/13 ] |
|
About the cache, another useful piece of information would be: lctl get_param osc.*.osc_cached_mb, so that I know how busy it is. |
| Comment by Jinshan Xiong (Inactive) [ 07/Jun/13 ] |
|
Apparently I should have adjusted cl_dirty in osc_unreserve_grant() as well, working on a patch... |
| Comment by Jinshan Xiong (Inactive) [ 07/Jun/13 ] |
|
Can you please try patch set 2 of http://review.whamcloud.com/6554? |
| Comment by Christopher Morrone [ 07/Jun/13 ] |
|
Yes, I'll try it out. Thanks, Jinshan. |
| Comment by Christopher Morrone [ 07/Jun/13 ] |
|
The problem seems to have disappeared with Patch Set 2 in place. But now write performance is unacceptably bad. Four BG/Q clients are only pushing a total of around 720 MB/s (180 MB/s each). That number should be closer to 1 GB/s per client. |
| Comment by Jinshan Xiong (Inactive) [ 07/Jun/13 ] |
|
This patch should have nothing to do with the performance. Maybe you can try rolling back to the unpatched version and see if it's due to environment settings. Of course, we can simplify the problem by using dd to check performance. If you can still see the performance drop, please collect a log with D_CACHE enabled (lctl set_param debug=cache); I will take a look to make sure it's in the correct path. |
| Comment by Christopher Morrone [ 07/Jun/13 ] |
|
I double checked, and I am certain that the patch is the only difference. FYI, I've moved to Lustre version 2.4.0-RC2_2chaos on both clients and servers. Without the patch, I see write throughput burst to 4.4 GB/s initially (as seen in ltop), and then taper off to a fairly constant rate in the low 3 GB/s range. With the patch, I see 720 MB/s. (These are aggregate throughput numbers for 4 BG/Q clients.) Keep in mind that our tree has the unstable page tracking changes. I don't know if there is a relation there or not, but you should be aware. Without your patch, at the beginning of an IOR write phase, you might see something like this:

rzuseqio13-ib0@morrone:lctl get_param "osc.*.cur_dirty_bytes"
osc.lsrzb-OST0001-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0002-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0003-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0004-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0005-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0006-osc-c0000003e18f6400.cur_dirty_bytes=33161216
osc.lsrzb-OST0007-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0008-osc-c0000003e18f6400.cur_dirty_bytes=32899072
osc.lsrzb-OST0009-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST000a-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST000b-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST000c-osc-c0000003e18f6400.cur_dirty_bytes=33030144
osc.lsrzb-OST000d-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST000e-osc-c0000003e18f6400.cur_dirty_bytes=12582912
osc.lsrzb-OST000f-osc-c0000003e18f6400.cur_dirty_bytes=33554432
osc.lsrzb-OST0010-osc-c0000003e18f6400.cur_dirty_bytes=33554432

And that seems fairly normal. But with patch set 2 of 6554 applied I see these bogus numbers:

rzuseqlac2:~/BG-191$ rsh rzuseqio13 '/usr/sbin/lctl get_param "osc.*.cur_dirty_bytes"'
osc.lsrzb-OST0001-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0002-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0003-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0004-osc-c0000003ec957c00.cur_dirty_bytes=18446744073703260160
osc.lsrzb-OST0005-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0006-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0007-osc-c0000003ec957c00.cur_dirty_bytes=18446744073706930176
osc.lsrzb-OST0008-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0009-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST000a-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST000b-osc-c0000003ec957c00.cur_dirty_bytes=18446744073706536960
osc.lsrzb-OST000c-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST000d-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST000e-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST000f-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
osc.lsrzb-OST0010-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464

I'm guessing that those large numbers are simply an arithmetic problem, where we're allowing the number to become negative. That somehow results in very slow write throughput, perhaps because all operations are now completely synchronous? |
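For what it's worth, the "bogus" values above do decode to small negative multiples of the 64 KiB page size when reinterpreted as signed 64-bit counters, which is consistent with the underflow guess (a quick check, not Lustre code):

```python
# The huge cur_dirty_bytes values above are unsigned prints of negative
# 64-bit counters; reinterpreted as signed they are small negative multiples
# of the 64 KiB page size, consistent with an accounting underflow.
def to_signed64(raw):
    return raw - (1 << 64) if raw >= (1 << 63) else raw

for raw in (18446744073707454464, 18446744073703260160, 18446744073706930176):
    signed = to_signed64(raw)
    print(f"{raw} -> {signed} bytes ({signed // 65536} pages of 64 KiB)")
# 18446744073707454464 -> -2097152 bytes (-32 pages), and so on
```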
| Comment by Jinshan Xiong (Inactive) [ 07/Jun/13 ] |
|
ah yes, indeed. The dirty bytes are actually negative numbers. I'll check it. |
| Comment by Jinshan Xiong (Inactive) [ 07/Jun/13 ] |
|
I found the problem. Working on a patch. |
| Comment by Jinshan Xiong (Inactive) [ 10/Jun/13 ] |
|
Will you please try patch set 3 and see if it helps? |
| Comment by Christopher Morrone [ 11/Jun/13 ] |
|
The performance problem (and presumably the negative numbers) is gone, but it is still hanging for ten minutes like with patch set one. I'll post some info tomorrow. |
| Comment by Christopher Morrone [ 11/Jun/13 ] |
|
Here's output from one of the debug messages before I leave for the day:

LustreError: 3709:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
LustreError: 3736:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
|
| Comment by Jinshan Xiong (Inactive) [ 11/Jun/13 ] |
|
This is awkward - I thought it must be sleeping for LRU slots while holding some dirty pages. The problem is obvious: cl_dirty couldn't be deducted for some reason. One possible reason would be that the ptlrpc threads were blocked somewhere so brw_interpret() couldn't be called, or that processes are waiting for something while holding active osc_extents. Can you please take a look at the backtraces of running processes and tell me whether any other process is blocked on something? Also, I worked out patch set 4, which dumps the extent tree when this problem happens. This can help us isolate the root cause. |
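To illustrate the failure mode being described here, the toy model below assumes only that a per-OSC dirty counter gates writers; it is not the actual Lustre code path. If the completion side never deducts the counter (or never issues the wakeup), every later writer blocks in the enter-cache wait indefinitely:

```python
import threading

# Toy model: writers reserve space against a per-OSC dirty limit and sleep
# until a completion deducts it. If the completion path skips the deduction
# (or the wakeup), later writers block forever, which is what a missed
# cl_dirty deduction would look like from the outside.
class DirtyAccount:
    def __init__(self, limit):
        self.limit = limit
        self.dirty = 0
        self.cond = threading.Condition()

    def enter_cache(self, nbytes, timeout=None):
        """Block until nbytes fits under the dirty limit; return success."""
        with self.cond:
            ok = self.cond.wait_for(
                lambda: self.dirty + nbytes <= self.limit, timeout)
            if ok:
                self.dirty += nbytes
            return ok

    def write_completed(self, nbytes, deduct=True):
        with self.cond:
            if deduct:                  # the step that must not be skipped
                self.dirty -= nbytes
            self.cond.notify_all()

acct = DirtyAccount(limit=32 * 1024 * 1024)
acct.enter_cache(32 * 1024 * 1024)                    # fill the 32 MB limit
acct.write_completed(32 * 1024 * 1024, deduct=False)  # buggy completion path
print(acct.enter_cache(64 * 1024, timeout=1.0))       # False: writer is stuck
```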
| Comment by Christopher Morrone [ 11/Jun/13 ] |
|
See new attachment rzuseqio15_console.txt.bz2 |
| Comment by Jinshan Xiong (Inactive) [ 12/Jun/13 ] |
|
Can you please try patch set 5? |
| Comment by Christopher Morrone [ 12/Jun/13 ] |
|
Will do. |
| Comment by Christopher Morrone [ 12/Jun/13 ] |
|
I have been running the reproducer, and haven't had a single hang yet with patch set 5 applied. I will leave it running overnight. |
| Comment by Jinshan Xiong (Inactive) [ 12/Jun/13 ] |
|
No news is good news, no? If this patch survives, can you please do me a favor: roll back this patch, make the hang happen, and collect logs while doing `echo 3 > drop_caches'? I don't understand why that makes it go forward. |
| Comment by Christopher Morrone [ 13/Jun/13 ] |
|
Yes, I haven't been able to reproduce the problem again. Performance is in the acceptable range. I should be able to get that log tomorrow. |
| Comment by Christopher Morrone [ 14/Jun/13 ] |
|
See the attached rzuseq*_drop_caches.bz2 files. When I got a write hang, I enabled full lustre debugging, issued an 'echo 2 > drop_caches', waited a few seconds, and then dumped the logs on all four clients. Hopefully there is something useful in there. |
| Comment by Jinshan Xiong (Inactive) [ 21/Jun/13 ] |
|
Hi Chris, can you please try the last patch set? If this works, I will land it into master. |
| Comment by Jinshan Xiong (Inactive) [ 21/Jun/13 ] |
|
I took a look at the log but unfortunately I didn't find anything interesting. For example, for log rzuseqio15_drop_caches.txt, obviously process 15336 is the drop_caches process and 15302 is the flush process. So I tried to dig out activity from anything other than these two processes with:

[jinxiong@intel tmp]$ cat rzuseqio15_drop_caches.txt | grep -v '15336:' | grep -v ':15302:' | grep cl_io
[jinxiong@intel tmp]$

but nothing is printed out. I don't know why this can happen, because the pages already exist but no process is related to them. It will be helpful to figure out which process was stuck before dropping caches. The purpose of doing this is to fully understand why dropping caches can make the hung process go forward. Though we can fix the problem with the latest patch, it's really bad if we don't understand a code path clearly. |
| Comment by Christopher Morrone [ 24/Jun/13 ] |
|
It takes quite a while to run drop_caches. I fear that getting the log of the stuck processes will be quite difficult with the other threads swamping the logs. But I'll keep this as a background task for when I have time. In the meantime, I'll update to the latest patch. |
| Comment by Christopher Morrone [ 26/Jun/13 ] |
|
Jinshan, testing of change 6554 patch set 7 went well. No failures, and no new performance problems noted. The commit message will need to be rewritten before landing. |
| Comment by Niu Yawei (Inactive) [ 29/Aug/13 ] |
|
patch landed for 2.5 |