[LU-2576] Hangs in osc_enter_cache due to dirty pages not being flushed Created: 04/Jan/13 Updated: 14/Sep/13 Resolved: 15/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, ptr, sequoia, topsequoia | ||
| Attachments: |
|
| Issue Links: |
|
| Sub-Tasks: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6013 |
| Description |
|
We've had reports of IO writes hanging in Sequoia, and after some initial debugging, have narrowed the hung threads to getting stuck with the following stack trace:
sysiod S 00000fffa842633c 0 27910 3135 0x00000000
Call Trace:
[c0000003ee34e480] [c0000003ee34e590] 0xc0000003ee34e590 (unreliable)
[c0000003ee34e650] [c000000000009b8c] .__switch_to+0xc4/0x100
[c0000003ee34e6e0] [c0000000004364c8] .schedule+0x858/0x9c0
[c0000003ee34e990] [8000000000a720c0] .cfs_waitq_wait+0x10/0x30 [libcfs]
[c0000003ee34ea00] [8000000004533560] .osc_enter_cache+0x880/0x12c0 [osc]
[c0000003ee34ebd0] [800000000453b210] .osc_queue_async_io+0xd10/0x1a40 [osc]
[c0000003ee34edf0] [8000000004516fe8] .osc_page_cache_add+0xf8/0x2a0 [osc]
[c0000003ee34eeb0] [80000000024468b8] .cl_page_cache_add+0xe8/0x3b0 [obdclass]
[c0000003ee34efe0] [8000000004facae8] .lov_page_cache_add+0xc8/0x340 [lov]
[c0000003ee34f0b0] [80000000024468b8] .cl_page_cache_add+0xe8/0x3b0 [obdclass]
[c0000003ee34f1e0] [80000000067ecda4] .vvp_io_commit_write+0x474/0x8a0 [lustre]
[c0000003ee34f300] [800000000246098c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
[c0000003ee34f3c0] [80000000067af410] .ll_commit_write+0x120/0x3e0 [lustre]
[c0000003ee34f490] [80000000067d1634] .ll_write_end+0x34/0x80 [lustre]
[c0000003ee34f520] [c000000000098bac] .generic_file_buffered_write+0x1ec/0x374
[c0000003ee34f660] [c000000000099290] .__generic_file_aio_write+0x374/0x3d8
[c0000003ee34f760] [c00000000009936c] .generic_file_aio_write+0x78/0xe8
[c0000003ee34f810] [80000000067eff9c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
[c0000003ee34f8e0] [800000000245aedc] .cl_io_start+0xcc/0x220 [obdclass]
[c0000003ee34f980] [8000000002462cf4] .cl_io_loop+0x194/0x2c0 [obdclass]
[c0000003ee34fa30] [800000000676a278] .ll_file_io_generic+0x498/0x670 [lustre]
[c0000003ee34fb30] [800000000676a8d4] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
[c0000003ee34fc00] [800000000676abf0] .ll_file_write+0x150/0x320 [lustre]
[c0000003ee34fce0] [c0000000000d5968] .vfs_write+0xd0/0x1c4
[c0000003ee34fd80] [c0000000000d5b58] .SyS_write+0x54/0x98
[c0000003ee34fe30] [c000000000000580] syscall_exit+0x0/0x2c
The specific line it's stuck at is here:
(gdb) l *osc_enter_cache+0x880
0x47f50 is in osc_enter_cache (/builddir/build/BUILD/lustre-2.3.56/lustre/osc/osc_cache.c:1526).
1521    /builddir/build/BUILD/lustre-2.3.56/lustre/osc/osc_cache.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.56/lustre/osc/osc_cache.c
1523            CDEBUG(D_CACHE, "%s: sleeping for cache space @ %p for %p\n",
1524                   cli->cl_import->imp_obd->obd_name, &ocw, oap);
1525
1526            rc = l_wait_event(ocw.ocw_waitq,
1527                              cfs_list_empty(&ocw.ocw_entry), &lwi);
1528
1529            client_obd_list_lock(&cli->cl_loi_list_lock);
1530            cfs_list_del_init(&ocw.ocw_entry);
There were specific OSCs on a couple of the problem clients showing outstanding dirty pages for a significant amount of time:
seqlac2@root:pdsh -w vulcanio[42-43] 'grep -v 0 /proc/fs/lustre/osc/*/cur_dirty_bytes' | dshbak -c
----------------
vulcanio42
----------------
/proc/fs/lustre/osc/ls1-OST0026-osc-c0000003c1100000/cur_dirty_bytes:65536
/proc/fs/lustre/osc/ls1-OST0027-osc-c0000003c1100000/cur_dirty_bytes:65536
/proc/fs/lustre/osc/ls1-OST0028-osc-c0000003c1100000/cur_dirty_bytes:65536
/proc/fs/lustre/osc/ls1-OST002b-osc-c0000003c1100000/cur_dirty_bytes:65536
----------------
vulcanio43
----------------
/proc/fs/lustre/osc/ls1-OST0053-osc-c0000003e336b000/cur_dirty_bytes:65536
Manually dropping caches on the problem clients seems to have cleared things up:
# echo 3 > /proc/sys/vm/drop_caches
In case it is useful, attached is a dump of the lustre pages taken prior to dropping caches on one of the problem clients:
# cat /proc/fs/lustre/llite/ls1-c0000003e336b000/dump_page_cache > vulcanio43-dump_page_cache-`date +%s`
I also ran the same command after dropping caches, but the file was empty, so there's no reason to post it. |
| Comments |
| Comment by Prakash Surya (Inactive) [ 04/Jan/13 ] |
|
I should note we're running 2.3.57-2chaos on these clients. |
| Comment by Peter Jones [ 05/Jan/13 ] |
|
Niu, could you please look into this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 06/Jan/13 ] |
|
In osc_enter_cache():
rc = l_wait_event(ocw.ocw_waitq,
cfs_list_empty(&ocw.ocw_entry), &lwi);
I think we should take cl_loi_list_lock when checking cfs_list_empty(&ocw.ocw_entry); otherwise the wakeup could be missed. |
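A minimal sketch of that idea, using only the names that appear in the snippets above (the helper name ocw_granted() and the type name osc_cache_waiter are assumptions for illustration; this is not necessarily the patch that was pushed): the wait condition is evaluated under cl_loi_list_lock, so a wakeup in osc_wake_cache_waiters() cannot slip in between the list check and going to sleep.
/* Sketch only: evaluate the wakeup condition under the same lock that
 * the waker holds when it removes the waiter from the list. */
static int ocw_granted(struct client_obd *cli, struct osc_cache_waiter *ocw)
{
        int rc;

        client_obd_list_lock(&cli->cl_loi_list_lock);
        rc = cfs_list_empty(&ocw->ocw_entry);
        client_obd_list_unlock(&cli->cl_loi_list_lock);
        return rc;
}

        /* in osc_enter_cache(): */
        rc = l_wait_event(ocw.ocw_waitq, ocw_granted(cli, &ocw), &lwi);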
| Comment by Niu Yawei (Inactive) [ 06/Jan/13 ] |
| Comment by Christopher Morrone [ 07/Jan/13 ] |
|
Did you do any sanity testing on 4963? We have testing time tomorrow on Sequoia, and I'm willing to pull it in if you have done some minimal testing. Otherwise I would wait for it to run through autotest and maybe test it on Thursday. |
| Comment by Jinshan Xiong (Inactive) [ 07/Jan/13 ] |
|
Apparently the problem is that a dirty page couldn't be written out, so there is no I/O RPC between the client and the OST, which caused the writing process to keep waiting for grant. I don't know why the dirty page was not written back. What are the values of /proc/sys/vm/dirty_XXX on the client node? |
| Comment by Niu Yawei (Inactive) [ 07/Jan/13 ] |
|
Chris, please hold on. On second thought, I realized that 4963 might not be the real fix. Thanks. |
| Comment by Niu Yawei (Inactive) [ 09/Jan/13 ] |
|
I think the wakeup condition in osc_enter_cache() is problematic: it only checks whether the ocw has been removed from the waiting list; however, the wakeup function osc_wake_cache_waiters() may return without removing the ocw if the global obd_dirty_pages is too high. I think we should change the wakeup condition back to: list_empty(&ocw->ocw_entry) || rpcs_in_flight(cli) == 0. Then, if obd_dirty_pages is too high and the dirty pages of this OSC have already been flushed, osc_enter_cache() will not go to sleep. |
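For reference, this is what the proposed condition would look like dropped into the existing l_wait_event() call (a sketch of the suggestion above, not the final patch):
        rc = l_wait_event(ocw.ocw_waitq,
                          cfs_list_empty(&ocw.ocw_entry) ||
                          rpcs_in_flight(cli) == 0, &lwi);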
| Comment by Jinshan Xiong (Inactive) [ 10/Jan/13 ] |
|
Not sure I understand; if there are already too many dirty pages, it doesn't help anyway. |
| Comment by Niu Yawei (Inactive) [ 10/Jan/13 ] |
If there are too many dirty pages for the whole client, not just this specific OSC (i.e. obd_dirty_pages > obd_max_dirty_pages), osc_enter_cache() should return -EDQUOT to make this OSC start sync writes; however, with the current code, osc_enter_cache() will go to sleep (see osc_wake_cache_waiters(): there is no wakeup if obd_dirty_pages is too high). |
| Comment by Prakash Surya (Inactive) [ 10/Jan/13 ] |
I don't see how the updated patch will cause osc_enter_cache() to return -EDQUOT in this case. If obd_dirty_pages > obd_max_dirty_pages then ocw.ocw_rc will not get set to -EDQUOT, right? So then, how would osc_enter_cache() return -EDQUOT? |
| Comment by Niu Yawei (Inactive) [ 10/Jan/13 ] |
I think you are right, we shouldn't override rc by ocw_rc in such case. |
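Putting the last few comments together, a hypothetical sketch of the direction being discussed (the expressions are taken from the comments above; this is illustrative only and not the code that eventually landed): after the wait, rc from l_wait_event() is not overridden by ocw_rc, and the client-wide dirty limit is checked explicitly so the caller can fall back to sync writes.
        client_obd_list_lock(&cli->cl_loi_list_lock);
        cfs_list_del_init(&ocw.ocw_entry);
        if (rc == 0)
                rc = ocw.ocw_rc;        /* set on the wakeup path */
        if (rc == 0 && obd_dirty_pages > obd_max_dirty_pages)
                rc = -EDQUOT;           /* make this OSC switch to sync writes */
        client_obd_list_unlock(&cli->cl_loi_list_lock);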
| Comment by Prakash Surya (Inactive) [ 16/Jan/13 ] |
|
Revision 5 looks like it might fix the issue. I'll pull it in to our branch for testing once it goes through Maloo without issues. |
| Comment by Prakash Surya (Inactive) [ 16/Jan/13 ] |
I haven't had access to a system hung in this state, but here are the values from one of the nodes:
$ grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500 |
| Comment by Niu Yawei (Inactive) [ 21/Jan/13 ] |
|
It seems patchset 5 reveals another problem: in osc_enter_cache(), it is going to wait for grant when cli->cl_dirty > 0; however, osc_io_unplug() doesn't trigger any IO because oo_nr_writes == 0. I think something is wrong in the new extent code. Jinshan, any idea? It can be reproduced by replay-single test_88. |
| Comment by Jinshan Xiong (Inactive) [ 21/Jan/13 ] |
This is possible. If an extent is in OES_ACTIVE state, cl_dirty will be greater than 0 but oo_nr_writes can be 0. However, when the IO is finished, osc_extent_release() will detect that someone is waiting for grant and then unplug the queue immediately. |
| Comment by Niu Yawei (Inactive) [ 22/Jan/13 ] |
|
I see the problem. In osc_enter_cache():
while (cli->cl_dirty > 0 || cli->cl_w_in_flight > 0) {
        osc_io_unplug();
        wait_for_grant();
}
Actually we shouldn't use 'while' here, because osc_io_unplug() can't trigger a dirty flush for an object which isn't on the cl_loi_ready_list or cl_loi_hp_ready_list (such objects do have a few dirty pages, but it is not yet optimal to write them out). In that case, I think we should simply return -EDQUOT rather than wait in the loop until the dirty pages are flushed. |
| Comment by Jinshan Xiong (Inactive) [ 22/Jan/13 ] |
|
I don't know what the root cause is, but apparently the above comment is incorrect, because the queue will be unplugged no matter how many dirty pages there are, as long as there are cache waiters. |
| Comment by Niu Yawei (Inactive) [ 22/Jan/13 ] |
|
Xiong, indeed, my previous comment was wrong. The real problem is that we no longer have an API to trigger a dirty flush on all objects. We used osc_check_rpcs() for this purpose before, but it is now wrapped in osc_io_unplug0(), and unfortunately what osc_io_unplug0() does is: if the current object needs to be flushed, flush all dirty objects; otherwise, do nothing. So if the current object in osc_enter_cache() has no dirty pages yet (this is the first page being put into the cache), the dirty flush will not be triggered by osc_io_unplug(). |
| Comment by Niu Yawei (Inactive) [ 22/Jan/13 ] |
|
I realized that passing NULL to osc_io_unplug() can trigger a dirty flush on all objects; I will cook a new fix. |
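A sketch of what that would look like at the call site (the parameter names env and cli are assumptions here, and the exact parameter list of osc_io_unplug() may differ in the tree this ticket targets):
        /* kick writeback on all objects of this client, not just the object
         * the current page belongs to: NULL means "all objects" */
        osc_io_unplug(env, cli, NULL);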
| Comment by Niu Yawei (Inactive) [ 31/Jan/13 ] |
|
Prakash, does the patch fix your problem? Thanks. |
| Comment by Christopher Morrone [ 31/Jan/13 ] |
|
We don't know yet. Other bugs keep preventing reasonable testing of the patch. |
| Comment by Christopher Morrone [ 01/Feb/13 ] |
|
Testing for this issue went well yesterday. We'll see how the patch holds up to real users over the next week, but I am cautiously optimistic that the problem is fixed. |
| Comment by Jodi Levi (Inactive) [ 15/Feb/13 ] |
|
Has the testing continued to be successful, so that we can close this ticket? |
| Comment by Christopher Morrone [ 15/Feb/13 ] |
|
Yes, so far no repeats of this bug. |
| Comment by Niu Yawei (Inactive) [ 15/Feb/13 ] |
|
patch landed for 2.4 |
| Comment by Lukasz Flis [ 14/Sep/13 ] |
|
Dear All,
Has this patch (#7) really landed for 2.4 and 2.4.1-RC2? We have observed the issue on 2.4.0 and 2.4.1-RC2, so it looks like the problem is still there. Here's our process stack:
Unfortunately dropping caches doesn't unblock the process.
[root@n0269-g6l ~]# echo 0 > /proc/fs/lustre/osc/scratch-OST0003-osc-ffff8802126ba400/active
Could you please have a look? In case more debugging info is needed, please let us know. We still keep some
Best Regards |
| Comment by Jinshan Xiong (Inactive) [ 14/Sep/13 ] |
|
Hi Lukasz, Do you have http://review.whamcloud.com/6554 in your tree? It looks like we need to cherry pick this patch into b2_4 as well. |
| Comment by Jinshan Xiong (Inactive) [ 14/Sep/13 ] |
|
patch for b2_4 is at: http://review.whamcloud.com/7657 |
| Comment by Lukasz Flis [ 14/Sep/13 ] |
|
Hi Jinshan,
Thank you very much for pointing this one out and for the patch for 2.4.
I see that 2.4.1 is ready for release (packages are ready and published) - maybe it's worth
Thank you very much for the quick patch. I'll try to give feedback asap – |