[LU-2576] Hangs in osc_enter_cache due to dirty pages not being flushed Created: 04/Jan/13  Updated: 14/Sep/13  Resolved: 15/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: MB, ptr, sequoia, topsequoia

Attachments: Text File vulcanio43-dump_page_cache-1357340735    
Issue Links:
Related
is related to LU-2139 Tracking unstable pages Resolved
is related to LU-3261 Threads stuck in osc_enter_cache when... Resolved
is related to LU-3416 Hangs on write in osc_enter_cache() Resolved
is related to LU-3277 LU-2139 may cause the performance reg... Resolved
Sub-Tasks:
LU-2582 - Review and test patch for LU-2576 (Technical task, Resolved, Assignee: Prakash Surya)
Severity: 3
Rank (Obsolete): 6013

 Description   

We've had reports of IO writes hanging on Sequoia, and after some initial debugging we have narrowed the hung threads down to the following stack trace:

 sysiod        S 00000fffa842633c     0 27910   3135 0x00000000                    
 Call Trace:                                                                       
 [c0000003ee34e480] [c0000003ee34e590] 0xc0000003ee34e590 (unreliable)             
 [c0000003ee34e650] [c000000000009b8c] .__switch_to+0xc4/0x100                     
 [c0000003ee34e6e0] [c0000000004364c8] .schedule+0x858/0x9c0                       
 [c0000003ee34e990] [8000000000a720c0] .cfs_waitq_wait+0x10/0x30 [libcfs]          
 [c0000003ee34ea00] [8000000004533560] .osc_enter_cache+0x880/0x12c0 [osc]         
 [c0000003ee34ebd0] [800000000453b210] .osc_queue_async_io+0xd10/0x1a40 [osc]   
 [c0000003ee34edf0] [8000000004516fe8] .osc_page_cache_add+0xf8/0x2a0 [osc]        
 [c0000003ee34eeb0] [80000000024468b8] .cl_page_cache_add+0xe8/0x3b0 [obdclass] 
 [c0000003ee34efe0] [8000000004facae8] .lov_page_cache_add+0xc8/0x340 [lov]        
 [c0000003ee34f0b0] [80000000024468b8] .cl_page_cache_add+0xe8/0x3b0 [obdclass] 
 [c0000003ee34f1e0] [80000000067ecda4] .vvp_io_commit_write+0x474/0x8a0 [lustre]
 [c0000003ee34f300] [800000000246098c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
 [c0000003ee34f3c0] [80000000067af410] .ll_commit_write+0x120/0x3e0 [lustre]       
 [c0000003ee34f490] [80000000067d1634] .ll_write_end+0x34/0x80 [lustre]            
 [c0000003ee34f520] [c000000000098bac] .generic_file_buffered_write+0x1ec/0x374 
 [c0000003ee34f660] [c000000000099290] .__generic_file_aio_write+0x374/0x3d8       
 [c0000003ee34f760] [c00000000009936c] .generic_file_aio_write+0x78/0xe8           
 [c0000003ee34f810] [80000000067eff9c] .vvp_io_write_start+0xfc/0x3e0 [lustre]  
 [c0000003ee34f8e0] [800000000245aedc] .cl_io_start+0xcc/0x220 [obdclass]          
 [c0000003ee34f980] [8000000002462cf4] .cl_io_loop+0x194/0x2c0 [obdclass]          
 [c0000003ee34fa30] [800000000676a278] .ll_file_io_generic+0x498/0x670 [lustre] 
 [c0000003ee34fb30] [800000000676a8d4] .ll_file_aio_write+0x1d4/0x3a0 [lustre]  
 [c0000003ee34fc00] [800000000676abf0] .ll_file_write+0x150/0x320 [lustre]         
 [c0000003ee34fce0] [c0000000000d5968] .vfs_write+0xd0/0x1c4                       
 [c0000003ee34fd80] [c0000000000d5b58] .SyS_write+0x54/0x98                        
 [c0000003ee34fe30] [c000000000000580] syscall_exit+0x0/0x2c

The specific line it's stuck at is here:

(gdb) l *osc_enter_cache+0x880
0x47f50 is in osc_enter_cache (/builddir/build/BUILD/lustre-2.3.56/lustre/osc/osc_cache.c:1526).
1521    /builddir/build/BUILD/lustre-2.3.56/lustre/osc/osc_cache.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.56/lustre/osc/osc_cache.c
1523                 CDEBUG(D_CACHE, "%s: sleeping for cache space @ %p for %p\n",   
1524                        cli->cl_import->imp_obd->obd_name, &ocw, oap);           
1525                                                                                 
1526                 rc = l_wait_event(ocw.ocw_waitq,                                
1527                                   cfs_list_empty(&ocw.ocw_entry), &lwi);        
1528                                                                                 
1529                 client_obd_list_lock(&cli->cl_loi_list_lock);                   
1530                 cfs_list_del_init(&ocw.ocw_entry); 

There were specific OSCs on a couple of the problem clients showing outstanding dirty pages for a significant amount of time:

seqlac2@root:pdsh -w vulcanio[42-43] 'grep -v 0 /proc/fs/lustre/osc/*/cur_dirty_bytes' | dshbak -c
----------------                                                                
vulcanio42                                                                      
----------------                                                                
/proc/fs/lustre/osc/ls1-OST0026-osc-c0000003c1100000/cur_dirty_bytes:65536         
/proc/fs/lustre/osc/ls1-OST0027-osc-c0000003c1100000/cur_dirty_bytes:65536         
/proc/fs/lustre/osc/ls1-OST0028-osc-c0000003c1100000/cur_dirty_bytes:65536         
/proc/fs/lustre/osc/ls1-OST002b-osc-c0000003c1100000/cur_dirty_bytes:65536         
----------------                                                                
vulcanio43                                                                      
----------------                                                                
/proc/fs/lustre/osc/ls1-OST0053-osc-c0000003e336b000/cur_dirty_bytes:65536

Manually dropping caches on the problem clients seems to have cleared things up:

# echo 3 > /proc/sys/vm/drop_caches

In case it is useful, attached is a dump of the lustre pages taken prior to dropping caches on one of the problem clients:

# cat /proc/fs/lustre/llite/ls1-c0000003e336b000/dump_page_cache > vulcanio43-dump_page_cache-`date +%s`

I also ran the same command after dropping caches, but the file was empty, so there's no reason to post it.



 Comments   
Comment by Prakash Surya (Inactive) [ 04/Jan/13 ]

I should note we're running 2.3.57-2chaos on these clients.

Comment by Peter Jones [ 05/Jan/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 06/Jan/13 ]

In osc_enter_cache():

                rc = l_wait_event(ocw.ocw_waitq,
                                  cfs_list_empty(&ocw.ocw_entry), &lwi);

I think we should take cl_loi_list_lock when checking cfs_list_empty(&ocw.ocw_entry); otherwise the wakeup could be missed.
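
For illustration only, a minimal sketch of that idea (the ocw_granted() helper is hypothetical, not the actual patch), assuming the existing client_obd_list_lock()/client_obd_list_unlock() primitives:

 /* check the waiter entry under cl_loi_list_lock so the test cannot
  * race with osc_wake_cache_waiters() removing us from the list */
 static int ocw_granted(struct client_obd *cli, struct osc_cache_waiter *ocw)
 {
         int rc;

         client_obd_list_lock(&cli->cl_loi_list_lock);
         rc = cfs_list_empty(&ocw->ocw_entry);
         client_obd_list_unlock(&cli->cl_loi_list_lock);
         return rc;
 }

 rc = l_wait_event(ocw.ocw_waitq, ocw_granted(cli, &ocw), &lwi);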

Comment by Niu Yawei (Inactive) [ 06/Jan/13 ]

http://review.whamcloud.com/4963

Comment by Christopher Morrone [ 07/Jan/13 ]

Did you do any sanity testing on 4963? We have testing time tomorrow on Sequoia, and I'm willing to pull it in if you have done some minimal testing. Otherwise I would wait for it to run through autotest and maybe test it on Thursday.

Comment by Jinshan Xiong (Inactive) [ 07/Jan/13 ]

Apparently the problem is that a dirty page couldn't be written out, so there was no I/O RPC between the client and the OST, which left the writing process waiting for grants.

I don't know why the dirty page was not written back. What are the values of /proc/sys/vm/dirty_XXX on the client node?

Comment by Niu Yawei (Inactive) [ 07/Jan/13 ]

Chris, please hold on. On second thought, I realized that 4963 might not be the real fix. Thanks.

Comment by Niu Yawei (Inactive) [ 09/Jan/13 ]

I think the wakeup condition in osc_enter_cache() is problematic: it checks only whether the ocw has been removed from the waiting list; however, the wakeup function osc_wake_cache_waiters() can return without removing the ocw if the global obd_dirty_pages is too high.

I think we should change the wakeup condition back to what it was before: list_empty(&ocw->ocw_entry) || rpcs_in_flight(cli) == 0. Then, if obd_dirty_pages is too high and the dirty pages of this OSC have already been flushed, osc_enter_cache() will not go to sleep.
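
As a rough sketch of that condition (not the final patch):

 /* also stop waiting once this OSC has no RPCs in flight, so the waiter
  * is not stranded when osc_wake_cache_waiters() skips it because the
  * global obd_dirty_pages limit is exceeded */
 rc = l_wait_event(ocw.ocw_waitq,
                   cfs_list_empty(&ocw.ocw_entry) ||
                   rpcs_in_flight(cli) == 0, &lwi);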

Comment by Jinshan Xiong (Inactive) [ 10/Jan/13 ]

I'm not sure I understand; if there are already too many dirty pages, it doesn't help anyway.

Comment by Niu Yawei (Inactive) [ 10/Jan/13 ]

I'm not sure I understand; if there are already too many dirty pages, it doesn't help anyway.

If there are too many dirty pages for the whole client but not for this specific OSC (i.e. obd_dirty_pages > obd_max_dirty_pages), osc_enter_cache() should return -EDQUOT to make this OSC fall back to sync writes; however, with the current code, osc_enter_cache() will go to sleep (see osc_wake_cache_waiters(): there is no wakeup if obd_dirty_pages is too high).
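
A sketch of the intended -EDQUOT path (the placement and exact check are assumptions here, not the landed patch):

 /* client-wide dirty limit exceeded: give up waiting and let the caller
  * fall back to synchronous writes instead of sleeping indefinitely */
 if (cfs_atomic_read(&obd_dirty_pages) + 1 > obd_max_dirty_pages)
         return -EDQUOT;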

Comment by Prakash Surya (Inactive) [ 10/Jan/13 ]

If there are too many dirty pages for the whole client but not for this specific OSC (i.e. obd_dirty_pages > obd_max_dirty_pages), osc_enter_cache() should return -EDQUOT to make this OSC fall back to sync writes

I don't see how the updated patch will cause osc_enter_cache() to return -EDQUOT in this case. If obd_dirty_pages > obd_max_dirty_pages then ocw.ocw_rc will not get set to -EDQUOT, right? So then, how would osc_enter_cache() return -EDQUOT?

Comment by Niu Yawei (Inactive) [ 10/Jan/13 ]

I don't see how the updated patch will cause osc_enter_cache() to return -EDQUOT in this case. If obd_dirty_pages > obd_max_dirty_pages then ocw.ocw_rc will not get set to -EDQUOT, right? So then, how would osc_enter_cache() return -EDQUOT?

I think you are right; we shouldn't override rc with ocw_rc in that case.
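
One way this might look (a sketch only, assuming rc already holds the earlier result such as -EDQUOT):

 /* only take ocw_rc when the wakeup path actually handled this waiter,
  * i.e. removed it from the waiting list; otherwise keep the rc we
  * already have */
 if (cfs_list_empty(&ocw.ocw_entry))
         rc = ocw.ocw_rc;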

Comment by Prakash Surya (Inactive) [ 16/Jan/13 ]

Revision 5 looks like it might fix the issue. I'll pull it into our branch for testing once it goes through Maloo without issues.

Comment by Prakash Surya (Inactive) [ 16/Jan/13 ]

I don't know why the dirty page was not written back. What are the value of /proc/sys/vm/dirty_XXX on the client node?

I haven't had access to a system hung in this state, but here are the values from one of the nodes:

$ grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

Comment by Niu Yawei (Inactive) [ 21/Jan/13 ]

It seems patchset 5 reveals another problem:

In osc_enter_cache(), it waits for grant when cli->cl_dirty > 0; however, osc_io_unplug() doesn't trigger any IO because oo_nr_writes == 0. I think something is wrong in the new extent code. Jinshan, any idea?

It can be reproduced by replay-single test_88.

Comment by Jinshan Xiong (Inactive) [ 21/Jan/13 ]

In osc_enter_cache(), it waits for grant when cli->cl_dirty > 0; however, osc_io_unplug() doesn't trigger any IO because oo_nr_writes == 0. I think something is wrong in the new extent code. Jinshan, any idea?

This is possible. If an extent is in OES_ACTIVE state, cl_dirty will be greater than 0 but oo_nr_writes can be 0. However, when the IO is finished, osc_extent_release() will detect that someone is waiting for grant and then unplug the queue immediately.

Comment by Niu Yawei (Inactive) [ 22/Jan/13 ]

I see the problem, in osc_enter_cache():

 while (cli->cl_dirty > 0 || cli->cl_w_in_flight > 0) {
   osc_io_unplug();
   wait_for_grant();
 }

Actually we shouldn't use 'while' here, because osc_io_unplug() can't trigger a dirty flush for an object that isn't on the cl_loi_ready_list or cl_loi_hp_ready_list (such objects do have a few dirty pages, but not enough to be worth writing out yet). In that case, I think we should simply return -EDQUOT rather than wait in the loop until the dirty pages are flushed.

Comment by Jinshan Xiong (Inactive) [ 22/Jan/13 ]

I don't know what the root cause is, but apparently the above comment is incorrect, because the queue will be unplugged no matter how many dirty pages there are if there are cache waiters.

Comment by Niu Yawei (Inactive) [ 22/Jan/13 ]

Xiong, indeed, my previous comment was wrong. The real problem is that we no longer have an API to trigger a dirty flush on all objects. We used osc_check_rpcs() for this purpose before, but now it's wrapped in osc_io_unplug0(), and unfortunately what osc_io_unplug0() does is: if the current object needs to be flushed, flush all dirty objects; otherwise, do nothing.

So if the current object in osc_enter_cache() has no dirty pages (its first page is only about to be put in the cache), a dirty flush will not be triggered by osc_io_unplug().

Comment by Niu Yawei (Inactive) [ 22/Jan/13 ]

I realized that passing NULL to osc_io_unplug() can trigger a dirty flush on all objects; I will cook up a new fix.
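
A sketch of that idea (the osc_io_unplug() signature shown here is an assumption for illustration, not the landed patch):

 /* NULL object: consider every object of this client_obd, so dirty pages
  * on other objects get flushed and grant/dirty space is freed even when
  * the current object has nothing to write yet */
 osc_io_unplug(env, cli, NULL, PDL_POLICY_ROUND);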

Comment by Niu Yawei (Inactive) [ 31/Jan/13 ]

Prakash, does the patch fix your problem? Thanks.

Comment by Christopher Morrone [ 31/Jan/13 ]

We don't know yet. Other bugs keep preventing reasonable testing of the LU-2576 patch. I see it landed to master. My hope is to get a good test run in today and leave a version of lustre with the LU-2576 patch installed for users to try and see if things improve.

Comment by Christopher Morrone [ 01/Feb/13 ]

Testing for this issue went well yesterday. We'll see how the patch holds up to real users over the next week, but I am cautiously optimistic that the problem is fixed.

Comment by Jodi Levi (Inactive) [ 15/Feb/13 ]

Has the testing continued to be successful, so that we can close this ticket?

Comment by Christopher Morrone [ 15/Feb/13 ]

Yes, so far no repeats of this bug.

Comment by Niu Yawei (Inactive) [ 15/Feb/13 ]

Patch landed for 2.4.

Comment by Lukasz Flis [ 14/Sep/13 ]

Dear All,

Has this patch (#7) really landed in 2.4 and 2.4.1-RC2?
http://review.whamcloud.com/#/c/4963/

We have observed the issue on 2.4.0 and 2.4.1-RC2, so it looks like the problem is still there.

Here's our process stack:
Sep 14 13:40:12 n0269-g6l kernel: ost_check S 0000000000000005 0 10821 1 0x00000080
Sep 14 13:40:12 n0269-g6l kernel: ffff880127051548 0000000000000086 0000000000000000 0000000000000000
Sep 14 13:40:12 n0269-g6l kernel: 0000000000000000 0000000000000082 ffff880127051528 ffffffff8106327e
Sep 14 13:40:12 n0269-g6l kernel: ffff88020d5a3af8 ffff880127051fd8 000000000000fb88 ffff88020d5a3af8
Sep 14 13:40:12 n0269-g6l kernel: Call Trace:
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8106327e>] ? try_to_wake_up+0x24e/0x3e0
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa03236fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa08d09ea>] osc_enter_cache+0x85a/0xb40 [osc]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0338917>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa08d7fd1>] osc_queue_async_io+0x1081/0x1c18 [osc]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8121ba89>] ? security_capable+0x29/0x30
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8107baba>] ? capable+0x2a/0x60
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa04841c5>] ? cl_page_slice_add+0x55/0x140 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa08bd2ec>] ? osc_page_init+0xec/0x890 [osc]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa08be2e9>] osc_page_cache_add+0xc9/0x1d0 [osc]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa04848af>] cl_page_cache_add+0x7f/0x2a0 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa094e335>] lov_page_cache_add+0x85/0x200 [lov]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa04848af>] cl_page_cache_add+0x7f/0x2a0 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa04889fc>] ? cl_page_find0+0x44c/0x810 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0481439>] ? cl_env_hops_keycmp+0x19/0x70 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa03389bd>] ? cfs_hash_bd_lookup_intent+0xdd/0x130 [libcfs]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0a26225>] vvp_io_commit_write+0x3e5/0x5b0 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0493cbd>] cl_io_commit_write+0xad/0x1d0 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa09fb37e>] ll_commit_write+0xee/0x320 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0a13650>] ll_write_end+0x30/0x60 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8111a81a>] generic_file_buffered_write+0x18a/0x2e0
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff81075887>] ? current_fs_time+0x27/0x30
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8111c210>] __generic_file_aio_write+0x260/0x490
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8111c4c8>] generic_file_aio_write+0x88/0x100
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0a283db>] vvp_io_write_start+0xcb/0x2e0 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa049072a>] cl_io_start+0x6a/0x140 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa0494e64>] cl_io_loop+0xb4/0x1b0 [obdclass]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa09ce9e0>] ll_file_io_generic+0x450/0x600 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa09cf922>] ll_file_aio_write+0x142/0x2c0 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffffa09cfc0c>] ll_file_write+0x16c/0x2a0 [lustre]
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff81181368>] vfs_write+0xb8/0x1a0
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff81181c61>] sys_write+0x51/0x90
Sep 14 13:40:12 n0269-g6l kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Unfortunately, dropping caches doesn't unblock the process.
The only way to get OST communication working again is to deactivate and reactivate the OSC:

[root@n0269-g6l ~]# echo 0 > /proc/fs/lustre/osc/scratch-OST0003-osc-ffff8802126ba400/active
[root@n0269-g6l ~]# echo 1 > /proc/fs/lustre/osc/scratch-OST0003-osc-ffff8802126ba400/active

Could you please have a look?

In case more debugging info is needed, please let us know. We are still keeping some
clients in the hung state for debugging purposes.

Best Regards

Lukasz Flis
ACC Cyfronet

Comment by Jinshan Xiong (Inactive) [ 14/Sep/13 ]

Hi Lukasz,

Do you have http://review.whamcloud.com/6554 in your tree? It looks like we need to cherry pick this patch into b2_4 as well.

Comment by Jinshan Xiong (Inactive) [ 14/Sep/13 ]

The patch for b2_4 is at: http://review.whamcloud.com/7657

Comment by Lukasz Flis [ 14/Sep/13 ]

Hi Jinshan,

Thank you very much for pointing this one out and for the patch for 2.4;
we'll give it a try ASAP. There was no sign of this change in 2.4.0 or 2.4.1-RC2.

I see that 2.4.1 is ready for release (packages are built and published) - maybe it's worth
considering holding (or updating) the release a bit, as this bug can manifest itself within a few hours
(I have the impression that OSS recoveries raise the chance of hitting it).

Thank you very much for the quick patch. I'll try to give feedback ASAP.


Lukasz Flis
