[LU-5138] hang in osc_lru_reserve despite recoverable state Created: 02/Jun/14  Updated: 17/Nov/14  Resolved: 17/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephen Champion Assignee: Peter Jones
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

patch for LU-4856 applied, but without a patch for LU-5123


Attachments: Text File debug.txt     Text File notes.txt    
Severity: 3
Rank (Obsolete): 14168

 Description   

This process has been waiting in osc_lru_reserve for a very long time:

PID: 22025 TASK: ffff88017aeba480 CPU: 5 COMMAND: "reads"
#0 [ffff88017b2ff838] schedule at ffffffff8145ec7b
#1 [ffff88017b2ff980] osc_lru_reserve at ffffffffa0e16ee5 [osc]
#2 [ffff88017b2ffa00] osc_page_init at ffffffffa0e1710d [osc]
#3 [ffff88017b2ffa40] lov_page_init_raid0 at ffffffffa0ea48b0 [lov]
#4 [ffff88017b2ffaa0] cl_page_alloc at ffffffffa0aae632 [obdclass]
#5 [ffff88017b2ffae0] cl_page_find at ffffffffa0aae91b [obdclass]
#6 [ffff88017b2ffb30] ll_write_begin at ffffffffa0f96f8d [lustre]
#7 [ffff88017b2ffb90] generic_perform_write at ffffffff810f8242
#8 [ffff88017b2ffc10] generic_file_buffered_write at ffffffff810f83a1
#9 [ffff88017b2ffc60] __generic_file_aio_write at ffffffff810fb336
#10 [ffff88017b2ffd10] generic_file_aio_write at ffffffff810fb57c
#11 [ffff88017b2ffd50] vvp_io_write_start at ffffffffa0faae48 [lustre]
#12 [ffff88017b2ffda0] cl_io_start at ffffffffa0ab65f9 [obdclass]
#13 [ffff88017b2ffdd0] cl_io_loop at ffffffffa0aba123 [obdclass]
#14 [ffff88017b2ffe00] ll_file_io_generic at ffffffffa0f46af1 [lustre]
#15 [ffff88017b2ffe70] ll_file_aio_write at ffffffffa0f47037 [lustre]
#16 [ffff88017b2ffec0] ll_file_write at ffffffffa0f47a00 [lustre]
#17 [ffff88017b2fff10] vfs_write at ffffffff8115aeae
#18 [ffff88017b2fff40] sys_write at ffffffff8115b023
#19 [ffff88017b2fff80] system_call_fastpath at ffffffff81468d92
RIP: 00002aaaaad99630 RSP: 00007fffffffc568 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffffffff81468d92 RCX: 00007fffffffc510
RDX: 0000000000010000 RSI: 0000000000603040 RDI: 0000000000000003
RBP: 0000000000010000 R8: 0000000000000000 R9: 0101010101010101
R10: 00007fffffffc3b0 R11: 0000000000000246 R12: 0000000000010000
R13: 0000000000000001 R14: 00000000063b0000 R15: 00000000063c0000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

While testing for LU-4856, the bug described in LU-5123 caused sanity 101a to run with ccc_lru_max = 32 (pages). I have not tried it, but it should be possible to reproduce this on master by modifying 101a to set max_dirty_mb to 128k.

This is a pathological condition, but I think it exposed a real bug. Namely, the wakeup from the sleep in osc_lru_reserve appears to be incidental - caused by another process that just happens to do something that triggers an osc_lru_shrink - rather than a deliberate wakeup issued once the condition that forced the sleep can actually be resolved.
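To make the concern concrete, below is a minimal sketch of the reserve/wait pattern in question. It is a simplification, not the actual osc_lru_reserve() code: the names cl_lru_left, osc_lru_reclaim and osc_lru_waitq come from the attached log and the surrounding OSC LRU code, but the loop structure here is only illustrative.

/*
 * Illustrative sketch only - not the real osc_lru_reserve().  The waiter
 * sleeps until cl_lru_left becomes positive, but nothing on the page-put
 * side guarantees that a shrink runs to make that happen; the wakeup
 * depends on some other thread happening to trigger osc_lru_shrink().
 */
static int osc_lru_reserve_sketch(struct client_obd *cli)
{
	while (atomic_add_unless(&cli->cl_lru_left, -1, 0) == 0) {
		/* first try to reclaim slots from this or a sibling OSC */
		if (osc_lru_reclaim(cli) > 0)
			continue;

		/* otherwise sleep until cl_lru_left goes positive again;
		 * this is the sleep seen in the stack trace above */
		wait_event(osc_lru_waitq,
			   atomic_read(&cli->cl_lru_left) > 0);
	}
	return 0;
}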

I have a core and debug log from a system in this state, and will attach the debug log, and paste my notes in a comment.



 Comments   
Comment by Stephen Champion [ 02/Jun/14 ]

The last we hear from the hung process is that it has no free pages; it goes to sleep immediately afterwards. At this point in time there are no free slots, and all LRU items are busy:

00000008:00000020:5.0:1400133972.928506:0:22025:0:(osc_page.c:833:osc_lru_reclaim()) accfs1-OST0002-osc-ffff88022017e000: cli ffff880431ea5538 no free slots, pages: 0, busy: 30.
00000008:00000020:5.0:1400133972.928507:0:22025:0:(osc_page.c:849:osc_lru_reclaim()) accfs1-OST0003-osc-ffff88022017e000: cli ffff8803653cd578 LRU pages: 0, busy: 0.
00000008:00000020:5.0:1400133972.928509:0:22025:0:(osc_page.c:849:osc_lru_reclaim()) accfs1-OST0001-osc-ffff88022017e000: cli ffff8804316575f8 LRU pages: 0, busy: 2.
00000008:00000020:5.0:1400133972.928510:0:22025:0:(osc_page.c:849:osc_lru_reclaim()) accfs1-OST0000-osc-ffff88022017e000: cli ffff880430d09538 LRU pages: 0, busy: 0.
00000020:00001000:5.0:1400133972.928511:0:22025:0:(cl_object.c:966:cl_env_put()) 1@ffff880179bdee58
00000008:00000020:5.0:1400133972.928513:0:22025:0:(osc_page.c:867:osc_lru_reclaim()) accfs1-OST0000-osc-ffff88022017e000: cli ffff880430d09538 freed 0 pages.

Then a sixteen-page extent is completed.

00000008:00000020:12.0:1400134007.925975:0:19741:0:(osc_cache.c:810:osc_extent_finish()) extent ffff8803803091e0@{[0 -> 15/255], [2|0|-|rpc|wiuY|ffff880352ecfc48], [65536|16|+|-|ffff8803823a0660|256|ffff880166486640]} extent finished.

And the lru_queue_work is requested for ptlrpcd_0

00000100:00000040:12.0:1400134007.925983:0:19741:0:(ptlrpcd.c:271:ptlrpcd_add_req()) @@@ add req [ffff880347956000] to pc [ptlrpcd_0:0] req@ffff880347956000 x1468144665523584/t0(0) o-1->accfs1-OST0002-osc-ffff88022017e000@192.168.21.8@o2ib1:0/0 lens 0/0 e 0 to 0 dl 1400134027 ref 2 fl Interpret:N/ffffffff/ffffffff rc 0/-1

The req is processed

00000008:00000020:8.0:1400134007.926058:0:19727:0:(osc_page.c:568:lru_queue_work()) Run LRU work for client obd ffff880431ea5538.

It goes to shrink

00000008:00000001:8.0:1400134007.926061:0:19727:0:(osc_page.c:687:osc_lru_shrink()) Process entered

It leaves shrink with nothing

00000008:00000001:8.0:1400134007.926066:0:19727:0:(osc_page.c:801:osc_lru_shrink()) Process leaving (rc=0 : 0 : 0)

And finishes the req

00000008:00000001:8.0:1400134007.926068:0:19727:0:(osc_page.c:573:lru_queue_work()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:8.0:1400134007.926069:0:19727:0:(client.c:1888:ptlrpc_check_set()) Process leaving (rc=1 : 1 : 1)

It should have had 16 pages to shrink, if they were not referenced.

Currently, all of the pages have a ref count of 1, and would be shrunk. This may not have been the case the last time we called the shrink, though...

The last time we called shrink was
00000008:00000001:8.0:1400134007.926066:0:19727:0:(osc_page.c:801:osc_lru_shrink()) Process leaving (rc=0 : 0 : 0)

Since then, we've put 32 pages.

00000020:00000001:12.0:1400134007.926074:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff8803772e6800[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926075:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926103:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff8803772e6400[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926104:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926130:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff88036f5c8000[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926131:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926157:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff88036f5c8800[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926158:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926185:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff88036f5c8400[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926185:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926212:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff880377565c00[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926213:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926239:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff8803820d9800[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926239:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926266:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff88037856b000[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926267:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926293:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff880380d12400[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926294:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926320:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff8803811ef400[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926321:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926347:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff8803811ef000[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926348:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926374:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff88037f49f800[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926375:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926401:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff880362cf4000[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926402:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926428:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff880376c65c00[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926429:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926455:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff880386096000[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926456:0:19741:0:(cl_page.c:385:cl_page_put()) 2
00000020:00000001:12.0:1400134007.926482:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff880431d7e800[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]
00000020:00000001:12.0:1400134007.926483:0:19741:0:(cl_page.c:385:cl_page_put()) 2

And all of them had a refcount of 2 at the time, so they were unshrinkable at the last shrink. They are shrinkable now, but nobody has called shrink since the pages were put...
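For illustration, a shrink pass that skips busy pages would behave exactly this way - freeing nothing on the first run and never being asked again. The following is a rough sketch, not the real osc_lru_shrink(); the list/field names and the page_is_busy()/discard_lru_page() helpers are placeholders.

/*
 * Rough sketch only - not the actual osc_lru_shrink().  Pages that still
 * hold an extra reference at scan time are skipped; once they are put,
 * nothing re-runs the scan, so the waiter never sees a freed slot.
 */
static long osc_lru_shrink_sketch(struct client_obd *cli, long target)
{
	struct osc_page *opg, *tmp;
	long freed = 0;

	list_for_each_entry_safe(opg, tmp, &cli->cl_lru_list, ops_lru) {
		/* a refcount of 2 means the I/O path still holds the page */
		if (page_is_busy(opg))		/* placeholder check */
			continue;
		discard_lru_page(opg);		/* placeholder teardown */
		if (++freed >= target)
			break;
	}
	/* returning 0 here leaves the process asleep in osc_lru_reserve */
	return freed;
}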

So how did put get called?

Here is the last call exiting, followed by its call chain:

00000020:00000001:12.0:1400134007.926483:0:19741:0:(cl_page.c:400:cl_page_put()) Process leaving
00000020:00000001:12.0:1400134007.926484:0:19741:0:(cl_page.c:985:cl_page_completion()) Process leaving
00000008:00000001:12.0:1400134007.926484:0:19741:0:(osc_cache.c:1316:osc_completion()) Process leaving (rc=0 : 0 : 0)
00000008:00000001:12.0:1400134007.926486:0:19741:0:(osc_cache.c:1896:osc_ap_completion()) Process leaving
[ ... ]
00000008:00000001:12.0:1400134007.926499:0:19741:0:(osc_cache.c:851:osc_extent_finish()) Process leaving (rc=0 : 0 : 0)

So... osc_extent_finish() called osc_lru_add_batch(), which moves pages from the busy list to the regular list and then schedules a shrink - right off the bat, before the pages are put. Since then, the pages have been put and could be shrunk, but nobody has requested a shrink.

That seems wrong to me.
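Sketched out, the ordering being described looks roughly like this (an illustration only, not the actual osc_extent_finish(); extent_to_cli() is a placeholder and the field names are approximate):

/*
 * Illustrative only.  The LRU work is queued from osc_lru_add_batch()
 * while the completion path still holds a reference on every page, so
 * the shrink it triggers finds nothing to free.
 */
static void osc_extent_finish_sketch(struct osc_extent *ext)
{
	struct client_obd *cli = extent_to_cli(ext);	/* placeholder helper */

	/* moves pages from the busy list to the LRU list and, as it stands,
	 * also schedules the LRU work - but the pages are still referenced */
	osc_lru_add_batch(cli, &ext->oe_pages);

	/* ... only later does osc_ap_completion() -> cl_page_put() drop the
	 * references, after which the pages would be shrinkable - but no
	 * further shrink is requested ... */
}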

Comment by Stephen Champion [ 02/Jun/14 ]

Extra notes, with structure dumps, disassembly, etc.

Comment by Jinshan Xiong (Inactive) [ 02/Jun/14 ]

Indeed, we should schedule the shrink outside of osc_lru_add_batch(); it should be done in osc_extent_finish(), after osc_ap_completion() is called.

Can you please create a patch for this, Stephen?

Comment by Stephen Champion [ 02/Jun/14 ]

The other part is adding a test, which should just be a modified sanity 101a. I may be able to cook something up this week.

Do you have an opinion on whether the shrink should be inline or via a ptlrpcd request? I'm thinking this can be inline:

if osc_cache_too_much
	osc_lru_shrink

but there may be advantages to queuing a request instead:

if osc_cache_too_much
	ptlrpcd_queue_work cl_lru_work
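Spelled out a little, the two options might look like the sketch below. This is illustrative only: the osc_lru_shrink() argument list is an assumption, and cl_lru_work is the work item already used by the lru_queue_work path seen in the log.

/* Option 1: shrink synchronously at the call site (sketch). */
static void lru_shrink_inline(struct client_obd *cli, int target)
{
	if (osc_cache_too_much(cli))
		osc_lru_shrink(cli, target);	/* signature assumed */
}

/* Option 2: defer the shrink to ptlrpcd via the existing work item. */
static void lru_shrink_queued(struct client_obd *cli)
{
	if (osc_cache_too_much(cli))
		(void)ptlrpcd_queue_work(cli->cl_lru_work);
}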

Comment by Stephen Champion [ 14/Jun/14 ]

Just an update:

I wrote a test for this, which worked. It also failed (in a different way) with the patch I wrote.
The patch I wrote also failed on another test... so there's a good bit of work to do to understand these cases.

I intend to get back to this, but have some pressing customer issues to deal with.

Comment by John Fuchs-Chesney (Inactive) [ 18/Jul/14 ]

Hello Stephen,
Is this still a live issue for you?
If so, can you please give us an update on where you are with it.
Many thanks,
~ jfc.

Comment by Stephen Champion [ 18/Jul/14 ]

This is not really a problem for us - it's just a bug I stumbled across. I wrote a test that triggers another bug that I have not fully investigated, and a fix that crashes in a later acceptance test.

Our customers suffer from too much memory, not too little, so I am unlikely to be able to spend additional time on this. I'd be happy to pass my incomplete work along for someone to pick up.

Comment by John Fuchs-Chesney (Inactive) [ 25/Jul/14 ]

Thank you Stephen.
We'll leave the ticket open for the time being and please let us know if it rises in priority in the future.
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 17/Nov/14 ]

We can reopen this ticket if this issue becomes a priority.

~ jfc.
