[LU-5138] hang in osc_lru_reserve despite recoverable state Created: 02/Jun/14 Updated: 17/Nov/14 Resolved: 17/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephen Champion | Assignee: | Peter Jones |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: | |||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14168 |
| Description |
|
This process has been waiting in osc_lru_reserve for a very long time:

PID: 22025  TASK: ffff88017aeba480  CPU: 5  COMMAND: "reads"

While testing for

This is a pathological condition, but I think it exposed a real bug. Namely, it appears that the wakeup from the sleep in osc_lru_reserve can be incidental - caused by another process that just happens to do something that triggers an osc_lru_shrink - rather than a deliberate wakeup issued when the condition that caused the sleep can actually be addressed.

I have a core and a debug log from a system in this state. I will attach the debug log and paste my notes in a comment. |
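A minimal userspace sketch of the failure mode described above, assuming nothing about the real Lustre code - all of the names here (lru_reserve, lru_put, lru_shrink, the slot counters) are invented for illustration. The waiter sleeps until a shrink frees a slot, but dropping the last page reference does not itself trigger a shrink, so the wakeup only ever arrives as a side effect of unrelated activity:

#include <pthread.h>

/* Invented stand-ins for the LRU slot accounting; not Lustre structures. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slot_freed = PTHREAD_COND_INITIALIZER;
static int free_slots;        /* slots available to new pages */
static int reclaimable_slots; /* busy slots whose last reference was dropped */

/* Drop the last reference on a cached page.  The slot is now reclaimable,
 * but note that nothing here runs a shrink or wakes the waiter. */
static void lru_put(void)
{
        pthread_mutex_lock(&lock);
        reclaimable_slots++;
        pthread_mutex_unlock(&lock);
}

/* Reclaim whatever is reclaimable and wake any waiter.  In the ticket this
 * plays the role of osc_lru_shrink(), which only runs when something else
 * happens to schedule it. */
static void lru_shrink(void)
{
        pthread_mutex_lock(&lock);
        free_slots += reclaimable_slots;
        reclaimable_slots = 0;
        if (free_slots > 0)
                pthread_cond_broadcast(&slot_freed);
        pthread_mutex_unlock(&lock);
}

/* Reserve a slot, sleeping until one is free -- analogous to the sleep in
 * osc_lru_reserve().  If nobody calls lru_shrink() after the last lru_put(),
 * this sleeps forever even though reclaim has become possible. */
static void lru_reserve(void)
{
        pthread_mutex_lock(&lock);
        while (free_slots == 0)
                pthread_cond_wait(&slot_freed, &lock);
        free_slots--;
        pthread_mutex_unlock(&lock);
}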
| Comments |
| Comment by Stephen Champion [ 02/Jun/14 ] |
|
The last we hear from the hung process is that it has no free pages. It goes to sleep immediately afterwards. At this point in time, there are no free slots; all LRU items are busy:

00000008:00000020:5.0:1400133972.928506:0:22025:0:(osc_page.c:833:osc_lru_reclaim()) accfs1-OST0002-osc-ffff88022017e000: cli ffff880431ea5538 no free slots, pages: 0, busy: 30.

Then a sixteen-page extent is completed:

00000008:00000020:12.0:1400134007.925975:0:19741:0:(osc_cache.c:810:osc_extent_finish()) extent ffff8803803091e0@ {[0 -> 15/255], [2|0|-|rpc|wiuY|ffff880352ecfc48], [65536|16|+|-|ffff8803823a0660|256|ffff880166486640]}extent finished.

And the lru_queue work is requested for ptlrpcd_0:

00000100:00000040:12.0:1400134007.925983:0:19741:0:(ptlrpcd.c:271:ptlrpcd_add_req()) @@@ add req [ffff880347956000] to pc [ptlrpcd_0:0] req@ffff880347956000 x1468144665523584/t0(0) o-1->accfs1-OST0002-osc-ffff88022017e000@192.168.21.8@o2ib1:0/0 lens 0/0 e 0 to 0 dl 1400134027 ref 2 fl Interpret:N/ffffffff/ffffffff rc 0/-1

The req is processed:

00000008:00000020:8.0:1400134007.926058:0:19727:0:(osc_page.c:568:lru_queue_work()) Run LRU work for client obd ffff880431ea5538.

It goes to shrink:

00000008:00000001:8.0:1400134007.926061:0:19727:0:(osc_page.c:687:osc_lru_shrink()) Process entered

It leaves shrink with nothing:

00000008:00000001:8.0:1400134007.926066:0:19727:0:(osc_page.c:801:osc_lru_shrink()) Process leaving (rc=0 : 0 : 0)

And finishes the req:

00000008:00000001:8.0:1400134007.926068:0:19727:0:(osc_page.c:573:lru_queue_work()) Process leaving (rc=0 : 0 : 0)

It should have had 16 pages to shrink, if they are not referenced. Currently, all of the pages have a ref count of 1, and would be shrunk. This may not have been the case the last time we called the shrink, though... The last time we called shrink was

Since then, we've put 32 pages:

00000020:00000001:12.0:1400134007.926074:0:19741:0:(cl_page.c:385:cl_page_put()) page@ffff8803772e6800[2 ffff880381c2c0f0 0 0 1 (null) (null) 0x0]

And all of them had a refcnt of 2 at the time, so they were unshrinkable at the last shrink. They are now shrinkable, but nobody has called shrink since the pages were put... So how did put get called? Here is the last call exiting, followed by its call chain:

00000020:00000001:12.0:1400134007.926483:0:19741:0:(cl_page.c:400:cl_page_put()) Process leaving

So... osc_extent_finish() called osc_lru_add_batch(), which is responsible for moving pages from the busy list to the regular list - and then scheduled a shrink right off the bat. Since then, the pages have been put, and could be shrunk, but nobody has requested a shrink. That seems wrong to me. |
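In terms of the invented helpers sketched under the description, the ordering described in this comment looks roughly like the following (prototypes repeated so the fragment stands alone; this is an illustration, not the real osc_extent_finish()):

/* Invented helpers from the earlier sketch. */
void lru_put(void);
void lru_shrink(void);

/* Ordering implied by the log: the shrink is requested "right off the bat",
 * before the page references are dropped, so it finds nothing reclaimable,
 * and nothing requests another shrink once the puts land. */
void extent_finish_observed_order(int npages)
{
        lru_shrink();                  /* scheduled by the add_batch step: pages still referenced */
        for (int i = 0; i < npages; i++)
                lru_put();             /* refcounts drop only here */
        /* no further shrink is scheduled, so lru_reserve() keeps sleeping */
}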
| Comment by Stephen Champion [ 02/Jun/14 ] |
|
Extra notes, with structure dumps, disassembly, etc. |
| Comment by Jinshan Xiong (Inactive) [ 02/Jun/14 ] |
|
Indeed, we should schedule a shrink outside of osc_lru_add_batch(); it should be done in osc_extent_finish() after osc_ap_completion() is called. Can you please create a patch for this, Stephen? |
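In the same invented terms, the suggested change amounts to reversing that ordering - drop the references first, then schedule the shrink - so the shrink sees the now-reclaimable pages and wakes the waiter. A rough sketch only, not an actual patch against osc_extent_finish():

void lru_put(void);
void lru_shrink(void);

/* Put the pages first (the completion step), then schedule the shrink,
 * so the shrink observes the dropped references. */
void extent_finish_suggested_order(int npages)
{
        for (int i = 0; i < npages; i++)
                lru_put();
        lru_shrink();
}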
| Comment by Stephen Champion [ 02/Jun/14 ] |
|
The other part is adding a test, which should just be a modified sanity 101a. I may be able to cook something up this week.

Do you have an opinion on whether the shrink should be inline or via a ptlrpcd request? I'm thinking this can be inline:

if osc_cache_too_much

but there may be advantages to queuing a request instead:

if osc_cache_too_much |
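The two alternatives, sketched in the same invented terms rather than the real client_obd fields and function signatures: option A shrinks inline in the completion path, option B hands the shrink to a worker (ptlrpcd in the real code) so the completion path is not blocked by reclaim work.

#include <stdbool.h>

/* Invented stand-ins: the shrink itself, a "cache over limit" check, and a
 * way to hand work to a separate worker thread. */
void lru_shrink(void);
bool cache_too_much(void);
void queue_shrink_work(void);

/* Option A: shrink inline when an extent finishes. */
static void maybe_shrink_inline(void)
{
        if (cache_too_much())
                lru_shrink();
}

/* Option B: queue the shrink instead, keeping the completion path short. */
static void maybe_shrink_queued(void)
{
        if (cache_too_much())
                queue_shrink_work();
}

The trade-off is the usual one: option B keeps reclaim work off the I/O completion thread, at the cost of a wakeup latency that depends on when the worker runs.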
| Comment by Stephen Champion [ 14/Jun/14 ] |
|
Just an update: I wrote a test for this, which worked. It also failed (in a different way) with the patch I wrote. I intend to get back to this, but have some pressing customer issues to deal with. |
| Comment by John Fuchs-Chesney (Inactive) [ 18/Jul/14 ] |
|
Hello Stephen, |
| Comment by Stephen Champion [ 18/Jul/14 ] |
|
This is not really a problem for us - it's just a bug I stumbled across. I wrote a test that triggers another bug that I have not fully investigated, and a fix that crashes in a later acceptance test. Our customers suffer from too much memory, not too little, so I am unlikely to be able to spend additional time on this. I'd be happy to pass my incomplete work along for someone to pick up. |
| Comment by John Fuchs-Chesney (Inactive) [ 25/Jul/14 ] |
|
Thank you Stephen. |
| Comment by John Fuchs-Chesney (Inactive) [ 17/Nov/14 ] |
|
We can reopen this ticket if this issue becomes a priority. ~ jfc. |