[LU-5805] tgt_recov blocked and "waking for gap in transno" Created: 24/Oct/14 Updated: 18/Jul/16 Resolved: 25/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Oleg Drokin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16286 |
| Description |
|
We are testing our 2.5.3-based branch using osd-zfs. The clients, LNet router, and server nodes all had version 2.5.3-1chaos installed (see github.com/chaos/lustre). On the recommendation from I installed this on the servers only. At the time, we had the SWL IO test running (a mixture of ior, mdtest, simul, etc., all running at the same time). I then rebooted just the servers onto the test build.

The OSS nodes show lots of startup error messages, many that we didn't see without this new patch. Granted, it was just one time. See the attached file named simply "log"; it is the console log from one of the OSS nodes.

Here's my initial view of what is going on: the OSS nodes boot significantly faster than the MGS/MDS node. We have retry set to 32. I suspect that this noise is related to the MGS not yet having started:

2014-10-24 14:21:38 LustreError: 7421:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880fef274800 x1482881169883144/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.169@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
2014-10-24 14:21:38 LustreError: 7421:0:(obd_mount_server.c:1120:server_register_target()) lcy-OST0001: error registering with the MGS: rc = -5 (not fatal)

The MGS/MDS node doesn't start mounting the MGS and MDS devices until 14:25:47 and 14:25:47, respectively. The MDS enters recovery at this time:

2014-10-24 14:26:23 zwicky-lcy-mds1 login: Lustre: lcy-MDT0000: Will be in recovery for at least 5:00, or until 134 clients reconnect.

So there are at least 4 problems here. We may need to split them up into separate subtickets:
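To make the "send limit expired" path above concrete: while the MGS is still down, the MGC's registration RPCs sit queued against an unconnected import, and once a request outlives its send deadline it is failed back to the caller instead of waiting forever, which is what surfaces as the rc = -5 (-EIO) registration error. The sketch below only illustrates that idea; it is not the actual Lustre code, and the type and function names are simplified, hypothetical stand-ins.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical, simplified model of an RPC queued against an import
     * (server connection) that has not come up yet. */
    struct pending_req {
            time_t arrival;        /* when the request was queued        */
            time_t send_deadline;  /* latest time we will wait to send   */
    };

    struct import_state {
            bool connected;        /* true once the MGS connection is up */
    };

    /* Returns 0 if the request may be sent, 1 if it should keep waiting,
     * or -EIO once the send limit has expired and the request must be
     * failed back to the caller (the "rc = -5" seen by the target). */
    static int may_send_or_fail(const struct import_state *imp,
                                const struct pending_req *req, time_t now)
    {
            if (imp->connected)
                    return 0;
            if (now > req->send_deadline)
                    return -EIO;
            return 1;
    }

    int main(void)
    {
            struct import_state mgs = { .connected = false };
            struct pending_req req = { .arrival = time(NULL) - 400,
                                       .send_deadline = time(NULL) - 10 };

            printf("verdict: %d\n", may_send_or_fail(&mgs, &req, time(NULL)));
            return 0;
    }

This is presumably also why the console message says "(not fatal)": once the MGS finishes mounting and the MGC connects, the target registration can be retried and succeed.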
|
| Comments |
| Comment by Christopher Morrone [ 25/Oct/14 ] |
|
We need to triage these various startup problems ASAP. We need to know if we can start rolling this out into production in a week, and to do that we need to tag a local lustre release on Oct 28. I am not feeling terribly confident about 2.5's recovery situation at this point. |
| Comment by Oleg Drokin [ 25/Oct/14 ] |
|
1. The "send limit expired" message is normal considering the MGS is not up for a prolonged period of time (the client would just refuse to start at such a time since it cannot get a config out of anywhere).
|
| Comment by Christopher Morrone [ 27/Oct/14 ] |
Sure, we could also hop on one foot, juggle, and chew bubblegum while restarting Lustre. But we don't want to. We can add this to the "Things That Suck About Lustre" list, but not let it hold up our 2.5 roll out.
Right. Lustre is behaving badly in the kernel. Has been for quite some time. I agree. We would like to see it fixed.
That sounds like a rather worrying bug. Do you think that http://review.whamcloud.com/12365 is causing that? Should I take that back out? Would you recommend running the code as is at a major HPC center?
No, I don't see a connection. After a weekend of testing, it is clear that the "Error -2 syncing data on lock cancel" is an ongoing problem:

2014-10-27 08:12:52 LustreError: 99105:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 3 previous similar messages
2014-10-27 08:17:34 LustreError: 8795:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 08:25:46 LustreError: 7471:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 08:25:46 LustreError: 7471:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 1 previous similar message
2014-10-27 08:29:34 LustreError: 7471:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 08:29:34 LustreError: 27295:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 08:29:34 LustreError: 86007:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 08:41:25 LustreError: 8001:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 08:41:25 LustreError: 8001:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 1 previous similar message
2014-10-27 09:09:51 LustreError: 8064:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 09:14:45 LustreError: 8354:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 09:28:07 LustreError: 7704:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 09:54:12 LustreError: 8301:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 09:54:12 LustreError: 8562:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
2014-10-27 09:54:12 LustreError: 8562:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 2 previous similar messages

I am thinking that I need to take out the http://review.whamcloud.com/12365 patch. |
| Comment by Oleg Drokin [ 28/Oct/14 ] |
|
Hm, the patch 12365 would certainly play with timeouts, but it's not obvious how it would cause message #3. But if you never saw the problem without this patch... (I understand you never really deployed 2.5 before either, so likely you do not have enough data to look back at.) Also, I am not sure that we can refer to it as a "bug"; the issue can occur genuinely.

As for the patch itself, basically it's adding time to requests in recovery before dropping them on the floor:

+++ b/lustre/ptlrpc/service.c
@@ -1302,7 +1302,9 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_reques
	 * during the recovery period send at least 4 early replies,
	 * spacing them every at_extra if we can. at_estimate should
	 * always equal this fixed value during recovery. */
-	at_measured(&svcpt->scp_at_estimate, min(at_extra,
+	at_measured(&svcpt->scp_at_estimate,
+		    cfs_time_current_sec() -
+		    req->rq_arrival_time.tv_sec + min(at_extra,
		    req->rq_export->exp_obd->obd_recovery_timeout / 4))

This seems to be really simple and not harmful in any way to me.

Also, for message #4, this apparently was hit by various people as early as 2.4; sample tickets are

For message #1 - I am not sure I really understand what you think would be a better solution. The clients need some way to obtain a configuration (i.e. the MGS being up); without that they would not be able to mount. Are you just unhappy about the noise in the messages? But the error is genuine after all, so just plain silencing the message might not make a lot of sense. I filed
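In other words, the change makes the early-reply estimate grow with how long the request has already been sitting in the recovery queue, rather than staying capped at a fixed value, so clients waiting on recovery keep receiving early replies instead of timing the request out. A rough sketch of the before/after arithmetic, using made-up constant values for at_extra and obd_recovery_timeout (the real code feeds this through Lustre's at_measured()/adaptive-timeout machinery):

    #include <stdio.h>
    #include <time.h>

    /* Assumed illustrative values, not taken from any real configuration. */
    #define AT_EXTRA             30   /* seconds; "at_extra" spacing value */
    #define RECOVERY_TIMEOUT    300   /* seconds; obd_recovery_timeout     */

    static time_t min_time(time_t a, time_t b) { return a < b ? a : b; }

    /* Old behaviour: the value fed to at_measured() is a fixed cap. */
    static time_t estimate_before(void)
    {
            return min_time(AT_EXTRA, RECOVERY_TIMEOUT / 4);
    }

    /* New behaviour: add the time the request has already spent queued
     * (now - arrival), matching the cfs_time_current_sec() -
     * req->rq_arrival_time.tv_sec term in the diff above. */
    static time_t estimate_after(time_t now, time_t arrival)
    {
            return (now - arrival) + min_time(AT_EXTRA, RECOVERY_TIMEOUT / 4);
    }

    int main(void)
    {
            time_t now = time(NULL);
            time_t arrival = now - 120;  /* request has waited 2 minutes */

            printf("before: %ld s, after: %ld s\n",
                   (long)estimate_before(), (long)estimate_after(now, arrival));
            return 0;
    }

With those assumed values, a request that has already waited 120 seconds gets an estimate of 120 + min(30, 75) = 150 seconds instead of a flat 30 seconds. |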
| Comment by Christopher Morrone [ 30/Oct/14 ] |
|
For message #4, we agree that this has been around a while. We have found instances of it under 2.4. Note however that our recent hit was under 2.5.3+, so the

No, I don't think we will be able to give you access to this system. But Intel does have access to hyperion. Our new testing guy thought he was on track to reproduce the issue, but then it suddenly stopped happening. He backtracked to earlier tests that had caused it and wasn't able to see it there either. |
| Comment by Oleg Drokin [ 31/Oct/14 ] |
|
|
| Comment by Christopher Morrone [ 31/Oct/14 ] |
|
It is not terribly reasonable to expect that customers will give full access to their machines, or that the burden is on them to have solid reproducers. You will need to find other ways to make progress on the many issues like this. In the long term that means stronger requirements on code quality, but of course that does not help us today. In the short term, since no one understands the code well enough to have any clue what is happening it probably means that debugging patches are necessary to start the process of learning what is going on. |
| Comment by Oleg Drokin [ 01/Nov/14 ] |
|
I don't think we need a debug patch yet. I envision that having full, or close to full, debug on the OSS where this happens (with a debug buffer big enough to contain the whole thing, plus margin, until you can trigger the dump) should be enough. Naturally this requires a reproducer and some time to set up and babysit to get the logs. |
| Comment by Christopher Morrone [ 03/Nov/14 ] |
|
Didn't we just establish that a reproducer is not forthcoming? Additionally, LLNL staff time to babysit and get logs is not in great abundance. You need to find another way to make progress. |
| Comment by Andreas Dilger [ 04/Aug/15 ] |
|
The patch http://review.whamcloud.com/12672 " |
| Comment by D. Marc Stearman (Inactive) [ 19/Feb/16 ] |
|
Is Intel waiting on LLNL for this? |
| Comment by Peter Jones [ 22/Feb/16 ] |
|
I think that a level set is reasonable - which of the initial four reported problems (if any) still manifest themselves on the latest 2.5.5 version in production at LLNL? |
| Comment by Cameron Harr [ 21/Mar/16 ] |
|
We haven't seen this error since 2/17/16. |
| Comment by Peter Jones [ 21/Mar/16 ] |
|
Cameron, how frequently was it happening prior to that date, and when did your 2.5.5 FE version get rolled out? Peter |
| Comment by Cameron Harr [ 22/Mar/16 ] |
|
Peter, as for frequency prior to that, it doesn't look like we've seen it since last July, and then only on a handful of nodes from multiple clusters. I'm in favor of closing. Chris? |
| Comment by Peter Jones [ 25/Apr/16 ] |
|
Thanks Cameron |