[LU-5805] tgt_recov blocked and "waking for gap in transno"

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • 3
    • 16286

    Description

      We are testing our 2.5.3-based branch using osd-zfs. The clients, LNet routers, and server nodes all had version 2.5.3-1chaos installed (see github.com/chaos/lustre).

      On the recommendation from LU-5803, I made a test build of Lustre consisting of 2.5.3-1chaos plus http://review.whamcloud.com/12365.

      I installed this on the servers only. At the time, we had the SWL IO test running (a mixture of ior, mdtest, simul, etc., all running at the same time).

      I then rebooted just the servers onto the test build. The OSS nodes show lots of startup error messages, many of which we didn't see without this new patch. Granted, it was just the one time.

      See attached file named simply "log". This is the console log from one of the OSS nodes.

      Here's my initial view of what is going on:

      The OSS nodes boot significantly faster than the MGS/MDS node. We have retry set to 32. I suspect that this noise is related to the MGS not yet having started:

      2014-10-24 14:21:38 LustreError: 7421:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fef274800 x1482881169883144/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.169@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2014-10-24 14:21:38 LustreError: 7421:0:(obd_mount_server.c:1120:server_register_target()) lcy-OST0001: error registering with the MGS: rc = -5 (not fatal)
      

      The MGS/MDS node doesn't start mounting the MGS and MDS devices until 14:25:47 and 14:25:47, respectively.

      The MDS enters recovery at this time:

      2014-10-24 14:26:23 zwicky-lcy-mds1 login: Lustre: lcy-MDT0000: Will be in recovery for at least 5:00, or until 134 clients reconnect.
      

      So there are at least 4 problems here. We may need to split them up into separate subtickets:

      1. OSS noise before MGS/MDS has started
      2. tgt_recov "blocked for more than 102 seconds" (some of the 16 OSS nodes did this)
      3. "waking for gap in transno", the MDS and some of the OSS nodes show a swath of these
      4. Many OSS nodes hit "2014-10-24 14:36:23 LustreError: 7479:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel" within a few minutes of recovery being complete

      Attachments

        Issue Links

          Activity


            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            Is Intel waiting on LLNL for this?

            adilger Andreas Dilger added a comment -

            The patch http://review.whamcloud.com/12672 "LU-5816 target: don't trigger watchdog waiting in recovery" addresses issue #2 and was landed for 2.7.0 and 2.5.4.

            morrone Christopher Morrone (Inactive) added a comment -

            Didn't we just establish that a reproducer is not forthcoming? Additionally, LLNL staff time to babysit and get logs is not in great abundance. You need to find another way to make progress.
            green Oleg Drokin added a comment -

            I don't think we need a debug patch yet. I envision that having full (or close to full) debug enabled on the OSS where this happens, with a debug buffer big enough to contain the whole thing plus some margin until you can trigger the dump, should be enough. Naturally this requires a reproducer and some time to set up and babysit in order to get the logs.

            morrone Christopher Morrone (Inactive) added a comment -

            It is not terribly reasonable to expect that customers will give full access to their machines, or that the burden is on them to have solid reproducers. You will need to find other ways to make progress on the many issues like this.

            In the long term that means stronger requirements on code quality, but of course that does not help us today. In the short term, since no one understands the code well enough to have any clue what is happening, it probably means that debugging patches are necessary to start the process of learning what is going on.
            green Oleg Drokin added a comment -

            The LU-3421 patch in the end was fixing a hang, not this particular message, which I think was just a (possibly) related symptom.
            Similarly, in LU-5652 (which I see has since been transformed into a customer-closed ticket) this message was merely noted while other problems were happening with the system.
            While we do have access to Hyperion, what we really need is a way to reliably reproduce the issue (I checked the logs on my own test systems and saw this message only twice, both on the same day, Sep 4th, but on two different systems in different, unrelated tests).
            If you can no longer reliably reproduce this either, then there's no point in any access at this stage; and if there is a solid reproducer that works everywhere, then we can also try it under our own control.

            morrone Christopher Morrone (Inactive) added a comment -

            For message #4, we agree that this has been around for a while. We have found instances of it under 2.4.

            Note, however, that our recent hit was under 2.5.3+, so the LU-3421 solution did not address whatever issue is currently causing it. LU-5653, if that is the correct ticket number, is hidden from me.

            No, I don't think we will be able to give you access to this system. But Intel does have access to Hyperion.

            Our new testing guy thought he was on track to reproduce the issue, but then it suddenly stopped happening. He backtracked to the earlier tests that had caused it and wasn't able to see it there either.
            green Oleg Drokin added a comment -

            Hm, patch 12365 certainly plays with timeouts, but it's not obvious how it would cause message #3. Perhaps you simply never saw the problem without this patch (I understand you never really deployed 2.5 before, so you likely do not have enough data to look back at). Also, I am not sure we can refer to it as a "bug"; the condition can occur genuinely.
            Consider that a bunch of requests are being handled at about the same time and then the server crashes: some replies made it out and some did not. If the reply for the lower-numbered transaction did not make it out, you'll see a message like this. Whether you should care or not, I am less sure; perhaps we can just make it never show up on the console, like we already do for VBR.
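
            To make that scenario concrete, here is a toy sketch in plain C (not the actual Lustre recovery code; all names and numbers are made up). The reply for transno 102 never reached its client, so no client can replay it, and the server waits on a transaction that nobody will ever present before giving up and "waking for the gap":

            #include <stdio.h>

            int main(void)
            {
                    /* the server expects replays for transnos 101, 102, 103 in order */
                    long long next_transno = 101;
                    /* the reply for 102 was lost before the crash, so nobody replays it */
                    long long replayed[] = { 101, 103 };
                    int n = sizeof(replayed) / sizeof(replayed[0]);

                    for (int i = 0; i < n; i++) {
                            if (replayed[i] != next_transno)
                                    printf("waking for gap in transno: expected %lld, got %lld\n",
                                           next_transno, replayed[i]);
                            next_transno = replayed[i] + 1;
                    }
                    return 0;
            }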

            As for the patch itself, it basically adds more time to requests in recovery before they get dropped on the floor:

            +++ b/lustre/ptlrpc/service.c
            @@ -1302,7 +1302,9 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_reques
                             * during the recovery period send at least 4 early replies,
                             * spacing them every at_extra if we can. at_estimate should
                             * always equal this fixed value during recovery. */
            -               at_measured(&svcpt->scp_at_estimate, min(at_extra,
            +               at_measured(&svcpt->scp_at_estimate,
            +                           cfs_time_current_sec() -
            +                           req->rq_arrival_time.tv_sec + min(at_extra,
                                        req->rq_export->exp_obd->obd_recovery_timeout / 4))
            

            This seems to be really simple and not harmful in any way to me.
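
            For illustration only, here is a standalone sketch (made-up numbers, not the real ptlrpc code) of the value that ends up being fed into at_measured(): with the patch, the time a request has already spent queued during recovery is added in, so the early-reply service estimate keeps growing instead of staying pinned at min(at_extra, obd_recovery_timeout / 4):

            #include <stdio.h>

            #define AT_EXTRA 30L                      /* made-up value, seconds */

            static long min_l(long a, long b) { return a < b ? a : b; }

            int main(void)
            {
                    long now = 1000;                  /* "current" time, seconds */
                    long arrival = 800;               /* the request arrived 200s ago */
                    long recovery_timeout = 600;      /* obd_recovery_timeout, seconds */

                    /* before the patch: time already spent queued is ignored */
                    long old_input = min_l(AT_EXTRA, recovery_timeout / 4);
                    /* after the patch: the queued time is added in */
                    long new_input = (now - arrival) + min_l(AT_EXTRA, recovery_timeout / 4);

                    printf("at_measured() input: old=%lds, patched=%lds\n",
                           old_input, new_input);
                    return 0;
            }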

            Also, for message #4: this was apparently hit by various people as early as 2.4; sample tickets are LU-3421 and LU-5653. It is currently believed to be harmless, though we lack a complete understanding of what is going on here. Since you are able to hit this on a test system, can we somehow get access to investigate it further? If you know how to cause it reliably, that would be good too.

            For message #1 - I am not sure I really understand what you think would be a better solution. The clients need some way to obtain a configuration (i.e. the MGS being up); otherwise they would not be able to mount. Are you just unhappy about the noise in the messages? The error is genuine after all, so simply silencing the message might not make a lot of sense.

            I filed LU-5816 for message #2.


            morrone Christopher Morrone (Inactive) added a comment -

            • Since the FS is not really usable until MGS starts (for initial client mounts), I wonder if you can just delay ost startup to better match initial mgs starting time?

            Sure, we could also hop on one foot, juggle, and chew bubblegum while restarting Lustre. But we don't want to.

            We can add this to the "Things That Suck About Lustre" list, but not let it hold up our 2.5 roll out.

            • the tgt recov blocked message is also normal, it's an artifact of the sleep timer watchdog you configured in the kernel. We probably should explore how to shut it by letting kernel know we are not really stuck yet and the prolonged delay is actually expected

            Right. Lustre is behaving badly in the kernel. Has been for quite some time. I agree. We would like to see it fixed.

            • waking for gap in transno. - this one indicates that while all of the expected clients reconnected, some of them failed to present the transaction we think should be next (lost reply to client before server went down?)

            That sounds like a rather worrying bug. Do you think that http://review.whamcloud.com/12365 is causing that? Should I take that back out?

            Would you recommend running the code as is at a major HPC center?

            • is the subset of nodes displaying this message same as the nodes displaying message in #3 (ignoring mds)?

            No, I don't see a connection. After a weekend of testing, it is clear that the "Error -2 syncing data on lock cancel" is an ongoing problem:

            2014-10-27 08:12:52 LustreError: 99105:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 3 previous similar messages
            2014-10-27 08:17:34 LustreError: 8795:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 08:25:46 LustreError: 7471:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 08:25:46 LustreError: 7471:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 1 previous similar message
            2014-10-27 08:29:34 LustreError: 7471:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 08:29:34 LustreError: 27295:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 08:29:34 LustreError: 86007:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 08:41:25 LustreError: 8001:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 08:41:25 LustreError: 8001:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 1 previous similar message
            2014-10-27 09:09:51 LustreError: 8064:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 09:14:45 LustreError: 8354:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 09:28:07 LustreError: 7704:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 09:54:12 LustreError: 8301:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 09:54:12 LustreError: 8562:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel
            2014-10-27 09:54:12 LustreError: 8562:0:(ost_handler.c:1776:ost_blocking_ast()) Skipped 2 previous similar messages
            

            I am thinking that I need to take out the http://review.whamcloud.com/12365 patch.

            green Oleg Drokin added a comment -

            1. The "send limit expired" message is normal considering the MGS is not up for a prolonged period of time (a client would simply refuse to start at such a time, since it cannot get a config from anywhere). Since the FS is not really usable until the MGS starts (for initial client mounts), I wonder if you can just delay OST startup to better match the initial MGS start time?
            2. The tgt_recov blocked message is also normal; it's an artifact of the sleep timer watchdog you configured in the kernel. We should probably explore how to shut it up by letting the kernel know we are not really stuck yet and that the prolonged delay (the recovery timeout) is actually expected (see the sketch below this list).
            3. "waking for gap in transno" - this one indicates that while all of the expected clients reconnected, some of them failed to present the transaction we think should be next (a reply lost to the client before the server went down?).
            4. Is the subset of nodes displaying this message the same as the nodes displaying the message in #3 (ignoring the MDS)? The message means the object went away while a lock on that object was held. It could potentially be caused by the orphan-cleanup code if the MDS somehow did not remember correctly what the last touched object was, yet not far enough back into the past to trigger the catastrophic fallout reported in LU-5648. If it only happens after recovery and never again, that would be one of the lead suspicions, and the forgetful-MDS issue would need to be addressed instead. You did not abort recovery in this case, did you?
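
            As a rough sketch of what "letting the kernel know we are not really stuck" in item 2 could look like (a hypothetical helper, not the actual LU-5816 change): the hung-task watchdog only flags tasks sitting in uninterruptible sleep whose context-switch count has not changed, so waking up periodically during the long but expected recovery wait keeps it quiet.

            #include <linux/kernel.h>
            #include <linux/types.h>
            #include <linux/wait.h>
            #include <linux/jiffies.h>
            #include <linux/errno.h>

            /* hypothetical helper: sleep in bounded chunks so the task keeps
             * getting scheduled; each wakeup bumps the context-switch count,
             * which is what the hung-task watchdog looks at */
            static int wait_for_recovery_done(wait_queue_head_t *waitq,
                                              bool (*done)(void *data), void *data,
                                              unsigned long total_timeout)
            {
                    unsigned long deadline = jiffies + total_timeout;

                    while (!done(data)) {
                            long remaining = (long)(deadline - jiffies);

                            if (remaining <= 0)
                                    return -ETIMEDOUT;
                            /* wake up at least every 30s even if nothing happened */
                            wait_event_timeout(*waitq, done(data),
                                               min_t(long, remaining, 30 * HZ));
                    }
                    return 0;
            }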

            morrone Christopher Morrone (Inactive) added a comment -

            We need to triage these various startup problems ASAP. We need to know if we can start rolling this out into production in a week, and to do that we need to tag a local Lustre release on Oct 28. I am not feeling terribly confident about 2.5's recovery situation at this point.

            People

              green Oleg Drokin
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
