
[LU-1663] MDS threads hang for over 725s, causing failover

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Lustre 1.8.7
    • Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443, CentOS 5.5
    • 2
    • 4055

    Description

      At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run fine for a couple of days, and then whichever MDS is currently running will lock up and fail over to the other one. lfs1, however, is not affected, even though it runs an identical setup as far as hardware goes.

      We have the stack traces that get logged, but not the lustre-logs, as they have been on tmpfs. We've changed the debug_file location, so hopefully we'll get the next batch. I'll put a sampling of the interesting call traces here and attach the rest.

      Here is the root cause of the failover: the health_check times out and reports NOT HEALTHY, which causes HA to fail over:
      Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s
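
      For context, the health flag behind this message is exported to user space at /proc/fs/lustre/health_check: once a request has been waiting at a ptlrpc service longer than the allowed limit (the 725s in the message above), the file reports NOT HEALTHY, and the HA software here apparently treats that as a failed server. Below is a minimal one-shot probe of that flag, the sort of check an HA monitor might run; the /proc path is the standard Lustre 1.8 location, while the exit codes, messages, and the assumption that the healthy output starts with the word "healthy" are illustrative rather than a description of how the site's agent is configured.

      /* health_probe.c - illustrative one-shot probe of the Lustre health
       * flag, roughly the kind of check an HA monitor might perform.  The
       * /proc path is the standard Lustre 1.8 one; the exit codes and
       * wording are made up for this sketch. */
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              char buf[128] = "";
              FILE *f = fopen("/proc/fs/lustre/health_check", "r");

              if (f == NULL) {
                      perror("health_check");
                      return 2;       /* no Lustre server modules loaded here */
              }
              /* assumes the healthy output starts with the word "healthy" */
              if (fgets(buf, sizeof(buf), f) == NULL ||
                  strncmp(buf, "healthy", 7) != 0) {
                      /* e.g. "NOT HEALTHY" once a request has waited too long */
                      buf[strcspn(buf, "\n")] = '\0';
                      fprintf(stderr, "server unhealthy: %s\n", buf);
                      fclose(f);
                      return 1;       /* HA would fail the resource over */
              }
              fclose(f);
              printf("healthy\n");
              return 0;
      }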

      This one makes it look like it might be quota-related:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff887f9220>] :lnet:LNetPut+0x730/0x840
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff800649fb>] __down+0xc3/0xd8
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff88a29490>] :lquota:dqacq_handler+0x0/0xc20
      ...

      This one looks a little like LU-1395 or LU-1269:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: Call Trace:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888ceb51>] ldlm_resource_add_lock+0xb1/0x180 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e2a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8009987d>] process_timeout+0x0/0x5
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e4555>] ldlm_completion_ast+0x4d5/0x880 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c9709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c4b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e30bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff88c611a7>] enqueue_ordered_locks+0x387/0x4d0 [mds]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e09a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      ...

      Attachments

        1. 09sep.tar.bz2
          8.40 MB
        2. call_traces
          392 kB
        3. kern.log.2013-02-23.gz
          104 kB
        4. kern.log-20120721
          191 kB
        5. ll-1181-decoded.txt.gz
          0.2 kB
        6. log1.bz2
          438 kB
        7. lustre-log.txt.bz2
          4.55 MB
        8. mds1.log
          8.37 MB
        9. mds2.log
          3.45 MB

        Issue Links

          Activity


            kitwestneat Kit Westneat (Inactive) added a comment -

            We finally got a full vmcore. It's 12 GB. Is there a good place to put it? Alternatively, I could run commands on it. What would be the best way to prove that it's the tid wraparound issue?

            Thanks.


            kitwestneat Kit Westneat (Inactive) added a comment -

            It looks like the patch RH took to solve bz735768 is this one:
            https://github.com/torvalds/linux/commit/deeeaf13b291420fe4a4a52606b9fc9128387340

            Oh I just realized that the debian-kernel patch references the two commits already, doh.

            Are there any plans to carry the debian-kernel patch until RH backports it? How risky do you think it is to backport them to RHEL5?

            green Oleg Drokin added a comment -

            I have a high-load test system for Lustre regression testing, and I am hitting two sorts of jbd issues somewhat frequently: the RH bz735768 one, and some jbd thread lockups that I think are related. Some of them are somewhat similar to what you see. So I was doing some more investigation into that and found that these are apparently known problems if you have high transaction rates (which I certainly do).

            I suspect you can cherry-pick the patches from mainline linux kernel.


            kitwestneat Kit Westneat (Inactive) added a comment -

            I don't see that bug fixed in any RHEL5 kernel. Since Red Hat no longer breaks out their patches, it seems like it would be difficult to figure out what they did to fix it. I will look at the SRPMs for the jbd stuff to see if I can figure it out. Have you seen the transaction wraparound problem in Lustre before? Are there any workarounds? It seems like we can't be the first to hit it.

            green Oleg Drokin added a comment -

            I suspect what you are seeing is a generic jbd2 transaction wraparound problem.
            There have been several rounds at Red Hat to fix this; the previous round is here (already part of the RHEL 6.4 kernel, no idea if they backported it to RHEL 5 too): https://bugzilla.redhat.com/show_bug.cgi?id=735768 and the latest one seems to be unfolding here: http://lists.debian.org/debian-kernel/2013/03/msg00313.html though I don't know when it will make it to the RHEL kernels either.

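            For readers following along: jbd2 transaction IDs (tid_t) are 32-bit counters that the kernel compares with wrap-safe signed arithmetic, which is only valid while the two tids being compared are within 2^31 of each other. A tid recorded more than 2^31 commits ago wraps and starts to look like a future transaction, so code comparing it against the journal's commit sequence can end up waiting for a commit that will never happen (or tripping an assertion). The stand-alone sketch below demonstrates that comparison; the tid_gt()/tid_geq() helpers mirror the ones in include/linux/jbd2.h, while the surrounding demo values are made up for illustration.

            /* tid_wrap_demo.c - stand-alone illustration of jbd2's wrap-safe
             * transaction ID comparison.  tid_gt()/tid_geq() mirror the
             * helpers in include/linux/jbd2.h; everything else is a sketch. */
            #include <stdio.h>

            typedef unsigned int tid_t;     /* 32-bit transaction ID, as in jbd2 */

            static int tid_gt(tid_t x, tid_t y)
            {
                    int difference = (x - y);   /* signed diff absorbs wraparound */
                    return difference > 0;
            }

            static int tid_geq(tid_t x, tid_t y)
            {
                    int difference = (x - y);
                    return difference >= 0;
            }

            int main(void)
            {
                    tid_t head  = 0x80000100;          /* most recently committed tid */
                    tid_t fresh = head - 5;            /* committed a moment ago */
                    tid_t stale = head - 0x80000001;   /* cached > 2^31 commits ago */

                    /* A recent tid correctly reads as already committed ... */
                    printf("fresh committed? %d\n", tid_geq(head, fresh));
                    /* ... but the stale tid has wrapped and now looks as if it is
                     * still in the future, so a waiter keyed on it never wakes up. */
                    printf("stale committed? %d\n", tid_geq(head, stale));
                    printf("stale appears newer than head? %d\n", tid_gt(stale, head));
                    return 0;
            }
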
            green Oleg Drokin added a comment -

            Getting the full crashdump from a wedged transaction commit like this should be most useful, I suspect.


            kitwestneat Kit Westneat (Inactive) added a comment -

            Hi Oleg,

            That's correct, the system tried to umount once the OSTs went unhealthy, but failed.

            The lustre-log looks pretty slim; hopefully you can find something I missed.

            I will try to get a crashdump. Unfortunately all the ones we've been able to get so far have been incomplete. I'll try to come up with a way to get a full crashdump. Are there any other logs/debug information we can get that won't have too much impact on the running system?

            Thanks,
            Kit

            green Oleg Drokin added a comment -

            Hm, I do not see any obvious culprits in the traces.
            It is a bit strange that the jbd thread is the first to complain it's hung: it is in the middle of a transaction commit, with no report of any sleep, yet in D state (I wish you had a crashdump so that it could be checked more closely).
            Anyway, the first log that should hold most of the data is /tmp/lustre-log.1361661252.1181, I guess (the biggest one around that timestamp is the one we need), and we need it to at least try to see what was going on at the time.

            Also, I see an umount in the logs, but I assume that happened after the system had been wedged for some time and your scripts just attempted it?


            kitwestneat Kit Westneat (Inactive) added a comment -

            Any update?

            Thanks,
            Kit


            kitwestneat Kit Westneat (Inactive) added a comment -

            Hi, we were finally able to get some good debugging data from the system during a failover. Here is the sysrq-t output from before it went down. I also managed to save off all the lustre-logs; there are a ton of them, as you can see in the kern.log. Let me know if you want to see them.

            I noticed that there was a slab reclaim going on. Wasn't there another issue with slab reclaims taking a long time?


            kitwestneat Kit Westneat (Inactive) added a comment -

            kern.log from most recent failover (with sysrq-t)


            People

              Assignee: green Oleg Drokin
              Reporter: kitwestneat Kit Westneat (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: