Lustre / LU-1663

MDS threads hang for over 725s, causing failover

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Lustre 1.8.7
    • Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443, Centos 5.5
    • 2
    • 4055

    Description

      At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run OK for a couple of days, and then whichever MDS is currently running will lock up and fail over to the other one. lfs1, however, is not affected, even though it runs an identical setup as far as the hardware goes.

      We have the stack traces that were logged, but not the Lustre debug logs, as they have been kept on tmpfs. We've changed the debug file location, so hopefully we'll capture the next batch. I'll include a sampling of the interesting call traces here and attach the rest.

      Here is the root cause of the failover: the health_check times out and reports NOT HEALTHY, which causes the HA software to fail over:
      Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s
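
      As an editorial aside, here is a minimal, hedged C sketch of how an HA monitor can consume this same health signal. The /proc path and the "healthy" / "NOT HEALTHY" strings are assumptions based on the behaviour described above and typical Lustre 1.8 servers; this is not NOAA's actual failover agent.

      /* health_probe.c: hedged sketch of a failover health probe.
       * Assumption: /proc/fs/lustre/health_check reads back "healthy" when all
       * services are fine and contains "NOT HEALTHY" when a request has been
       * stuck past the deadline, as in the log line above. */
      #include <stdio.h>
      #include <string.h>

      static int lustre_server_healthy(void)
      {
              char buf[512] = "";
              FILE *fp = fopen("/proc/fs/lustre/health_check", "r");
              size_t n;

              if (fp == NULL)
                      return 0;                 /* treat unreadable as unhealthy */
              n = fread(buf, 1, sizeof(buf) - 1, fp);
              buf[n] = '\0';
              fclose(fp);

              return strstr(buf, "NOT HEALTHY") == NULL &&
                     strstr(buf, "healthy") != NULL;
      }

      int main(void)
      {
              /* Exit non-zero when unhealthy, which is the condition the HA
               * layer reacts to by failing over to the standby MDS. */
              return lustre_server_healthy() ? 0 : 1;
      }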

      This one makes it look like it might be quota-related:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff887f9220>] :lnet:LNetPut+0x730/0x840
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff800649fb>] __down+0xc3/0xd8
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff88a29490>] :lquota:dqacq_handler+0x0/0xc20
      ...

      This one looks a little like LU-1395 or LU-1269:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: Call Trace:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888ceb51>] ldlm_resource_add_lock+0xb1/0x180 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e2a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8009987d>] process_timeout+0x0/0x5
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e4555>] ldlm_completion_ast+0x4d5/0x880 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c9709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c4b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e30bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff88c611a7>] enqueue_ordered_locks+0x387/0x4d0 [mds]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e09a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      ...

      Attachments

        1. 09sep.tar.bz2
          8.40 MB
        2. call_traces
          392 kB
        3. kern.log.2013-02-23.gz
          104 kB
        4. kern.log-20120721
          191 kB
        5. ll-1181-decoded.txt.gz
          0.2 kB
        6. log1.bz2
          438 kB
        7. lustre-log.txt.bz2
          4.55 MB
        8. mds1.log
          8.37 MB
        9. mds2.log
          3.45 MB


          Activity

            pjones Peter Jones added a comment -

            RH patches ok to use in production


            kitwestneat Kit Westneat (Inactive) added a comment -

            I also just submitted the xattrs patch that was referenced earlier. We are already carrying this patch at NOAA, and it seems to improve stability.

            http://review.whamcloud.com/7788

            pjones Peter Jones added a comment -

            Ihara

            Sorry that my comment was not clear enough. I understand that you wish to have reviews on the patch, and I was acknowledging that and then separately adding the reference to the patch. It is required to include such a link in the JIRA ticket to cross-reference between JIRA and Gerrit, and this step had previously been overlooked.

            Peter


            ihara Shuichi Ihara (Inactive) added a comment -

            Hi Peter, yes, but we are waiting for the inspection. Once code review is done, we will apply this to the kernel for the servers.

            pjones Peter Jones added a comment -

            OK Ihara. The patch is at http://review.whamcloud.com/#/c/6147/

            ihara Shuichi Ihara (Inactive) added a comment -

            Hi, would you please review this patch sooner? We are hitting multiple server crashes due to this issue.

            green Oleg Drokin added a comment -

            Yes, this patch looks like it would do the right thing.


            kitwestneat Kit Westneat (Inactive) added a comment -

            The openvz link was just supposed to be a link to the patch; the kernel we are running is a stock CentOS kernel. The backtrace and the core come from the same crash. There is another core, but it was placed on the wrong disk and we need to wait for a downtime to get it (next Tuesday). If it is a different backtrace, I'll update the ticket then.

            As for the fix for the GFP_NOFS bug, does that patch work? I found the patch on git.kernel.org as well:
            http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4

            Should I backport that to b1_8/RHEL5?

            Thanks.
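
            As an editorial aside, the following is a hedged, kernel-style C sketch of the bug class behind that GFP_NOFS fix; it is not the upstream commit or the Lustre patch itself, and the function name example_update is hypothetical. The point is that an allocation made while a jbd2 handle is open should use GFP_NOFS, because a GFP_KERNEL allocation can enter direct reclaim, re-enter the filesystem, and end up waiting on the journal the caller already holds, which is one way a jbd2 commit thread can appear hung.

            /* Hedged illustration of the GFP_NOFS allocation-context rule. */
            #include <linux/err.h>
            #include <linux/jbd2.h>
            #include <linux/slab.h>

            static int example_update(journal_t *journal, size_t len)
            {
                    handle_t *handle = jbd2_journal_start(journal, 1);
                    void *buf;

                    if (IS_ERR(handle))
                            return PTR_ERR(handle);

                    /* BAD: GFP_KERNEL may recurse into the filesystem from
                     * reclaim while the transaction handle is open:
                     * buf = kmalloc(len, GFP_KERNEL);
                     */

                    /* OK: GFP_NOFS forbids filesystem re-entry from reclaim. */
                    buf = kmalloc(len, GFP_NOFS);
                    if (buf == NULL) {
                            jbd2_journal_stop(handle);
                            return -ENOMEM;
                    }

                    /* ... modify metadata using buf ... */

                    kfree(buf);
                    return jbd2_journal_stop(handle);
            }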

            green Oleg Drokin added a comment -

            OK, I got the vmcore, but the vmlinux from openvz is not the right one.

            In any case, I looked inside the core, and it appears to be from this latest crash, where we already know what happened and which is a different bug altogether.

            What I am looking for is a vmcore from a situation where the jbd2 commit thread locks up.


            kitwestneat Kit Westneat (Inactive) added a comment -

            I pushed the patch we are using for the transaction wraparound, in case it would be useful:
            http://review.whamcloud.com/6147
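
            As an editorial aside, here is a small, self-contained C sketch of the wraparound-safe comparison idiom that this kind of transaction-wraparound fix relies on; it is illustrative only and is not the content of the gerrit change above. jbd2's own tid_gt()/tid_geq() helpers use the same signed-difference trick so that transaction IDs still order correctly after the 32-bit counter wraps.

            /* tid_wrap.c: demonstrate wrap-safe ordering of 32-bit transaction IDs. */
            #include <assert.h>
            #include <stdint.h>
            #include <stdio.h>

            typedef uint32_t tid_t;

            /* Same idiom as jbd2's tid_gt(): the signed difference gives the
             * right ordering as long as the IDs are within 2^31 of each other. */
            static int tid_gt(tid_t x, tid_t y)
            {
                    int32_t diff = (int32_t)(x - y);
                    return diff > 0;
            }

            int main(void)
            {
                    tid_t before_wrap = 0xfffffff0u;   /* just before the counter wraps */
                    tid_t after_wrap  = 0x00000010u;   /* shortly after the wrap */

                    assert(!(after_wrap > before_wrap));     /* naive compare orders them wrongly */
                    assert(tid_gt(after_wrap, before_wrap)); /* wrap-safe compare is correct */

                    printf("wrap-safe comparison orders the transactions correctly\n");
                    return 0;
            }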


            People

              Assignee: green Oleg Drokin
              Reporter: kitwestneat Kit Westneat (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: