[LU-1663] MDS threads hang for over 725s, causing fail over

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 1.8.7
    • Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443
      CentOS 5.5
    • 2
    • 4055

    Description

      At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run OK for a couple of days, and then whichever MDS is currently active locks up and fails over to the other one. lfs1, however, is not affected, although its hardware setup is identical.

      We have the stack traces that were logged, but not the lustre-logs, because they were being written to tmpfs. We've changed the debug_file location, so hopefully we'll capture the next batch. I'll include a sample of the interesting call traces here and attach the rest.

      Here is the immediate cause of the failover. The health_check times out and prints NOT HEALTHY, which causes HA to fail over:
      Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s
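
To see how often and for how long requests stalled before each failover, messages like the one above can be pulled out of the kernel log. The following is a minimal Python sketch, not part of the original report. The regular expression is modeled only on the single log line quoted above, and the names `UnhealthyEvent` and `find_unhealthy` are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class UnhealthyEvent(NamedTuple):
    """One 'unhealthy' message from ptlrpc_service_health_check()."""
    timestamp: str   # syslog timestamp, e.g. "Jul 17 17:23:30"
    host: str        # reporting host, e.g. "lfs-mds-2-2"
    service: str     # service name, e.g. "mds"
    waited_s: int    # seconds the oldest request has been waiting


# Assumption: every message has the same layout as the line quoted in
# this ticket. Other Lustre versions may format it differently.
_UNHEALTHY = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:.*"
    r"ptlrpc_service_health_check\(\)\)\s+(?P<svc>\w+):\s+unhealthy\s+-\s+"
    r"request has been waiting\s+(?P<secs>\d+)s"
)


def find_unhealthy(lines: Iterable[str]) -> Iterator[UnhealthyEvent]:
    """Yield an event for every health-check 'unhealthy' line."""
    for line in lines:
        m = _UNHEALTHY.match(line)
        if m:
            yield UnhealthyEvent(
                m.group("ts"), m.group("host"),
                m.group("svc"), int(m.group("secs")),
            )
```

Iterating over `find_unhealthy(open(path))`, where `path` is a decompressed kernel log, lists each stall with its host and wait time.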

      This one makes it look like it might be quota related:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff887f9220>] :lnet:LNetPut+0x730/0x840
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff800649fb>] __down+0xc3/0xd8
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff88a29490>] :lquota:dqacq_handler+0x0/0xc20
      ...

      This one looks a little like LU-1395 or LU-1269:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: Call Trace:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888ceb51>] ldlm_resource_add_lock+0xb1/0x180 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e2a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8009987d>] process_timeout+0x0/0x5
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e4555>] ldlm_completion_ast+0x4d5/0x880 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c9709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c4b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e30bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff88c611a7>] enqueue_ordered_locks+0x387/0x4d0 [mds]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e09a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      ...
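
When going through the attached call traces, it can help to reduce each frame to its module and function name so that similar stacks can be grouped. Below is a rough Python sketch, assuming frames look like the two styles quoted above (`:module:func+0x.../0x...` and `func+0x.../0x... [module]`). `parse_frame` and `count_modules` are hypothetical helper names, not Lustre tooling.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Matches one stack frame in either of the two styles seen in this ticket.
_FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+"                         # [<ffffffff887f9220>]
    r"(?::(?P<prefix_mod>\w+):)?"                 # optional ":lnet:" prefix
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"  # LNetPut+0x730/0x840
    r"(?:\s+\[(?P<suffix_mod>\w+)\])?"            # optional " [ptlrpc]" suffix
)


def parse_frame(line: str) -> Optional[Tuple[Optional[str], str]]:
    """Return (module, function) for a frame line, or None for other lines.

    module is None when the frame names no module.
    """
    m = _FRAME.search(line)
    if m is None:
        return None
    module = m.group("prefix_mod") or m.group("suffix_mod")
    return module, m.group("func")


def count_modules(lines: Iterable[str]) -> Counter:
    """Count frames per module across a set of log lines."""
    counts: Counter = Counter()
    for line in lines:
        frame = parse_frame(line)
        if frame is not None:
            # "(core kernel)" is an illustrative label for frames that
            # carry no module name; it is an assumption, not log output.
            counts[frame[0] or "(core kernel)"] += 1
    return counts
```

A trace dominated by `lquota` frames versus one dominated by `ptlrpc` and `mds` frames would then show up directly in the counts.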

      Attachments

        1. 09sep.tar.bz2 (8.40 MB)
        2. call_traces (392 kB)
        3. kern.log.2013-02-23.gz (104 kB)
        4. kern.log-20120721 (191 kB)
        5. ll-1181-decoded.txt.gz (0.2 kB)
        6. log1.bz2 (438 kB)
        7. lustre-log.txt.bz2 (4.55 MB)
        8. mds1.log (8.37 MB)
        9. mds2.log (3.45 MB)

          Activity

            pjones Peter Jones made changes -
            Labels Original: patch New: mn8 patch
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            RH patches ok to use in production


            kitwestneat Kit Westneat (Inactive) added a comment -

            I also just submitted the xattrs patch that was referenced. We are already carrying this patch at NOAA, and it seems to improve stability.

            http://review.whamcloud.com/7788

            pjones Peter Jones added a comment -

            Ihara

            Sorry that my comment was not clear enough. I understand that you wish to have reviews on the patch, and I was acknowledging that and then separately adding the reference to the patch. It is required to include such a link in the JIRA ticket to cross-reference between JIRA and Gerrit, and this step had previously been overlooked.

            Peter


            ihara Shuichi Ihara (Inactive) added a comment -

            Hi Peter, yes, but we are waiting for inspection. Once the code review is done, we will apply this to the server kernel.

            pjones Peter Jones added a comment - ok Ihara. Patch is at http://review.whamcloud.com/#/c/6147/
            pjones Peter Jones made changes -
            Labels Original: ptr New: patch

            ihara Shuichi Ihara (Inactive) added a comment -

            Hi, would you please review this patch sooner? We are hitting multiple server crashes due to this issue.

            green Oleg Drokin added a comment -

            yes, this patch looks like it would do the right thing.


            People

              green Oleg Drokin
              kitwestneat Kit Westneat (Inactive)
              Votes: 0
              Watchers: 7
