Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
    • Environment: x86_64, CentOS 7.4, OPA, ZFS 0.7.9, Lustre 2.10.4
    • Severity: 3

    Description

      Hi,

      Our main filesystem has hung approximately 10 times since April.

      The symptoms are stuck-thread messages on the MDS (it has happened on both of our MDSes), and soon afterwards clients can't stat any files in the filesystem.

      We have 4 filesystems that share the same MDS hardware, and only the large DNE filesystem with 3 MDTs has hung, so we think this is probably a DNE issue.

      There are no logs indicating an MDS hardware problem. Despite that, we've swapped out a lot of hardware and tested the MDS components as well as we can, and it hasn't helped at all. 'zpool iostat 2' shows tiny amounts of I/O still going to the MDTs while the filesystem is hung, so e.g. the hardware RAID isn't just silently stopping with no logs.

      The MDTs are ZFS mirrors on top of hardware RAID1 (so 4-way replicated).

      To un-hang the filesystem, we fail over or reboot the MDS. Every time, we can't umount one of the 3 MDTs in the /dagg(fred) filesystem, and we have to forcibly power off the MDS (which led to LU-10877 etc.). Recovery these days goes OK, and then the filesystem works again. It seems random which of the three MDTs hangs on umount. We've lfsck'd the namespace about 5 times.
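
      (For reference, a minimal sketch of that recovery sequence; the mount points, pool/dataset name and BMC details below are placeholders, not our actual configuration:)

        # try to cleanly unmount the MDTs on the stuck MDS; one of them usually hangs here
        umount /lustre/fred-MDT0000
        umount /lustre/fred-MDT0001
        umount /lustre/fred-MDT0002

        # if an umount hangs, forcibly power-cycle the MDS via its BMC
        ipmitool -H mds-bmc -U admin -P password chassis power cycle

        # mount the MDTs on the failover MDS (ZFS backend: pool/dataset), then watch recovery
        mount -t lustre mdt0pool/mdt0 /lustre/fred-MDT0000
        lctl get_param mdt.*.recovery_status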

      Here are the first stack traces from each of the 2 separate hangs today, which are pretty typical. Usually it's an mdt_rdpg* thread that hangs, and those stack traces all look very similar, but occasionally it's an mdt* thread, with some variations on a theme in those.

      /var/log/messages:Jun 11 11:06:59 warble2 kernel: LNet: Service thread pid 116276 was inactive for 200.63s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: Pid: 116276, comm: mdt_rdpg00_024
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: #012Call Trace:
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff8109eb8b>] ? recalc_sigpending+0x1b/0x50
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff816b40e9>] schedule+0x29/0x70
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0e16f10>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0df882b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc164f5e9>] lod_trans_stop+0x259/0x340 [lod]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0df8fae>] ? top_trans_start+0x27e/0x940 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc16ed1ea>] mdd_trans_stop+0x2a/0x46 [mdd]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc16e2bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d81d47>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc15ce186>] mdt_mfd_close+0x1a6/0x610 [mdt]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc15d3901>] mdt_close_internal+0x121/0x220 [mdt]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc15d3c20>] mdt_close+0x220/0x780 [mdt]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0de52ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d8de2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d8a458>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d91572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d90ae0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: 
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1528679219.116276
      
      Jun 11 14:17:28 warble2 kernel: LNet: Service thread pid 22072 was inactive for 200.35s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 11 14:17:28 warble2 kernel: Pid: 22072, comm: mdt00_001
      Jun 11 14:17:28 warble2 kernel: #012Call Trace:
      Jun 11 14:17:28 warble2 kernel: [<ffffffff8109eb8b>] ? recalc_sigpending+0x1b/0x50
      Jun 11 14:17:28 warble2 kernel: [<ffffffff816b40e9>] schedule+0x29/0x70
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0dd3f10>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0db582b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc10f35e9>] lod_trans_stop+0x259/0x340 [lod]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0db5fae>] ? top_trans_start+0x27e/0x940 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc11911ea>] mdd_trans_stop+0x2a/0x46 [mdd]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc117eb31>] mdd_rename+0x4d1/0x14a0 [mdd]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc103b07a>] mdt_reint_rename_internal.isra.36+0x166a/0x20c0 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc103d3fb>] mdt_reint_rename_or_migrate.isra.39+0x19b/0x860 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d0eff0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d09440>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc103daf3>] mdt_reint_rename+0x13/0x20 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc1041b33>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc102337b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc102ef07>] mdt_reint+0x67/0x140 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0da22ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d4ae2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d47458>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d4e572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d4dae0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      Jun 11 14:17:28 warble2 kernel: [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      Jun 11 14:17:28 warble2 kernel: 
      Jun 11 14:17:28 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1528690648.22072
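
      (The /tmp/lustre-log.* files referenced above are binary Lustre debug dumps; they can be converted to text with lctl for inspection, e.g.:)

        # decode a binary Lustre debug dump into readable text
        lctl debug_file /tmp/lustre-log.1528690648.22072 /tmp/lustre-log.1528690648.22072.txt
        less /tmp/lustre-log.1528690648.22072.txt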
      

      I've attached logs of all the stack traces, collected with:

      zgrep warble /var/log/messages-20180[3456]* /var/log/messages | egrep -v 'crmd|pengine' | grep -i 'stack trace' -B20 -A40 > ~/lustre-bugs/warble-stack-traces.txt
      

      Clients at the moment are a mix of CentOS 7.4 with 2.10.3/2.10.4, and CentOS 7.5 with 2.10.4.
      Servers are CentOS 7.4 with 2.10.4 and ZFS 0.7.9, and boot with nopti.

      We have inherited (default) directory striping on the main filesystem, so each new directory is created on a random one of the three MDTs.
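
      (For reference, this kind of inherited default directory layout can be inspected and set with lfs roughly as below; the mount point is only an example, and the exact options accepted for automatic MDT selection vary between Lustre versions:)

        # show the default (inherited) directory layout applied to new subdirectories
        lfs getdirstripe -D /mnt/dagg

        # example: default to single-stripe directories on an automatically chosen MDT
        lfs setdirstripe -D -c 1 -i -1 /mnt/dagg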

      We have no idea what I/O pattern triggers the lockups, sorry.

      cheers,
      robin

      Attachments

        1. lustre-log.1529816295.52919
          49.77 MB
        2. lustre-log.1533268164.11346.gz
          6.07 MB
        3. messages.warble
          83 kB
        4. warble-stack-traces.txt
          958 kB

          Activity

            [LU-11082] stuck threads on MDS
            pjones Peter Jones added a comment -

            Good news. So let's close out this ticket then

            scadmin SC Admin added a comment -

            The chgrp/chmod script finished successfully. It fixed up about 10M files and a lot of dirs over 2.5 days, with no crash on the MDS.

            As the hung threads in the previous crash looked different from those at the start of this ticket, and this most recent sweep finished without issue, I would err on the side of saying this issue has been fixed.

            thanks!

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi Peter,

            Yes, that looks like it, thanks.
            I've included the b2_10 version of the stacktrace patch in our MDS builds and kicked off a re-run of the chmod/chgrp script.

            cheers,
            robin

            pjones Peter Jones added a comment -

            Is the stack trace issue LU-11062 perhaps?

            scadmin SC Admin added a comment -

            Hi,

            After 2 days of running, the find/chgrp/chmod sweep through all files almost completed, but not quite.
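
            (The sweep script itself isn't attached to this ticket; conceptually it is a find-driven pass of roughly the shape below, with placeholder mount point, group and mode - every match generates a metadata-modifying RPC on the MDS:)

              # hypothetical sketch of a find-driven chgrp/chmod sweep
              find /mnt/dagg -xdev -group oldgroup -print0 | xargs -0 -r chgrp newgroup
              find /mnt/dagg -xdev -type d ! -perm -2000 -print0 | xargs -0 -r chmod g+s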

            Once again the MDTs hung, but it looked different from the previous hangs: this time all 3 MDTs wouldn't umount, whereas previously it's just been a (random) one of them. The resolution was the same though - we had to power-cycle the MDS.

            I'll attach all warble-related syslog (warble1 is where all 3 MDTs are mounted at the moment).
            I'll attach the first lustre-log.1533268164.11346.gz.

            Stack traces were not printed anywhere because:

            Aug  3 13:49:24 warble1 kernel: LNet: 17382:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel doesn't export show_task
            

            Is this a new change?
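
            (One fallback when libcfs can't dump an individual thread's stack like this - assuming console/syslog access and sysrq enabled - is to have the kernel dump all task backtraces:)

              # dump all task states/backtraces to the kernel log
              echo 1 > /proc/sys/kernel/sysrq
              echo t > /proc/sysrq-trigger
              dmesg | tail -n 500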

            We're running 3.10.0-862.9.1.el7.x86_64 (CentOS 7.5) everywhere now. Server Lustre is 2.10.4 plus:

              lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
              lu10988-lfsck2-gerrit32522-21d33c11.patch
              lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
              lu11107-xattr-gerrit32753-c96a8f08.patch
              lu11111-lfsck-gerrit32796-693fe452.patch
              lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
            

            Because the symptoms are different, if I had to guess I'd say the patch here has helped a lot, and perhaps fixed the issue, and maybe we're now running into something new, but without stack traces I really have no idea.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi Peter,

            No issues so far.

            However, I haven't yet run the chmod/chgrp sweep script, which was maybe a trigger for this issue - partly because (TBH) I've been enjoying the stability, but also because I haven't had any free time.
            I'll run it this week and report back.

            cheers,
            robin

            pjones Peter Jones added a comment -

            Robin

            How long will this patch need to be running before you would conclude that it has improved the situation?

            Peter

            scadmin SC Admin added a comment -

            Thanks. We have that on our MDSes now.

            cheers,
            robin

            laisiyao Lai Siyao added a comment - The backport for 2.10 of https://review.whamcloud.com/#/c/32589/ is on https://review.whamcloud.com/32853 .
            scadmin SC Admin added a comment -

            Hi,

            How's this coming along?
            There are thousands of lines of code different between b2_10 and master across these 4 files, so I'm not confident about backporting this myself to try it out.

            cheers,
            robin


            People

              Assignee: laisiyao Lai Siyao
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 6
