hung threads on MDT and MDT won't umount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches ~= 2.10.5 to 2.12
    • Severity: 2
    • Rank (Obsolete): 9223372036854775807

    Description

      Hi,

      unfortunately once again similar/same symptoms as LU-11082 and LU-11301.

      a chgrp/chmod sweep across files and directories eventually results in a total hang of the filesystem: MDT threads hang, one MDT won't umount, and the MDS has to be powered off to fix the fs.

      processes that are stuck on the client doing the sweep are

      root     142716  0.0  0.0 108252   116 pts/1    S    01:33   0:34 xargs -0 -n5 chgrp -h oz044
      root     236217  0.0  0.0 108252   116 pts/1    S    01:15   0:25 xargs -0 -n5 chgrp -h oz065
      root     385816  0.0  0.0 108252   116 pts/1    S    05:34   0:15 xargs -0 -n5 chgrp -h oz100
      root     418923  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz100 oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_ranked.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_full.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/images oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd46
      root     418944  0.0  0.0 120512   136 pts/1    S    09:34   0:01 chgrp -h oz044 oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/msdriv.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grexec.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grdos.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/makemake oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys
      root     418947  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz065 oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects/functionObjectProperties oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/alpha.water oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/Ur oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/p...
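
For reference, the sweep in the process list above amounts to changing group ownership on every path without following symlinks, five paths per `chgrp -h` invocation. Below is a minimal Python sketch of the same operation. It assumes the paths come from a recursive walk of one project directory (the command feeding `xargs` is not shown in the report), and `iter_paths` and `chgrp_sweep` are hypothetical names used only for illustration. Each `lchown` call is a separate attribute change, which on Lustre would presumably reach the MDS as a setattr request.

```python
import os
from itertools import islice


def iter_paths(root):
    """Yield root and every directory and file beneath it.

    Symlinks are listed but never descended into.
    """
    yield root
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def chgrp_sweep(root, gid, batch_size=5):
    """Set the group of everything under root, batch_size paths at a time.

    Mirrors `xargs -0 -n5 chgrp -h <group>`; returns the number of paths changed.
    """
    changed = 0
    paths = iter_paths(root)
    while True:
        batch = list(islice(paths, batch_size))
        if not batch:
            break
        for path in batch:
            # lchown acts on a symlink itself (the -h behaviour);
            # uid -1 leaves the owner unchanged.
            os.lchown(path, -1, gid)
            changed += 1
    return changed
```

The batching has no effect on the result in Python; it is kept only to mirror the `-n5` grouping seen in the stuck processes.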
      

      I can't see any rc=-116 in the logs this time.

      first hung thread is

      Sep 22 09:37:39 warble2 kernel: LNet: Service thread pid 458124 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 22 09:37:39 warble2 kernel: Pid: 458124, comm: mdt01_095 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      Sep 22 09:37:39 warble2 kernel: Call Trace:
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc159c047>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc157d91b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc16d65f9>] lod_trans_stop+0x259/0x340 [lod]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc177423a>] mdd_trans_stop+0x2a/0x46 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1769bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff65f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff6b33>] mdt_reint_rec+0x83/0x210 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fd836b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fe3f07>] mdt_reint+0x67/0x140 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc156a38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1512e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1516592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb64bb621>] kthread+0xd1/0xe0
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb6b205dd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 22 09:37:39 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Sep 22 09:37:39 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1537573059.458124
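
When comparing this hang with the ones in LU-11082 and LU-11301, it can help to reduce a watchdog dump like the one above to a few fields: the thread pid, the inactivity time, the function/module frames, and the debug log file. Below is a rough Python sketch. The regular expressions are written against the lines in this report only and are an assumption about the general message format, and `summarize` is a hypothetical helper name. The number after `lustre-log.` appears to be a Unix epoch timestamp: here it decodes to 23:37:39 UTC on 21 September 2018, which would agree with the 09:37:39 syslog stamp if the host clock runs at UTC+10 (the report does not state the time zone).

```python
import re
from datetime import datetime, timezone

WATCHDOG_RE = re.compile(r"Service thread pid (\d+) was inactive for ([\d.]+)s")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")
DUMP_RE = re.compile(r"dumping log to (\S+lustre-log\.(\d+)\.(\d+))")

# Lines excerpted from the report (the first one is shortened to its first sentence).
SAMPLE = [
    "Sep 22 09:37:39 warble2 kernel: LNet: Service thread pid 458124 was inactive for 200.31s.",
    "Sep 22 09:37:39 warble2 kernel: [<ffffffffc159c047>] top_trans_wait_result+0xa6/0x155 [ptlrpc]",
    "Sep 22 09:37:39 warble2 kernel: [<ffffffffc157d91b>] top_trans_stop+0x42b/0x930 [ptlrpc]",
    "Sep 22 09:37:39 warble2 kernel: [<ffffffffb64bb621>] kthread+0xd1/0xe0",
    "Sep 22 09:37:39 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1537573059.458124",
]


def summarize(lines):
    """Extract pid, inactivity time, stack frames and debug-log dump details."""
    info = {"pid": None, "inactive_s": None, "frames": [],
            "dump_path": None, "dump_utc": None, "dump_pid": None}
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            info["pid"] = int(m.group(1))
            info["inactive_s"] = float(m.group(2))
            continue
        m = FRAME_RE.search(line)
        if m:
            # (function, module); module is None for core kernel symbols.
            info["frames"].append((m.group(1), m.group(2)))
            continue
        m = DUMP_RE.search(line)
        if m:
            info["dump_path"] = m.group(1)
            # Assumption: the first numeric field is seconds since the epoch.
            info["dump_utc"] = datetime.fromtimestamp(int(m.group(2)), tz=timezone.utc)
            info["dump_pid"] = int(m.group(3))
    return info


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, value)
```

The ordered list of function names can then be compared between incidents to check whether the threads are stuck in the same place.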
      

      there was a subnet manager crash and restart about 15 minutes before the MDS threads hung this time, but I don't think that's related.

      first lustre-log for warble2 and syslog for the cluster are attached.

      I also did a sysrq 't' and 'w' before resetting warble2, so that may be of help to you.
      those start at
      Sep 22 16:26:15
      in messages.
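
To pull the sysrq output out of the attached messages file, one option is to keep only the lines at or after that timestamp. Below is a small Python sketch. It assumes the traditional syslog prefix seen in this report (`Mon DD HH:MM:SS`, English month names) and takes the year as a parameter because that prefix does not carry one. `syslog_time` and `lines_from` are hypothetical helper names.

```python
from datetime import datetime


def syslog_time(line, year=2018):
    """Parse the 15-character 'Mon DD HH:MM:SS' prefix of a syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def lines_from(lines, start, year=2018):
    """Return the lines stamped at or after start, e.g. 'Sep 22 16:26:15'.

    Lines without a parseable timestamp prefix are skipped.
    """
    start_dt = syslog_time(start, year)
    selected = []
    for line in lines:
        try:
            stamp = syslog_time(line, year)
        except ValueError:
            continue
        if stamp >= start_dt:
            selected.append(line)
    return selected
```

For example, `lines_from(open("messages"), "Sep 22 16:26:15")` would return everything logged from the start of the sysrq dumps onward, assuming the attachment is saved locally as `messages`.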

      please let us know if you'd like anything else.
      would a kernel crashdump help?
      we are getting closer to being able to capture one of these.

      cheers,
      robin

          Activity

            [LU-11418] hung threads on MDT and MDT won't umount

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34326/
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 6a412a8671d3d76b5da55c08ada011e7aeea1e8c

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34327
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 694a92ec774d5bd958a61f457fc64380feb95db2

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34326
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: ec0944fc22a351b44332984050606f0efb1d3b63

            laisiyao Lai Siyao added a comment -

            Peter, it's tracked under LU-11681. Once the patches pass review, I'll backport them to 2.10.

            pjones Peter Jones added a comment -

            The existing patch has landed for 2.13 and could now potentially be included in 2.10.x or 2.12.x maintenance releases. Lai, if you're still working on a further patch, what ticket is it being tracked under?

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33661/
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fffef5c29e3bdf0f96168abc3d0488bad06f33bb

            People

              Assignee: laisiyao Lai Siyao
              Reporter: scadmin SC Admin