
LU-11418: hung threads on MDT and MDT won't umount


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.10.4
    • Component/s: None
    • Environment: x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches ~= 2.10.5 to 2.12
    • Severity: 2

    Description

      Hi,

      Unfortunately, we are once again seeing similar or the same symptoms as LU-11082 and LU-11301.

      A chgrp/chmod sweep across files and directories eventually results in a total hang of the filesystem: hung MDT threads, and one MDT that won't umount. The MDS has to be powered off to recover the filesystem.

      The processes stuck on the client doing the sweep are:

      root     142716  0.0  0.0 108252   116 pts/1    S    01:33   0:34 xargs -0 -n5 chgrp -h oz044
      root     236217  0.0  0.0 108252   116 pts/1    S    01:15   0:25 xargs -0 -n5 chgrp -h oz065
      root     385816  0.0  0.0 108252   116 pts/1    S    05:34   0:15 xargs -0 -n5 chgrp -h oz100
      root     418923  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz100 oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_ranked.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_full.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/images oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd46
      root     418944  0.0  0.0 120512   136 pts/1    S    09:34   0:01 chgrp -h oz044 oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/msdriv.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grexec.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grdos.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/makemake oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys
      root     418947  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz065 oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects/functionObjectProperties oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/alpha.water oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/Ur oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/p...
      
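      (For context, a sweep like this is presumably driven by a find/xargs pipeline along the following lines; the actual find invocation isn't shown in the ticket, so the path and predicates here are only illustrative:)

      # hypothetical reconstruction of the sweep: -print0/-0 pass NUL-delimited
      # names safely, -n5 batches five paths per chgrp invocation, and -h changes
      # the group of symlinks themselves rather than the files they point to
      find oz100 -print0 | xargs -0 -n5 chgrp -h oz100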

      I can't see any rc=-116 in the logs this time.

      The first hung thread is:

      Sep 22 09:37:39 warble2 kernel: LNet: Service thread pid 458124 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 22 09:37:39 warble2 kernel: Pid: 458124, comm: mdt01_095 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      Sep 22 09:37:39 warble2 kernel: Call Trace:
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc159c047>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc157d91b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc16d65f9>] lod_trans_stop+0x259/0x340 [lod]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc177423a>] mdd_trans_stop+0x2a/0x46 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1769bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff65f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff6b33>] mdt_reint_rec+0x83/0x210 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fd836b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fe3f07>] mdt_reint+0x67/0x140 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc156a38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1512e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1516592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb64bb621>] kthread+0xd1/0xe0
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb6b205dd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 22 09:37:39 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Sep 22 09:37:39 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1537573059.458124
      
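      (The binary debug log dumped above can be converted to readable text with lctl; the output filename here is just an example:)

      # decode the binary Lustre debug dump into text
      lctl debug_file /tmp/lustre-log.1537573059.458124 /tmp/lustre-log.458124.txt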

      There was a subnet manager crash and restart about 15 minutes before the MDS threads hung this time, but I don't think that's related.

      The first lustre-log for warble2 and the syslog for the cluster are attached.

      I also did a sysrq 't' and 'w' before resetting warble2, so that may be of help to you.
      Those start at Sep 22 16:26:15 in messages.
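      (Those dumps were presumably triggered via the magic SysRq interface, along these lines:)

      # 't' dumps all task states; 'w' dumps blocked/uninterruptible tasks to the kernel log
      echo t > /proc/sysrq-trigger
      echo w > /proc/sysrq-trigger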

      Please let us know if you'd like anything else.
      Would a kernel crashdump help?
      We are getting closer to being able to capture one of these.
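      (For reference, a minimal kdump setup on an EL7 node looks roughly like this, assuming kexec-tools is available and there is enough memory for the crashkernel reservation:)

      # reserve crash kernel memory, then enable kdump so a vmcore is captured on crash
      yum install -y kexec-tools
      grubby --update-kernel=ALL --args="crashkernel=auto"    # reboot required
      systemctl enable kdump
      systemctl start kdump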

      cheers,
      robin


People

    • Assignee: Lai Siyao
    • Reporter: SC Admin
