Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version/s: Lustre 2.10.4
- Fix Version/s: None
- Environment: x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches ~= 2.10.5 to 2.12
- Severity: 2
Description
Hi,
unfortunately once again similar/same symptoms as LU-11082 and LU-11301.
A chgrp/chmod sweep across files and directories eventually hangs the whole filesystem: MDT threads hang, one MDT won't umount, and the MDS has to be powered off to recover the fs.
The processes stuck on the client doing the sweep are:
root 142716 0.0 0.0 108252 116 pts/1 S 01:33 0:34 xargs -0 -n5 chgrp -h oz044
root 236217 0.0 0.0 108252 116 pts/1 S 01:15 0:25 xargs -0 -n5 chgrp -h oz065
root 385816 0.0 0.0 108252 116 pts/1 S 05:34 0:15 xargs -0 -n5 chgrp -h oz100
root 418923 0.0 0.0 120512 136 pts/1 S 09:34 0:00 chgrp -h oz100 oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_ranked.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_full.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/images oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd46
root 418944 0.0 0.0 120512 136 pts/1 S 09:34 0:01 chgrp -h oz044 oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/msdriv.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grexec.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grdos.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/makemake oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys
root 418947 0.0 0.0 120512 136 pts/1 S 09:34 0:00 chgrp -h oz065 oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects/functionObjectProperties oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/alpha.water oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/Ur oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/p...
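For reference, the sweep's command shape can be reproduced from the ps output above: NUL-separated paths batched five at a time into "chgrp -h". This is a minimal sketch only; the top-level find driver, the temp paths, and the use of the caller's own group (so it runs unprivileged) are assumptions, not from the ticket.

```shell
# Sketch of the sweep's shape: find feeds NUL-separated paths to
# "xargs -0 -n5 chgrp -h <group>", matching the stuck client processes.
# Paths and the find driver are hypothetical; only the xargs/chgrp form
# comes from the ps output above.
set -eu
dir=$(mktemp -d)
mkdir -p "$dir/oz100/sub"
touch "$dir/oz100/a" "$dir/oz100/sub/b"
grp=$(id -gn)   # stand-in for the real target group (e.g. oz100)
find "$dir/oz100" -print0 | xargs -0 -n5 chgrp -h "$grp"
echo swept
```

The -h flag makes chgrp operate on symlinks themselves rather than their targets, which is why the real sweep uses it on a tree that may contain links.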
I can't see any rc = -116 in the logs this time.
The first hung thread is:
Sep 22 09:37:39 warble2 kernel: LNet: Service thread pid 458124 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 22 09:37:39 warble2 kernel: Pid: 458124, comm: mdt01_095 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
Sep 22 09:37:39 warble2 kernel: Call Trace:
Sep 22 09:37:39 warble2 kernel: [<ffffffffc159c047>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc157d91b>] top_trans_stop+0x42b/0x930 [ptlrpc]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc16d65f9>] lod_trans_stop+0x259/0x340 [lod]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc177423a>] mdd_trans_stop+0x2a/0x46 [mdd]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc1769bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff65f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff6b33>] mdt_reint_rec+0x83/0x210 [mdt]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fd836b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fe3f07>] mdt_reint+0x67/0x140 [mdt]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc156a38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc1512e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Sep 22 09:37:39 warble2 kernel: [<ffffffffc1516592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Sep 22 09:37:39 warble2 kernel: [<ffffffffb64bb621>] kthread+0xd1/0xe0
Sep 22 09:37:39 warble2 kernel: [<ffffffffb6b205dd>] ret_from_fork_nospec_begin+0x7/0x21
Sep 22 09:37:39 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Sep 22 09:37:39 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1537573059.458124
There was a subnet manager crash and restart about 15 minutes before the MDS threads hung this time, but I don't think that's related.
The first lustre-log for warble2 and syslog for the cluster are attached.
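As an aside for anyone reading the attachment: the binary dump named in the console message above can be converted to text with lctl's debug_file subcommand. A guarded sketch (assumes Lustre userspace utilities are installed and the dump is readable on the node):

```shell
# Decode a binary Lustre debug dump to text with "lctl debug_file <in> <out>".
# The path is the one named in the LustreError console line; the guard lets
# this sketch degrade gracefully on machines without lctl or the dump file.
log=/tmp/lustre-log.1537573059.458124
if command -v lctl >/dev/null 2>&1 && [ -r "$log" ]; then
    lctl debug_file "$log" "$log.txt"
    msg="decoded to $log.txt"
else
    msg="lctl or dump not available here"
fi
echo "$msg"
```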
I also did a sysrq 't' and 'w' before resetting warble2, so that may be of help to you.
Those start at Sep 22 16:26:15 in messages.
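For completeness, this is roughly how those task dumps were triggered (assumed commands; writing to /proc/sysrq-trigger requires root on the affected node, so the sketch guards for that). 't' dumps every task's state and stack, 'w' dumps only uninterruptible (blocked) tasks; both land in the kernel log and hence in syslog/messages.

```shell
# Trigger sysrq task dumps before resetting a hung node. 't' = all tasks
# with stacks, 'w' = blocked (uninterruptible) tasks. Requires root; the
# writability check lets the sketch run unprivileged without failing.
sysrq_dump() {
    if [ -w /proc/sysrq-trigger ]; then
        echo t > /proc/sysrq-trigger
        echo w > /proc/sysrq-trigger
        echo "dumped t and w"
    else
        echo "need root for /proc/sysrq-trigger"
    fi
}
sysrq_dump
```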
Please let us know if you'd like anything else.
Would a kernel crashdump help? We are getting closer to being able to capture one of these.
cheers,
robin
Attachments
Issue Links
- is duplicated by
  - LU-12209 cannot create stripe dir: Stale file handle (Resolved)
- is related to
  - LU-15761 cannot finish MDS recovery (Resolved)
  - LU-13070 mdd_orphan_destroy loop caused by compatibility issue on upgrades to 2.11 or later (Resolved)
  - LU-11681 sanity test 65i fails with 'find /mnt/lustre failed' (Resolved)
  - LU-11857 repeated "could not delete orphan [0x200060151:0x38a8:0x0]: rc = -2" messages (Resolved)
  - LU-12747 sanity: test 811 fail with "MDD orphan cleanup thread not quit" (Resolved)
  - LU-11336 replay-single test 80d hangs on MDT unmount (Open)
  - LU-11516 ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG (Resolved)