
LU-11418: hung threads on MDT and MDT won't umount


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.10.4
    • Component/s: None
    • Environment: x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches ~= 2.10.5 to 2.12
    • Severity: 2

    Description

      Hi,

      Unfortunately, we are once again seeing similar or the same symptoms as LU-11082 and LU-11301.

      A chgrp/chmod sweep across files and directories eventually results in a total hang of the filesystem: hung MDT threads, and one MDT that won't umount. The MDS has to be powered off to recover the filesystem.

      The processes stuck on the client doing the sweep are:

      root     142716  0.0  0.0 108252   116 pts/1    S    01:33   0:34 xargs -0 -n5 chgrp -h oz044
      root     236217  0.0  0.0 108252   116 pts/1    S    01:15   0:25 xargs -0 -n5 chgrp -h oz065
      root     385816  0.0  0.0 108252   116 pts/1    S    05:34   0:15 xargs -0 -n5 chgrp -h oz100
      root     418923  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz100 oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_ranked.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_full.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/images oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd46
      root     418944  0.0  0.0 120512   136 pts/1    S    09:34   0:01 chgrp -h oz044 oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/msdriv.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grexec.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grdos.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/makemake oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys
      root     418947  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz065 oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects/functionObjectProperties oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/alpha.water oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/Ur oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/p...
      
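      (For context, a sweep like this is presumably driven by a find/xargs pipeline along the following lines; the actual find invocation isn't shown in the ticket, so the path and predicates here are only illustrative:)

      # hypothetical reconstruction of the sweep: -print0/-0 pass NUL-delimited
      # names safely, -n5 batches five paths per chgrp invocation, and -h changes
      # the group of symlinks themselves rather than the files they point to
      find oz100 -print0 | xargs -0 -n5 chgrp -h oz100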

      I can't see any rc=-116 in the logs this time.

      The first hung thread is:

      Sep 22 09:37:39 warble2 kernel: LNet: Service thread pid 458124 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 22 09:37:39 warble2 kernel: Pid: 458124, comm: mdt01_095 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      Sep 22 09:37:39 warble2 kernel: Call Trace:
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc159c047>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc157d91b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc16d65f9>] lod_trans_stop+0x259/0x340 [lod]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc177423a>] mdd_trans_stop+0x2a/0x46 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1769bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff65f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff6b33>] mdt_reint_rec+0x83/0x210 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fd836b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fe3f07>] mdt_reint+0x67/0x140 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc156a38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1512e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1516592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb64bb621>] kthread+0xd1/0xe0
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb6b205dd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 22 09:37:39 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Sep 22 09:37:39 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1537573059.458124
      
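      (The binary debug log dumped above can be converted to readable text with lctl; the output filename here is just an example:)

      # decode the binary Lustre debug dump into text
      lctl debug_file /tmp/lustre-log.1537573059.458124 /tmp/lustre-log.458124.txt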

      There was a subnet manager crash and restart about 15 minutes before the MDS threads hung this time, but I don't think that's related.

      The first lustre-log for warble2 and the syslog for the cluster are attached.

      I also did a sysrq 't' and 'w' before resetting warble2, so that may be of help to you.
      Those start at Sep 22 16:26:15 in messages.
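      (Those dumps were presumably triggered via the magic SysRq interface, along these lines:)

      # 't' dumps all task states; 'w' dumps blocked/uninterruptible tasks to the kernel log
      echo t > /proc/sysrq-trigger
      echo w > /proc/sysrq-trigger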

      Please let us know if you'd like anything else.
      Would a kernel crashdump help?
      We are getting closer to being able to capture one of these.
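      (For reference, a minimal kdump setup on an EL7 node looks roughly like this, assuming kexec-tools is available and there is enough memory for the crashkernel reservation:)

      # reserve crash kernel memory, then enable kdump so a vmcore is captured on crash
      yum install -y kexec-tools
      grubby --update-kernel=ALL --args="crashkernel=auto"    # reboot required
      systemctl enable kdump
      systemctl start kdump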

      cheers,
      robin


People

    • Assignee: Lai Siyao
    • Reporter: SC Admin
