Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.4
Labels: None
Environment: CentOS 7.6
Severity: 3
Rank: 9223372036854775807
Description
We hit a sudden MDS problem with 2.12.4, a few hours after LU-13442: users reported this kind of slowness when creating new files:
sh02-ln03:bp86 09:03:28> time touch $SCRATCH/asdf

real    0m26.923s
user    0m0.000s
sys     0m0.003s
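The symptom can be quantified from a client with a quick timing loop (a sketch; DIR is a stand-in for a directory on the affected filesystem, here defaulting to /tmp for illustration):

```shell
#!/bin/sh
# Time a few file creations; against a healthy MDS each touch
# should complete in milliseconds, not tens of seconds.
DIR=${DIR:-/tmp}   # point at $SCRATCH (or similar) on the real filesystem
for i in 1 2 3; do
  t0=$(date +%s%N)
  touch "$DIR/latency_probe.$i"
  t1=$(date +%s%N)
  echo "touch $i: $(( (t1 - t0) / 1000000 )) ms"
  rm -f "$DIR/latency_probe.$i"
done
```

Run periodically, this gives a simple baseline to tell whether the MDS has recovered.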
Looking at the MDS in question, we could see backtraces like these:
[1803096.937756] LNet: Service thread pid 41461 completed after 237.28s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[1805310.391443] LNet: Service thread pid 20879 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1805310.408554] Pid: 20879, comm: mdt03_010 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[1805310.418900] Call Trace:
[1805310.421543] [<ffffffffc1855fa8>] osp_precreate_reserve+0x2e8/0x800 [osp]
[1805310.428575] [<ffffffffc184a949>] osp_declare_create+0x199/0x5f0 [osp]
[1805310.435312] [<ffffffffc179269f>] lod_sub_declare_create+0xdf/0x210 [lod]
[1805310.442330] [<ffffffffc178a86e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[1805310.449586] [<ffffffffc178d80e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
[1805310.457012] [<ffffffffc179192d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
[1805310.464007] [<ffffffffc177296a>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
[1805310.472042] [<ffffffffc1785725>] lod_declare_layout_change+0xb65/0x10f0 [lod]
[1805310.479468] [<ffffffffc17f7f82>] mdd_declare_layout_change+0x62/0x120 [mdd]
[1805310.486724] [<ffffffffc1800ec6>] mdd_layout_change+0xb46/0x16a0 [mdd]
[1805310.493473] [<ffffffffc166135f>] mdt_layout_change+0x2df/0x480 [mdt]
[1805310.500130] [<ffffffffc16697d0>] mdt_intent_layout+0x8a0/0xe00 [mdt]
[1805310.506787] [<ffffffffc1666d35>] mdt_intent_policy+0x435/0xd80 [mdt]
[1805310.513459] [<ffffffffc0ffbe06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[1805310.520394] [<ffffffffc10244f6>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1805310.527688] [<ffffffffc10acb12>] tgt_enqueue+0x62/0x210 [ptlrpc]
[1805310.534034] [<ffffffffc10b564a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1805310.541167] [<ffffffffc105843b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1805310.549063] [<ffffffffc105bda4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1805310.555596] [<ffffffff9fac2e81>] kthread+0xd1/0xe0
[1805310.560688] [<ffffffffa0177c24>] ret_from_fork_nospec_begin+0xe/0x21
[1805310.567340] [<ffffffffffffffff>] 0xffffffffffffffff
[1805310.572544] LustreError: dumping log to /tmp/lustre-log.1586359839.20879
[1805316.373786] LNet: Service thread pid 20879 completed after 206.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
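For triage, the PIDs of service threads that went inactive can be pulled out of a saved dmesg capture with a short script. This is just a sketch: it embeds two sample lines mirroring the messages above; in practice LOG would point at the real dmesg file (e.g. the attached fir-md1-s3_20200408_vmcore-dmesg.txt).

```shell
#!/bin/sh
# Extract the unique PIDs of LNet service threads reported inactive
# from a dmesg capture.
LOG=$(mktemp)
# Sample input standing in for the real dmesg file:
cat > "$LOG" <<'EOF'
[1805310.391443] LNet: Service thread pid 20879 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later.
[1805316.373786] LNet: Service thread pid 20879 completed after 206.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
EOF
# Match only the "was inactive" messages and print the PID field.
PIDS=$(grep -o 'Service thread pid [0-9]* was inactive' "$LOG" \
       | awk '{print $4}' | sort -u)
echo "$PIDS"
rm -f "$LOG"
```

Cross-referencing these PIDs against the foreach bt output of the crash dump shows whether all hung threads share the osp_precreate_reserve path.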
We took a crash dump, available on the Whamcloud FTP as fir-md1-s3_20200408_vmcore.
Attaching the dmesg as fir-md1-s3_20200408_vmcore-dmesg.txt, and the output of "foreach bt" on the crash dump as fir-md1-s3_20200408_foreach_bt.txt.
Restarting the MDS fixed the issue for now.