Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.4
Labels: None
Environment: CentOS 7.6
Severity: 3
Rank: 9223372036854775807
Description
We hit a sudden MDS problem with 2.12.4, a few hours after LU-13442: users reported this kind of slowness when creating new files:
sh02-ln03:bp86 09:03:28> time touch $SCRATCH/asdf

real    0m26.923s
user    0m0.000s
sys     0m0.003s
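The symptom can be quantified from a client with a quick timing loop (a sketch; DIR is a stand-in for a directory on the affected filesystem, here defaulting to /tmp for illustration):

```shell
#!/bin/sh
# Time a few file creations; against a healthy MDS each touch
# should complete in milliseconds, not tens of seconds.
DIR=${DIR:-/tmp}   # point at $SCRATCH (or similar) on the real filesystem
for i in 1 2 3; do
  t0=$(date +%s%N)
  touch "$DIR/latency_probe.$i"
  t1=$(date +%s%N)
  echo "touch $i: $(( (t1 - t0) / 1000000 )) ms"
  rm -f "$DIR/latency_probe.$i"
done
```

Run periodically, this gives a simple baseline to tell whether the MDS has recovered.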
Looking at the MDS in question, we could see backtraces like these:
[1803096.937756] LNet: Service thread pid 41461 completed after 237.28s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[1805310.391443] LNet: Service thread pid 20879 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1805310.408554] Pid: 20879, comm: mdt03_010 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[1805310.418900] Call Trace:
[1805310.421543] [<ffffffffc1855fa8>] osp_precreate_reserve+0x2e8/0x800 [osp]
[1805310.428575] [<ffffffffc184a949>] osp_declare_create+0x199/0x5f0 [osp]
[1805310.435312] [<ffffffffc179269f>] lod_sub_declare_create+0xdf/0x210 [lod]
[1805310.442330] [<ffffffffc178a86e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[1805310.449586] [<ffffffffc178d80e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
[1805310.457012] [<ffffffffc179192d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
[1805310.464007] [<ffffffffc177296a>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
[1805310.472042] [<ffffffffc1785725>] lod_declare_layout_change+0xb65/0x10f0 [lod]
[1805310.479468] [<ffffffffc17f7f82>] mdd_declare_layout_change+0x62/0x120 [mdd]
[1805310.486724] [<ffffffffc1800ec6>] mdd_layout_change+0xb46/0x16a0 [mdd]
[1805310.493473] [<ffffffffc166135f>] mdt_layout_change+0x2df/0x480 [mdt]
[1805310.500130] [<ffffffffc16697d0>] mdt_intent_layout+0x8a0/0xe00 [mdt]
[1805310.506787] [<ffffffffc1666d35>] mdt_intent_policy+0x435/0xd80 [mdt]
[1805310.513459] [<ffffffffc0ffbe06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[1805310.520394] [<ffffffffc10244f6>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1805310.527688] [<ffffffffc10acb12>] tgt_enqueue+0x62/0x210 [ptlrpc]
[1805310.534034] [<ffffffffc10b564a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1805310.541167] [<ffffffffc105843b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1805310.549063] [<ffffffffc105bda4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1805310.555596] [<ffffffff9fac2e81>] kthread+0xd1/0xe0
[1805310.560688] [<ffffffffa0177c24>] ret_from_fork_nospec_begin+0xe/0x21
[1805310.567340] [<ffffffffffffffff>] 0xffffffffffffffff
[1805310.572544] LustreError: dumping log to /tmp/lustre-log.1586359839.20879
[1805316.373786] LNet: Service thread pid 20879 completed after 206.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
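For triage, the PIDs of service threads that went inactive can be pulled out of a saved dmesg capture with a short script. This is just a sketch: it embeds two sample lines mirroring the messages above; in practice LOG would point at the real dmesg file (e.g. the attached fir-md1-s3_20200408_vmcore-dmesg.txt).

```shell
#!/bin/sh
# Extract the unique PIDs of LNet service threads reported inactive
# from a dmesg capture.
LOG=$(mktemp)
# Sample input standing in for the real dmesg file:
cat > "$LOG" <<'EOF'
[1805310.391443] LNet: Service thread pid 20879 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later.
[1805316.373786] LNet: Service thread pid 20879 completed after 206.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
EOF
# Match only the "was inactive" messages and print the PID field.
PIDS=$(grep -o 'Service thread pid [0-9]* was inactive' "$LOG" \
       | awk '{print $4}' | sort -u)
echo "$PIDS"
rm -f "$LOG"
```

Cross-referencing these PIDs against the foreach bt output of the crash dump shows whether all hung threads share the osp_precreate_reserve path.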
We took a crash dump, available on the Whamcloud FTP as fir-md1-s3_20200408_vmcore.
Attaching the dmesg as fir-md1-s3_20200408_vmcore-dmesg.txt, and the output of "foreach bt" on the crash dump as fir-md1-s3_20200408_foreach_bt.txt.
Restarting the MDS fixed the issue for now.