Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.12.0
- Fix Version/s: None
- Environment: CentOS 7.6, Linux Kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12 clients (Sherlock cluster)
- Severity: 2
Description
Fir, our new scratch filesystem for Sherlock running Lustre 2.12, hung today and then operations resumed. Fir uses DNE, PFL and DOM. The MDS hosting mdt[0,2] seems to have been under heavy load. For example:
[481509.742936] Call Trace:
[481509.745520] [<ffffffffc0f73303>] ldlm_process_inodebits_lock+0x93/0x3d0 [ptlrpc]
[481509.753123] [<ffffffffc0f61fc0>] ? ldlm_export_lock_object+0x10/0x10 [ptlrpc]
[481509.760464] [<ffffffffc0f73270>] ? ldlm_inodebits_compat_queue+0x440/0x440 [ptlrpc]
[481509.768321] [<ffffffffc0f4160e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
[481509.775403] [<ffffffffc0f426cb>] __ldlm_reprocess_all+0x10b/0x380 [ptlrpc]
[481509.782482] [<ffffffffc0f434a2>] ldlm_cancel_lock_for_export.isra.26+0x1c2/0x390 [ptlrpc]
[481509.790863] [<ffffffffc0f43851>] ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
[481509.798987] [<ffffffffc0f6c6e2>] ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
[481509.805950] [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
[481509.811650] [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
[481509.819130] [<ffffffff9d6c1c31>] kthread+0xd1/0xe0
[481509.824095] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481509.830277] [<ffffffff9dd74c24>] ret_from_fork_nospec_begin+0xe/0x21
[481509.836799] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481509.842979] Code: 74 0e 4c 89 f2 48 89 de 4c 89 ff e8 d3 a2 fc ff 49 8b 87 10 02 00 00 4c 8d a8 f0 fd ff ff 4d 39 fd 74 2b 49 83 bd a8 00 00 00 00 <74> 0e 4c 89 f2 48 89 de 4c 89 ef e8 a8 a2 fc ff 4d 8b ad 10 02
[481516.050304] INFO: rcu_sched self-detected stall on CPU
[481516.051302] INFO: rcu_sched self-detected stall on CPU
[481516.051303] { 15 }
[481516.051314] (t=60001 jiffies g=115591743 c=115591742 q=6728165)
[481516.051315] Task dump for CPU 14:
[481516.051316] ldlm_bl_01      R  running task        0 54053      2 0x00000088
[481516.051318] Call Trace:
[481516.051362] [<ffffffffc0f43851>] ? ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
[481516.051394] [<ffffffffc0f6c6e2>] ? ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
[481516.051397] [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
[481516.051426] [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
[481516.051427] [<ffffffff9d6c1c31>] ? kthread+0xd1/0xe0
[481516.051429] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481516.051431] [<ffffffff9dd74c24>] ? ret_from_fork_nospec_begin+0xe/0x21
[481516.051432] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481516.051432] Task dump for CPU 15:
The filesystem is still running, but some threads are at 100% CPU on the MDS:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
54053 root      20   0       0      0      0 R 100.0  0.0   2:02.22 ldlm_bl_01
55962 root      20   0       0      0      0 R 100.0  0.0   2:01.79 ldlm_bl_03
56043 root      20   0       0      0      0 R 100.0  0.0   3:20.41 ldlm_bl_04
63820 root      20   0       0      0      0 R 100.0  0.0  10:13.65 mdt03_033
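For threads spinning like this, one way to see where they are stuck is to sample their in-kernel stacks from /proc (a generic diagnostic sketch, not something we ran at the time; PIDs are the ones from the top output above, and it needs root on the MDS):

# Sample the kernel stacks of the busy ldlm_bl_*/mdt* threads
# (PIDs taken from the top output above; repeat a few times to see movement)
for pid in 54053 55962 56043 63820; do
    echo "=== PID $pid ($(cat /proc/$pid/comm)) ==="
    cat /proc/$pid/stack
done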
[root@fir-md1-s1 ~]# lctl get_param ldlm.*.*mdt*.*
ldlm.namespaces.mdt-fir-MDT0000_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0000_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0000_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0000_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0000_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=146645
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_timeouts=8
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0000_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count=62361
ldlm.namespaces.mdt-fir-MDT0002_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0002_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0002_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0002_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0002_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=6472231
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_timeouts=166760
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0002_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0002_UUID.resource_count=359599
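Note that lock_count and lock_timeouts on MDT0002 are far higher than on MDT0000. A simple watch loop along these lines could be used to check whether that backlog is actually draining (just a sketch, not something we ran during the event):

# Poll the ldlm lock counters on both local MDTs every 10 seconds
while true; do
    date
    lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count \
                   ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_timeouts
    sleep 10
done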
The MDS resumed operations by itself; the last logs were:
[482208.719700] LustreError: 63820:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549999197, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8995f523f740/0x1a4b7ac741771fee lrc: 3/0,1 mode: --/PW res: [0x2c00033c7:0x2b6e:0x0].0x0 bits 0x40/0x0 rrc: 39993 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 63820 timeout: 0 lvb_type: 0
[482361.725697] LNet: Service thread pid 63820 completed after 1225.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[482666.099950] Lustre: fir-MDT0002: Client 0ccdc4e2-9749-c9a5-afb4-85874ce74d6c (at 10.0.10.3@o2ib7) reconnecting
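The res field in the "lock timed out" message is a FID in the mdt-fir-MDT0002 namespace, so it can be mapped back to a pathname from a client to identify the file or directory being contended (sketch only; /fir is an assumed client mount point):

# On a Lustre client: resolve the contended resource FID to a path
# (FID copied from the lock timed out message above; /fir = assumed mount point)
lfs fid2path /fir "[0x2c00033c7:0x2b6e:0x0]"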
I'm attaching the full syslog output.
Everything looks good now, even though this MDS is still quite loaded in terms of CPU.
Thanks,
Stephane
sthiell, that's what the LTS12 label means.