Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.12.0
- Fix Version/s: None
- Environment: CentOS 7.6, Linux Kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12 clients (Sherlock cluster)
- Severity: 2
Description
On Fir, our new scratch filesystem for Sherlock running Lustre 2.12, the filesystem hung today and then operations resumed. Fir uses DNE, PFL and DOM. The MDS hosting mdt[0,2] seems to have been under heavy load. For example:
[481509.742936] Call Trace:
[481509.745520] [<ffffffffc0f73303>] ldlm_process_inodebits_lock+0x93/0x3d0 [ptlrpc]
[481509.753123] [<ffffffffc0f61fc0>] ? ldlm_export_lock_object+0x10/0x10 [ptlrpc]
[481509.760464] [<ffffffffc0f73270>] ? ldlm_inodebits_compat_queue+0x440/0x440 [ptlrpc]
[481509.768321] [<ffffffffc0f4160e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
[481509.775403] [<ffffffffc0f426cb>] __ldlm_reprocess_all+0x10b/0x380 [ptlrpc]
[481509.782482] [<ffffffffc0f434a2>] ldlm_cancel_lock_for_export.isra.26+0x1c2/0x390 [ptlrpc]
[481509.790863] [<ffffffffc0f43851>] ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
[481509.798987] [<ffffffffc0f6c6e2>] ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
[481509.805950] [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
[481509.811650] [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
[481509.819130] [<ffffffff9d6c1c31>] kthread+0xd1/0xe0
[481509.824095] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481509.830277] [<ffffffff9dd74c24>] ret_from_fork_nospec_begin+0xe/0x21
[481509.836799] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481509.842979] Code: 74 0e 4c 89 f2 48 89 de 4c 89 ff e8 d3 a2 fc ff 49 8b 87 10 02 00 00 4c 8d a8 f0 fd ff ff 4d 39 fd 74 2b 49 83 bd a8 00 00 00 00 <74> 0e 4c 89 f2 48 89 de 4c 89 ef e8 a8 a2 fc ff 4d 8b ad 10 02
[481516.050304] INFO: rcu_sched self-detected stall on CPU
[481516.051302] INFO: rcu_sched self-detected stall on CPU
[481516.051303] {
[481516.051303] 15
[481516.051304] }
[481516.051314] (t=60001 jiffies g=115591743 c=115591742 q=6728165)
[481516.051315] Task dump for CPU 14:
[481516.051316] ldlm_bl_01 R
[481516.051316] running task
[481516.051318] 0 54053 2 0x00000088
[481516.051318] Call Trace:
[481516.051362] [<ffffffffc0f43851>] ? ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
[481516.051394] [<ffffffffc0f6c6e2>] ? ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
[481516.051397] [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
[481516.051426] [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
[481516.051427] [<ffffffff9d6c1c31>] ? kthread+0xd1/0xe0
[481516.051429] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481516.051431] [<ffffffff9dd74c24>] ? ret_from_fork_nospec_begin+0xe/0x21
[481516.051432] [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
[481516.051432] Task dump for CPU 15:
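For reference, this is roughly how the layout side of this setup can be checked; the paths below are placeholders, not the actual directories involved:

  # On the MDS (fir-md1-s1): confirm which MDTs this server hosts
  lctl dl | grep MDT

  # On a client: show the DNE striping of a directory and the PFL/DOM
  # components of a file (a DOM component reports pattern "mdt")
  lfs getdirstripe /fir/users
  lfs getstripe /fir/users/somefile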
The filesystem is still running, but some threads are at 100% CPU on the MDS:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
54053 root      20   0       0      0      0 R 100.0  0.0   2:02.22 ldlm_bl_01
55962 root      20   0       0      0      0 R 100.0  0.0   2:01.79 ldlm_bl_03
56043 root      20   0       0      0      0 R 100.0  0.0   3:20.41 ldlm_bl_04
63820 root      20   0       0      0      0 R 100.0  0.0  10:13.65 mdt03_033
[root@fir-md1-s1 ~]# lctl get_param ldlm.*.*mdt*.*
ldlm.namespaces.mdt-fir-MDT0000_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0000_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0000_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0000_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0000_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=146645
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_timeouts=8
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0000_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count=62361
ldlm.namespaces.mdt-fir-MDT0002_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0002_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0002_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0002_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0002_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=6472231
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_timeouts=166760
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0002_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0002_UUID.resource_count=359599
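In case the comparison is useful, the same command can be narrowed to just the lock counters; a quick loop like the following (10s interval chosen arbitrarily) is how I'd watch whether lock_count keeps growing on fir-MDT0002 relative to fir-MDT0000:

  # Sample the two MDT lock namespaces every 10 seconds (arbitrary interval)
  while true; do
      date
      lctl get_param ldlm.namespaces.mdt-fir-MDT000*.lock_count \
                     ldlm.namespaces.mdt-fir-MDT000*.lock_timeouts
      sleep 10
  done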
The MDS resumed operations by itself; the last logs were:
[482208.719700] LustreError: 63820:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549999197, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8995f523f740/0x1a4b7ac741771fee lrc: 3/0,1 mode: --/PW res: [0x2c00033c7:0x2b6e:0x0].0x0 bits 0x40/0x0 rrc: 39993 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 63820 timeout: 0 lvb_type: 0
[482361.725697] LNet: Service thread pid 63820 completed after 1225.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[482666.099950] Lustre: fir-MDT0002: Client 0ccdc4e2-9749-c9a5-afb4-85874ce74d6c (at 10.0.10.3@o2ib7) reconnecting
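If it helps to identify the contended object, the resource FID from the "lock timed out" line can be mapped back to a pathname from a client with fid2path (assuming the client mount point is /fir):

  # Resolve the contended resource FID from the log above to a path
  lfs fid2path /fir "[0x2c00033c7:0x2b6e:0x0]"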
I'm attaching the full syslog output.
Now everything looks good even though this MDS is still quite loaded in terms of CPU.
Thanks,
Stephane