LU-11964: Heavy load and soft lockups on MDS with DOM

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.12.0
    • Component/s: None
    • Environment: CentOS 7.6, Linux Kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12 clients (Sherlock cluster)
    • Severity: 2

    Description

      On Fir, our new scratch filesystem for Sherlock running 2.12, the filesystem hung today and then operations resumed. Fir uses DNE, PFL, and DOM. The MDS hosting mdt[0,2] seems to have been under heavy load. For example:

      [481509.742936] Call Trace:
      [481509.745520]  [<ffffffffc0f73303>] ldlm_process_inodebits_lock+0x93/0x3d0 [ptlrpc]
      [481509.753123]  [<ffffffffc0f61fc0>] ? ldlm_export_lock_object+0x10/0x10 [ptlrpc]
      [481509.760464]  [<ffffffffc0f73270>] ? ldlm_inodebits_compat_queue+0x440/0x440 [ptlrpc]
      [481509.768321]  [<ffffffffc0f4160e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
      [481509.775403]  [<ffffffffc0f426cb>] __ldlm_reprocess_all+0x10b/0x380 [ptlrpc]
      [481509.782482]  [<ffffffffc0f434a2>] ldlm_cancel_lock_for_export.isra.26+0x1c2/0x390 [ptlrpc]
      [481509.790863]  [<ffffffffc0f43851>] ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
      [481509.798987]  [<ffffffffc0f6c6e2>] ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
      [481509.805950]  [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
      [481509.811650]  [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
      [481509.819130]  [<ffffffff9d6c1c31>] kthread+0xd1/0xe0
      [481509.824095]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481509.830277]  [<ffffffff9dd74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [481509.836799]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481509.842979] Code: 74 0e 4c 89 f2 48 89 de 4c 89 ff e8 d3 a2 fc ff 49 8b 87 10 02 00 00 4c 8d a8 f0 fd ff ff 4d 39 fd 74 2b 49 83 bd a8 00 00 00 00 <74> 0e 4c 89 f2 48 89 de 4c 89 ef e8 a8 a2 fc ff 4d 8b ad 10 02 
      [481516.050304] INFO: rcu_sched self-detected stall on CPU
      [481516.051302] INFO: rcu_sched self-detected stall on CPU
      [481516.051303]  {
      [481516.051303]  15
      [481516.051304] } 
      [481516.051314]  (t=60001 jiffies g=115591743 c=115591742 q=6728165)
      [481516.051315] Task dump for CPU 14:
      [481516.051316] ldlm_bl_01      R
      [481516.051316]   running task    
      [481516.051318]     0 54053      2 0x00000088
      [481516.051318] Call Trace:
      [481516.051362]  [<ffffffffc0f43851>] ? ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
      [481516.051394]  [<ffffffffc0f6c6e2>] ? ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
      [481516.051397]  [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
      [481516.051426]  [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
      [481516.051427]  [<ffffffff9d6c1c31>] ? kthread+0xd1/0xe0
      [481516.051429]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481516.051431]  [<ffffffff9dd74c24>] ? ret_from_fork_nospec_begin+0xe/0x21
      [481516.051432]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481516.051432] Task dump for CPU 15:
      

      The filesystem is still running, but some threads are at 100% CPU on the MDS:

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                      
       54053 root      20   0       0      0      0 R 100.0  0.0   2:02.22 ldlm_bl_01                                                                                                                                   
       55962 root      20   0       0      0      0 R 100.0  0.0   2:01.79 ldlm_bl_03                                                                                                                                   
       56043 root      20   0       0      0      0 R 100.0  0.0   3:20.41 ldlm_bl_04                                                                                                                                   
       63820 root      20   0       0      0      0 R 100.0  0.0  10:13.65 mdt03_033  
      
      [root@fir-md1-s1 ~]# lctl get_param ldlm.*.*mdt*.*
      ldlm.namespaces.mdt-fir-MDT0000_UUID.contended_locks=32
      ldlm.namespaces.mdt-fir-MDT0000_UUID.contention_seconds=2
      ldlm.namespaces.mdt-fir-MDT0000_UUID.ctime_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0000_UUID.dirty_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0000_UUID.early_lock_cancel=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=146645
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_timeouts=8
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_unused_count=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_max_age=3900000
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_size=4800
      ldlm.namespaces.mdt-fir-MDT0000_UUID.max_nolock_bytes=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.max_parallel_ast=1024
      ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count=62361
      ldlm.namespaces.mdt-fir-MDT0002_UUID.contended_locks=32
      ldlm.namespaces.mdt-fir-MDT0002_UUID.contention_seconds=2
      ldlm.namespaces.mdt-fir-MDT0002_UUID.ctime_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0002_UUID.dirty_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0002_UUID.early_lock_cancel=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=6472231
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_timeouts=166760
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_unused_count=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_max_age=3900000
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_size=4800
      ldlm.namespaces.mdt-fir-MDT0002_UUID.max_nolock_bytes=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.max_parallel_ast=1024
      ldlm.namespaces.mdt-fir-MDT0002_UUID.resource_count=359599
      

      The MDS resumed operations by itself; the last log messages were:

      [482208.719700] LustreError: 63820:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549999197, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8995f523f740/0x1a4b7ac741771fee lrc: 3/0,1 mode: --/PW res: [0x2c00033c7:0x2b6e:0x0].0x0 bits 0x40/0x0 rrc: 39993 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 63820 timeout: 0 lvb_type: 0
      [482361.725697] LNet: Service thread pid 63820 completed after 1225.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [482666.099950] Lustre: fir-MDT0002: Client 0ccdc4e2-9749-c9a5-afb4-85874ce74d6c (at 10.0.10.3@o2ib7) reconnecting
      

      I'm attaching the full syslog output.

      Now everything looks good, even though this MDS is still quite loaded in terms of CPU.

      Thanks,
      Stephane

      Attachments

        1. fir-md1-s1.syslog-20190212.log.gz
          303 kB
        2. fir-md1-s1-20190213.log
          82 kB
        3. fir-md1-s1-dk.log
          1.27 MB
        4. fir-md1-s2-dk.log
          7.15 MB
        5. lustre-log.1550078704.21794
          39 kB
        6. sh-106-60.dk.log.gz
          4.11 MB
        7. sherlocks.txt
          17 kB


          Activity

            gerrit Gerrit Updater added a comment - - edited

            (Incorrect ticket, comment removed...)


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34504/
            Subject: LU-11964 mdc: prevent glimpse lock count grow
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 82dad29511b32c844c849ac4ec3ec0214077f125

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34504
            Subject: LU-11964 mdc: prevent glimpse lock count grow
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 6422dc451867239cb97a6ab864498077220dcbbb

            sthiell Stephane Thiell added a comment -

            Whoops! Great!

            pjones Peter Jones added a comment -

            sthiell that's what the LTS12 label means


            sthiell Stephane Thiell added a comment -

            I noticed that this patch is not in b2_12 yet, but it should be added.

            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34261/
            Subject: LU-11964 mdc: prevent glimpse lock count grow
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b915221b6d0f3457fd9dd202a9d14c5f8385bf47

            sthiell Stephane Thiell added a comment -

            Hi Peter, yes, definitely very good news!

            I have applied the following patch on top of 2.12.0 (and only on the clients); this should be https://review.whamcloud.com/#/c/34261/ Patch Set 3:

            commit 7b8a3b32f4dc35dcb81ff1c9fd4da3e7c4184d26
            Author: Mikhail Pershin <mpershin@whamcloud.com>
            Date:   Fri Feb 15 00:51:00 2019 +0300
            
                LU-11964 mdc: prevent glimpse lock count grow
                
                DOM locks matching tries to ignore locks with
                LDLM_FL_KMS_IGNORE flag during ldlm_lock_match() but
                checks that after ldlm_lock_match() call. Therefore if
                there is any lock with such flag in queue then all other
                locks after it are ignored and new lock is created causing
                big amount of locks on single resource in some access
                patterns.
                Patch move that check inside lock_matches() function, it is
                done by negative check for that flag, i.e. if flag is
                set in lock_matches() parameters then locks with such flag
                are not matched. Corresponding test was added in sanity-dom.sh
                
                Test-Parameters: testlist=sanity-dom
                Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
                Change-Id: Ic45ca10f0e603e79a3a00e4fde13a5fae15ea5fc
            

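            For illustration, here is a minimal standalone C sketch of the behaviour the commit message above describes. Every name in it (toy_lock, match_then_check, match_skip_flagged, the FL_KMS_IGNORE value) is a simplified, hypothetical stand-in rather than the real ldlm code; the only point is where the "ignore this lock" flag is tested: after the match returns (old behaviour) versus inside the match loop (fixed behaviour).

            #include <stdio.h>

            #define FL_KMS_IGNORE 0x1u  /* stand-in for LDLM_FL_KMS_IGNORE */

            struct toy_lock {
                unsigned int flags;     /* per-lock flags */
                int mode;               /* lock mode, simplified to an int */
                struct toy_lock *next;  /* next lock on the resource queue */
            };

            /* Old behaviour: return the first mode-compatible lock and leave the
             * flag check to the caller.  A flagged lock at the head of the queue
             * gets returned, the caller rejects it, and every reusable lock
             * behind it is never considered, so a new lock gets enqueued. */
            static struct toy_lock *match_then_check(struct toy_lock *queue, int mode)
            {
                for (struct toy_lock *lck = queue; lck != NULL; lck = lck->next)
                    if (lck->mode == mode)
                        return lck;
                return NULL;
            }

            /* Fixed behaviour: skip flagged locks inside the match loop itself,
             * so they no longer shadow reusable locks behind them. */
            static struct toy_lock *match_skip_flagged(struct toy_lock *queue, int mode,
                                                       unsigned int skip_flags)
            {
                for (struct toy_lock *lck = queue; lck != NULL; lck = lck->next) {
                    if (lck->flags & skip_flags)
                        continue;       /* ignore, keep scanning the queue */
                    if (lck->mode == mode)
                        return lck;
                }
                return NULL;
            }

            int main(void)
            {
                /* Queue: one flagged lock first, then a perfectly reusable one. */
                struct toy_lock reusable = { .flags = 0, .mode = 1, .next = NULL };
                struct toy_lock flagged  = { .flags = FL_KMS_IGNORE, .mode = 1,
                                             .next = &reusable };

                struct toy_lock *old = match_then_check(&flagged, 1);
                if (old != NULL && (old->flags & FL_KMS_IGNORE))
                    old = NULL;         /* caller-side check rejects the flagged lock */

                struct toy_lock *fix = match_skip_flagged(&flagged, 1, FL_KMS_IGNORE);

                printf("old path reuses an existing lock:   %s\n",
                       old ? "yes" : "no (a new lock would be enqueued)");
                printf("fixed path reuses an existing lock: %s\n",
                       fix ? "yes" : "no");
                return 0;
            }

            Built and run, the old path finds no reusable lock (so a new one would be enqueued, which is how the lock count on a single resource grows), while the fixed path reuses the existing lock.
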
            This is the only patch we use right now. We have also disabled the idle connections feature (see LU-11976). I think the combination of this patch and disabling idle connections has made 2.12 stable for us.

            Thanks!

            pjones Peter Jones added a comment -

            That is good news, Stephane. Is the LU-11964 patch the only one applied on top of vanilla 2.12.0, or is there anything else?


            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
