  LU-11305

MDS' all cpu cores eaten up by ldlm_cn [100% cpu sys]

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3
    • Environment: centos 7.4, SSD MDT * 2; MDS: E5-2697A v4 * 2, 512GB
    • Severity: 3
    • 9223372036854775807

    Description

      The MDS would sometimes have all CPU cores at 100% sys time, all consumed by ldlm_cn processes.

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      4362 root 20 0 0 0 0 R 80.3 0.0 3:20.09 ldlm_cn00_009
      3575 root 20 0 0 0 0 R 78.4 0.0 3:14.47 ldlm_cn00_001

      ...... [a lot more]

       

      When this occurs, the MDS stops servicing requests entirely, that is, all clients get close to stuck on every Lustre operation.

      I am wondering if this might be caused by too many lock cancellations?
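
      One hedged way to check that is to look at how busy the LDLM cancel path is on the MDS while the problem is occurring. The parameter paths below are the usual ones for 2.10-era servers and may differ slightly between versions; opcode o103 in the logs corresponds to LDLM_CANCEL:

      # how many locks and resources each server-side namespace currently holds
      lctl get_param ldlm.namespaces.mdt-*.lock_count ldlm.namespaces.mdt-*.resource_count

      # RPC statistics for the ldlm_canceld service that handles LDLM_CANCEL (o103) requests
      lctl get_param ldlm.services.ldlm_canceld.stats

      # which LDLM kernel threads are burning CPU right now
      top -b -n 1 | grep -E 'ldlm_(cn|bl)' | head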

       

      Sample logs look like the following:

      Aug 28 15:40:13 mds01 kernel: Lustre: 173646:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff887c82057b00 x1606968047458384/t0(0) o103>0dd14e56-602f-6dfd-904c-c16839ee8771@10.5.130.198@o2ib:22/0 lens 328/192 e 0 to 0 dl 1535442012 ref 1 fl Complete:/0/0 rc 0/0
      Aug 28 15:40:13 mds01 kernel: Lustre: shlfs1-MDT0000: Client bd0cfc4b-0556-18d5-e2f1-d639eae3e01d (at 10.5.130.204@o2ib) reconnecting
      Aug 28 15:40:13 mds01 kernel: Lustre: Skipped 2 previous similar messages
      Aug 28 15:40:13 mds01 kernel: Lustre: shlfs1-MDT0000: Connection restored to 1107c10e-a39c-cb3c-70cd-a12f619c2dc7 (at 10.5.130.204@o2ib)
      Aug 28 15:40:13 mds01 kernel: Lustre: Skipped 2 previous similar messages
      Aug 28 15:40:27 mds01 kernel: Lustre: 195350:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff887ab4261e00 x1606966870393364/t0(0) o103->71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
      Aug 28 15:40:27 mds01 kernel: Lustre: 9552:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff883b11cd9200 x1606966870394288/t0(0) o103->71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
      Aug 28 15:40:27 mds01 kernel: Lustre: 9552:0:(service.c:1961:ptlrpc_server_handle_req_in()) Skipped 49 previous similar messages
      Aug 28 15:40:27 mds01 kernel: Lustre: 2827:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff887ac59e2050 x1606964113976448/t0(0) o103->bf7e471f-a94d-caa4-7d7a-d7d587e62d38@10.5.130.207@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      Aug 28 15:40:27 mds01 kernel: Lustre: 2827:0:(service.c:1961:ptlrpc_server_handle_req_in()) Skipped 266 previous similar messages
      Aug 28 15:40:27 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
      Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=7 reqQ=61491 recA=0, svcEst=36, delay=945(jiff)
      Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883b12896600 x1606964113937204/t0(0) o103>bf7e471f-a94d-caa4-7d7a-d7d587e62d38@10.5.130.207@o2ib:36/0 lens 328/0 e 0 to 0 dl 1535442026 ref 2 fl New:/2/ffffffff rc 0/-1
      Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 15326 previous similar messages
      Aug 28 15:40:28 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).

      Aug 28 15:45:13 mds01 kernel: LNet: Service thread pid 5649 was inactive for 200.66s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Aug 28 15:45:15 mds01 kernel: Pid: 5649, comm: mdt01_041
      Aug 28 15:45:18 mds01 kernel: #012Call Trace:
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0aa9f91>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
      Aug 28 15:45:18 mds01 kernel: [<ffffffff816ab6b9>] schedule+0x29/0x70
      Aug 28 15:45:18 mds01 kernel: [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
      Aug 28 15:45:18 mds01 kernel: [<ffffffff8109a6c0>] ? process_timeout+0x0/0x10
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0900e81>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffff810c6440>] ? default_wake_function+0x0/0x20
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6f730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d260>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc1144f10>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc1154352>] mdt_object_local_lock+0x452/0xaf0 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc1144f10>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d260>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0ac9165>] ? lu_object_find_at+0x95/0x290 [obdclass]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc1154a4e>] mdt_object_lock_internal+0x5e/0x300 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc1154d04>] mdt_object_lock+0x14/0x20 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc11784a1>] mdt_reint_open+0xc71/0x31a0 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0adc4ce>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc115eaa3>] ? ucred_set_jobid+0x53/0x70 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc116e8a0>] mdt_reint_rec+0x80/0x210 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc115030b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc1150832>] mdt_intent_reint+0x162/0x430 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc115b59e>] mdt_intent_policy+0x43e/0xc70 [mdt]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c5a12f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c53277>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c7c903>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0ca4ae0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0d01ea2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0d05da5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0caeb16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0cab148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffff810c6452>] ? default_wake_function+0x12/0x20
      Aug 28 15:45:18 mds01 kernel: [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0cb2252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffffc0cb17c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Aug 28 15:45:18 mds01 kernel: [<ffffffff810b252f>] kthread+0xcf/0xe0
      Aug 28 15:45:18 mds01 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
      Aug 28 15:45:18 mds01 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
      Aug 28 15:45:18 mds01 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
      Aug 28 15:45:18 mds01 kernel:
      Aug 28 15:45:18 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1535442313.5649
      Aug 28 15:45:18 mds01 kernel: Lustre: 173675:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (93s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883a352ccb00 x1607113994636712/t0(0) o103><?>@<?>:232/0 lens 328/0 e 0 to 0 dl 1535442222 ref 2 fl New:/0/ffffffff rc 0/-1
      Aug 28 15:45:18 mds01 kernel: Lustre: 173675:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 430820 previous similar messages
      Aug 28 15:45:21 mds01 kernel: Lustre: 16782:0:(service.c:2112:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (45:264s); client may timeout. req@ffff883b9911ec00 x1606966869834560/t0(0) o103->71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:67/0 lens 328/0 e 0 to 0 dl 1535442057 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      Aug 28 15:45:21 mds01 kernel: Lustre: 16782:0:(service.c:2112:ptlrpc_server_handle_request()) Skipped 6215 previous similar messages
      Aug 28 15:45:36 mds01 kernel: LustreError: 5581:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff883a2f27e600 x1609956713313792/t0(0) o104->shlfs1-MDT0000@10.5.130.231@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      Aug 28 15:45:43 mds01 kernel: Lustre: MGS: haven't heard from client 0c714426-f6a5-e005-8bca-baf99256c140 (at 10.5.130.54@o2ib) in 209 seconds. I think it's dead, and I am evicting it. exp ffff887f7a386800, cur 1535442342 expire 1535442192 last 1535442133
      Aug 28 15:45:47 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
      Aug 28 15:45:47 mds01 kernel: Lustre: Skipped 8172 previous similar messages
      Aug 28 15:45:47 mds01 kernel: Lustre: 9552:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=44471 reqQ=1316948 recA=0, svcEst=364, delay=703(jiff)
      Aug 28 15:45:47 mds01 kernel: Lustre: 9552:0:(service.c:1511:ptlrpc_at_check_timed()) Skipped 8173 previous similar messages
      Aug 28 15:45:49 mds01 kernel: LNet: Service thread pid 2647 was inactive for 200.56s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

       

      Attachments

        1. foreach_bt.log (1.45 MB), attached by Aurelien Degremont

        Activity

          [LU-11305] MDS' all cpu cores eaten up by ldlm_cn [100% cpu sys]
          ys Yang Sheng added a comment -

          Please feel free to reopen it once you have new info.

          degremoa Aurelien Degremont (Inactive) added a comment -

          foreach_bt.log

          Not easy to provide you with a vmcore, but here is a 'foreach bt' log. (This is a new occurrence of the problem; do not compare this output with the exact output I posted above.)
          ys Yang Sheng added a comment -

          Hi, Aurelien,

          Would it be possible to share the vmcore with us? Or just paste the output of 'foreach bt' from crash. Many thanks for your help.

          Thanks,
          YangSheng
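
          (For reference, a hedged sketch of how such a 'foreach bt' dump is typically captured with the crash utility; the vmlinux/vmcore paths are placeholders, not taken from this ticket:)

          # load the vmcore with a debuginfo vmlinux matching the crashed kernel (paths are illustrative)
          crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux /var/crash/<host-date>/vmcore

          # inside crash: back-trace every task and redirect the output to a file
          crash> foreach bt > foreach_bt.log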


          degremoa Aurelien Degremont (Inactive) added a comment -

          We were able to reproduce a very similar issue.

          We are seeing an unresponsive MDS, all CPUs at 100%, with lots of ldlm_cn_xxx threads and a few ldlm_bl threads consuming all the CPUs.

          ldlm_cn threads have this kind of stack trace:

           Aug 26 13:38:31 ip-198-19-95-73 kernel: Lustre: 79258:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:136s); client may timeout. req@ffff88d6dfb47850 x1642929689998160/t0(0) o103->0e4989ad-ae70-2c70-9db7-05f5fafae50e@10.26.86.9@tcp:745/0 lens 3584/0 e 0 to 0 dl 1566826575 ref 1 fl Interpret:/2/ffffffff rc 0/-1
           Aug 26 13:38:31 ip-198-19-95-73 kernel: Lustre: 79258:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 11126 previous similar messages
           Aug 26 13:39:11 ip-198-19-95-73 kernel: LNet: Service thread pid 14956 was inactive for 204.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
           Aug 26 13:39:11 ip-198-19-95-73 kernel: Pid: 14956, comm: mdt01_094 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019
           Aug 26 13:39:11 ip-198-19-95-73 kernel: Call Trace:
           Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_completion_ast+0x63e/0x910 [ptlrpc]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_cli_enqueue_local+0x27c/0x880 [ptlrpc]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_object_local_lock+0x759/0xa00 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_object_lock_internal+0x5b/0x2e0 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_reint_open+0xb4e/0x3020 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_reint_rec+0x7f/0x250 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_reint_internal+0x5ee/0x680 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_intent_reint+0x173/0x3e0 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_intent_opc+0x385/0xa00 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_intent_policy+0x1c5/0x370 [mdt]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
           Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]

           

          A couple of ldlm_bl_NN threads have this kind of stack, obtained from crash 'bt -T':

          [ffffc9001eecfa58] try_to_wake_up at ffffffff810b3b74
           [ffffc9001eecfac8] __update_load_avg_se at ffffffff810b9258
           [ffffc9001eecfad8] __update_load_avg_se at ffffffff810b9258
           [ffffc9001eecfb10] update_load_avg at ffffffff810bbacd
           [ffffc9001eecfb30] account_entity_enqueue at ffffffff810bdb45
           [ffffc9001eecfb68] enqueue_entity at ffffffff810bfa63
           [ffffc9001eecfbf8] tgt_blocking_ast at ffffffffa0c78a28 [ptlrpc]
           [ffffc9001eecfc40] ldlm_export_lock_hash at ffffffffa0bea6d0 [ptlrpc]
           --------------
           [ffffc9001eecfc48] cfs_hash_bd_from_key at ffffffffa084fad1 [libcfs]
           [ffffc9001eecfc68] cfs_hash_rw_lock at ffffffffa084f370 [libcfs]
           [ffffc9001eecfc70] cfs_hash_hh_hnode_del at ffffffffa084f3e0 [libcfs]
           [ffffc9001eecfc78] cfs_hash_bd_del_locked at ffffffffa084f7df [libcfs]
           [ffffc9001eecfc90] cfs_hash_rw_unlock at ffffffffa084ff20 [libcfs]
           [ffffc9001eecfc98] cfs_hash_del at ffffffffa085305e [libcfs]
           [ffffc9001eecfcd8] cfs_hash_bd_from_key at ffffffffa084fad1 [libcfs]
           [ffffc9001eecfd18] cfs_hash_bd_from_key at ffffffffa084fad1 [libcfs]
           [ffffc9001eecfd80] ldlm_lock_put at ffffffffa0bc3f73 [ptlrpc]
           [ffffc9001eecfd90] cfs_hash_hh_hhead at ffffffffa084f390 [libcfs]
           [ffffc9001eecfd98] cfs_hash_for_each_relax at ffffffffa0850458 [libcfs]
           [ffffc9001eecfda0] ldlm_cancel_locks_for_export_cb at ffffffffa0bcbe50 [ptlrpc]
           [ffffc9001eecfdb8] ldlm_export_lock_put at ffffffffa0bea740 [ptlrpc]
           [ffffc9001eecfdf0] ldlm_cancel_locks_for_export_cb at ffffffffa0bcbe50 [ptlrpc]
           [ffffc9001eecfe18] cfs_hash_for_each_empty at ffffffffa08538b8 [libcfs]
           [ffffc9001eecfe58] ldlm_export_cancel_locks at ffffffffa0bcc10f [ptlrpc]
           [ffffc9001eecfe80] ldlm_bl_thread_main at ffffffffa0bf5147 [ptlrpc]
           [ffffc9001eecfec0] default_wake_function at ffffffff810b4040
           [ffffc9001eecff08] ldlm_bl_thread_main at ffffffffa0bf4c60 [ptlrpc]
           [ffffc9001eecff10] kthread at ffffffff810a90fa
           [ffffc9001eecff20] kthread at ffffffff810a8fe0
           [ffffc9001eecff50] ret_from_fork at ffffffff81800205

          perf top :

          Samples: 10M of event 'cpu-clock', Event count (approx.): 118365912201 
           Overhead Shared Object Symbol 
           45.75% [libcfs] [k] cfs_hash_hh_hhead
           21.23% [libcfs] [k] cfs_hash_for_each_relax
           6.93% [kernel] [k] __x86_indirect_thunk_rax
           3.85% [obdclass] [k] class_handle2object
           2.72% [kernel] [k] _raw_read_lock
           2.46% [libcfs] [k] cfs_hash_rw_unlock

          This issue appears when there are lots of locks in the MDT namespace.

          ldlm.namespaces.mdt-lustre-MDT0000_UUID.contended_locks=32
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.contention_seconds=2
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.ctime_age_limit=10
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.early_lock_cancel=0
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_count=46344331
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_timeouts=55
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_unused_count=0
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.lru_max_age=3900000
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.lru_size=9600
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.max_nolock_bytes=0
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.max_parallel_ast=1024
           ldlm.namespaces.mdt-lustre-MDT0000_UUID.resource_count=14281836
          ldlm.namespaces.mdt-lustre-MDT0000_UUID.pool.stats=
          snapshot_time 1568643075.466635065 secs.nsecs
          granted 59817 samples [locks] 0 55589762 171302800233 8049595240895311875
          grant 153081925 samples [locks] 1 1 153081925 153081925
          cancel 106817757 samples [locks] 1 1 106817757 106817757
          grant_rate 59817 samples [locks/s] 0 527737 153081896 36151088814526
          cancel_rate 59817 samples [locks/s] 0 315229 106817738 15370244224014
          grant_plan 59817 samples [locks/s] 382861 47857748 189705106051 7747978793807775269
          slv 59817 samples [slv] 1 1378303200000 77899468648378585 6008560669154165631
          recalc_freed 259959295 samples [locks] 0 0 0 0
          recalc_timing 59791 samples [sec] 1 52689 140658 3151201290
          • The MDT is receiving plenty of LDLM_CANCEL requests from clients.
          • As we have 747 GB of RAM, LDLM computes an LDLM pool limit of about 36 million locks (see the sketch after this list).
          • When the number of active locks gets over this limit, the MDT can't cope and the problem starts appearing.
          • If we tune ldlm on the client side to limit the number of locks, we don't see the problem.
          • Based on the MDT hardware we are using, the threshold is around 10 million locks.
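
          A rough sanity check of that RAM-derived limit, assuming the usual server-side ldlm_pool sizing of roughly 50 locks per MB of RAM (an assumption about this Lustre version, so treat the numbers as approximate):

          # approximate server-side LDLM pool limit derived from RAM (~50 locks per MB)
          awk '/MemTotal/ { printf "expected pool limit: ~%d locks\n", ($2 / 1024) * 50 }' /proc/meminfo

          # compare with what the MDT namespace actually reports
          lctl get_param ldlm.namespaces.mdt-*.pool.limit ldlm.namespaces.mdt-*.lock_count

          For 747 GB this works out to roughly 38 million locks, consistent with the ~36 million limit observed above.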

          For anybody hitting a similar issue, I would recommend checking the number of locks on the server side and limiting it on the client side via ldlm.namespaces.*.lru_max_age and ldlm.namespaces.*.lru_size, for example:
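
          A minimal client-side sketch; the values are illustrative rather than recommendations, and the unit of lru_max_age differs between versions (seconds vs milliseconds), so check the current values first:

          # on each client: inspect the current settings
          lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age

          # cap each client namespace at e.g. 2000 cached locks
          # (lru_size=0 means "dynamic, pool-managed"; a non-zero value enforces a hard cap)
          lctl set_param ldlm.namespaces.*.lru_size=2000

          # and/or age out unused locks sooner than the 65-minute default
          lctl set_param ldlm.namespaces.*.lru_max_age=600000

          These settings are not persistent across remounts; on 2.5+ systems "lctl set_param -P" on the MGS node can be used to make them permanent.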

           

          Is this a scalability issue in the LDLM code? Too many active locks?
          Or do you think the problem is elsewhere?

          icostelloddn Ian Costello added a comment -

          I am seeing the same issue as well, however on the OSSs: the similarity is that the ldlm threads consume all cores except 2 (libcfs cpu_pattern) on the OSSs (DNE).


          People

            Assignee: Yang Sheng
            Reporter: anhua (Inactive)
