Description
The MDS would sometimes have all CPU cores at 100% sys time, consumed entirely by ldlm_cn threads:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4362 root 20 0 0 0 0 R 80.3 0.0 3:20.09 ldlm_cn00_009
3575 root 20 0 0 0 0 R 78.4 0.0 3:14.47 ldlm_cn00_001
...... [a lot more]
When this occurs, the MDS stops servicing requests entirely; all clients become nearly stuck on every Lustre operation.
I am wondering whether this might be caused by too many lock cancellations.
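To check that theory, something like the following could be run on the MDS to sample lock counts and the LDLM pool grant/cancel rates (a rough sketch; the exact lctl parameter names may differ between Lustre versions):

# Per-namespace lock counts
lctl get_param ldlm.namespaces.*.lock_count
# LDLM pool grant and cancel rates per namespace
lctl get_param ldlm.namespaces.*.pool.grant_rate ldlm.namespaces.*.pool.cancel_rate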
Sample logs from such an incident are shown below:
Aug 28 15:40:13 mds01 kernel: Lustre: 173646:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff887c82057b00 x1606968047458384/t0(0) o103->0dd14e56-602f-6dfd-904c-c16839ee8771@10.5.130.198@o2ib:22/0 lens 328/192 e 0 to 0 dl 1535442012 ref 1 fl Complete:/0/0 rc 0/0
Aug 28 15:40:13 mds01 kernel: Lustre: shlfs1-MDT0000: Client bd0cfc4b-0556-18d5-e2f1-d639eae3e01d (at 10.5.130.204@o2ib) reconnecting
Aug 28 15:40:13 mds01 kernel: Lustre: Skipped 2 previous similar messages
Aug 28 15:40:13 mds01 kernel: Lustre: shlfs1-MDT0000: Connection restored to 1107c10e-a39c-cb3c-70cd-a12f619c2dc7 (at 10.5.130.204@o2ib)
Aug 28 15:40:13 mds01 kernel: Lustre: Skipped 2 previous similar messages
Aug 28 15:40:27 mds01 kernel: Lustre: 195350:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff887ab4261e00 x1606966870393364/t0(0) o103->71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
Aug 28 15:40:27 mds01 kernel: Lustre: 9552:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff883b11cd9200 x1606966870394288/t0(0) o103->71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
Aug 28 15:40:27 mds01 kernel: Lustre: 9552:0:(service.c:1961:ptlrpc_server_handle_req_in()) Skipped 49 previous similar messages
Aug 28 15:40:27 mds01 kernel: Lustre: 2827:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff887ac59e2050 x1606964113976448/t0(0) o103->bf7e471f-a94d-caa4-7d7a-d7d587e62d38@10.5.130.207@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
Aug 28 15:40:27 mds01 kernel: Lustre: 2827:0:(service.c:1961:ptlrpc_server_handle_req_in()) Skipped 266 previous similar messages
Aug 28 15:40:27 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=7 reqQ=61491 recA=0, svcEst=36, delay=945(jiff)
Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883b12896600 x1606964113937204/t0(0) o103->bf7e471f-a94d-caa4-7d7a-d7d587e62d38@10.5.130.207@o2ib:36/0 lens 328/0 e 0 to 0 dl 1535442026 ref 2 fl New:/2/ffffffff rc 0/-1
Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 15326 previous similar messages
Aug 28 15:40:28 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
Aug 28 15:45:13 mds01 kernel: LNet: Service thread pid 5649 was inactive for 200.66s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 28 15:45:15 mds01 kernel: Pid: 5649, comm: mdt01_041
Aug 28 15:45:18 mds01 kernel: #012Call Trace:
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0aa9f91>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
Aug 28 15:45:18 mds01 kernel: [<ffffffff816ab6b9>] schedule+0x29/0x70
Aug 28 15:45:18 mds01 kernel: [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
Aug 28 15:45:18 mds01 kernel: [<ffffffff8109a6c0>] ? process_timeout+0x0/0x10
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0900e81>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffff810c6440>] ? default_wake_function+0x0/0x20
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6f730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d260>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc1144f10>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc1154352>] mdt_object_local_lock+0x452/0xaf0 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc1144f10>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c6d260>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0ac9165>] ? lu_object_find_at+0x95/0x290 [obdclass]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc1154a4e>] mdt_object_lock_internal+0x5e/0x300 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc1154d04>] mdt_object_lock+0x14/0x20 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc11784a1>] mdt_reint_open+0xc71/0x31a0 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0adc4ce>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc115eaa3>] ? ucred_set_jobid+0x53/0x70 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc116e8a0>] mdt_reint_rec+0x80/0x210 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc115030b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc1150832>] mdt_intent_reint+0x162/0x430 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc115b59e>] mdt_intent_policy+0x43e/0xc70 [mdt]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c5a12f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c53277>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0c7c903>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0ca4ae0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0d01ea2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0d05da5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0caeb16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0cab148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffff810c6452>] ? default_wake_function+0x12/0x20
Aug 28 15:45:18 mds01 kernel: [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0cb2252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffffc0cb17c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
Aug 28 15:45:18 mds01 kernel: [<ffffffff810b252f>] kthread+0xcf/0xe0
Aug 28 15:45:18 mds01 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
Aug 28 15:45:18 mds01 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Aug 28 15:45:18 mds01 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
Aug 28 15:45:18 mds01 kernel:
Aug 28 15:45:18 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1535442313.5649
Aug 28 15:45:18 mds01 kernel: Lustre: 173675:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (93s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883a352ccb00 x1607113994636712/t0(0) o103-><?>@<?>:232/0 lens 328/0 e 0 to 0 dl 1535442222 ref 2 fl New:/0/ffffffff rc 0/-1
Aug 28 15:45:18 mds01 kernel: Lustre: 173675:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 430820 previous similar messages
Aug 28 15:45:21 mds01 kernel: Lustre: 16782:0:(service.c:2112:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (45:264s); client may timeout. req@ffff883b9911ec00 x1606966869834560/t0(0) o103->71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:67/0 lens 328/0 e 0 to 0 dl 1535442057 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Aug 28 15:45:21 mds01 kernel: Lustre: 16782:0:(service.c:2112:ptlrpc_server_handle_request()) Skipped 6215 previous similar messages
Aug 28 15:45:36 mds01 kernel: LustreError: 5581:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff883a2f27e600 x1609956713313792/t0(0) o104->shlfs1-MDT0000@10.5.130.231@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Aug 28 15:45:43 mds01 kernel: Lustre: MGS: haven't heard from client 0c714426-f6a5-e005-8bca-baf99256c140 (at 10.5.130.54@o2ib) in 209 seconds. I think it's dead, and I am evicting it. exp ffff887f7a386800, cur 1535442342 expire 1535442192 last 1535442133
Aug 28 15:45:47 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
Aug 28 15:45:47 mds01 kernel: Lustre: Skipped 8172 previous similar messages
Aug 28 15:45:47 mds01 kernel: Lustre: 9552:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=44471 reqQ=1316948 recA=0, svcEst=364, delay=703(jiff)
Aug 28 15:45:47 mds01 kernel: Lustre: 9552:0:(service.c:1511:ptlrpc_at_check_timed()) Skipped 8173 previous similar messages
Aug 28 15:45:49 mds01 kernel: LNet: Service thread pid 2647 was inactive for 200.56s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
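For reference, the at_early_margin warning and the ldlm_canceld thread pool mentioned in the messages above could be inspected with something along these lines (again only a sketch; the parameter names should be verified against the running Lustre version):

# Adaptive timeout tunables referenced by the early-reply warnings
lctl get_param at_early_margin at_max at_min
# Running vs. maximum allowed ldlm_canceld service threads
lctl get_param ldlm.services.ldlm_canceld.threads_started ldlm.services.ldlm_canceld.threads_max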