[LU-12354] MDT threads stuck at ldlm_completion_ast
| Created: | 29/May/19 | Updated: | 06/Mar/23 | Resolved: | 06/Jan/21 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Hongchao Zhang |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Attachments: | vmcore-dmesg.txt.gz |
| Severity: | 2 |
| Description |
|
MDT with very high load and lots of threads stuck at ldlm_completion_ast. May be a dup of
[2418705.962173] LNet: Service thread pid 34570 was inactive for 400.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[2418706.013418] Pid: 34570, comm: mdt01_236 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
[2418706.013419] Call Trace:
[2418706.013433] [<ffffffffa0beba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[2418706.034654] [<ffffffffa0becb53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[2418706.034673] [<ffffffffa1248d72>] mdt_object_local_lock+0x512/0xb00 [mdt]
[2418706.034680] [<ffffffffa12493be>] mdt_object_lock_internal+0x5e/0x300 [mdt]
[2418706.034687] [<ffffffffa124a164>] mdt_getattr_name_lock+0x8a4/0x1910 [mdt]
[2418706.034694] [<ffffffffa124b480>] mdt_intent_getattr+0x2b0/0x480 [mdt]
[2418706.034701] [<ffffffffa124746b>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
[2418706.034708] [<ffffffffa124fd08>] mdt_intent_policy+0x138/0x320 [mdt]
[2418706.034727] [<ffffffffa0bd12cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
[2418706.034751] [<ffffffffa0bfac13>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
[2418706.034787] [<ffffffffa0c80622>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2418706.034818] [<ffffffffa0c8428a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[2418706.034845] [<ffffffffa0c2c6cb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[2418706.034870] [<ffffffffa0c306b2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[2418706.034873] [<ffffffff810b1131>] kthread+0xd1/0xe0
[2418706.034875] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[2418706.034894] [<ffffffffffffffff>] 0xffffffffffffffff
[2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
PID: 32153 TASK: ffff883ed28d5ee0 CPU: 10 COMMAND: "osp-syn-153-0"
#0 [ffff883ed292f908] __schedule at ffffffff816946af
#1 [ffff883ed292f990] schedule at ffffffff81694cd9
#2 [ffff883ed292f9a0] osp_sync_process_queues at ffffffffa13f8197 [osp]
#3 [ffff883ed292fab0] llog_process_thread at ffffffffa09ad8bb [obdclass]
#4 [ffff883ed292fb78] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
#5 [ffff883ed292fbc0] llog_cat_process_cb at ffffffffa09b3dad [obdclass]
#6 [ffff883ed292fc30] llog_process_thread at ffffffffa09ad8bb [obdclass]
#7 [ffff883ed292fcf8] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
#8 [ffff883ed292fd40] llog_cat_process_or_fork at ffffffffa09b2e49 [obdclass]
#9 [ffff883ed292fdb8] llog_cat_process at ffffffffa09b2f7e [obdclass]
#10 [ffff883ed292fdd8] osp_sync_thread at ffffffffa13f63af [osp]
#11 [ffff883ed292fec8] kthread at ffffffff810b1131
#12 [ffff883ed292ff50] ret_from_fork at ffffffff816a14f7
And
PID: 33442 TASK: ffff881fcfa38000 CPU: 21 COMMAND: "mdt01_073"
#0 [ffff883eb683f718] __schedule at ffffffff816946af
#1 [ffff883eb683f7a8] schedule at ffffffff81694cd9
#2 [ffff883eb683f7b8] schedule_timeout at ffffffff81692574
#3 [ffff883eb683f860] ldlm_completion_ast at ffffffffa0beba11 [ptlrpc]
#4 [ffff883eb683f900] ldlm_cli_enqueue_local at ffffffffa0becb53 [ptlrpc]
#5 [ffff883eb683f998] mdt_object_local_lock at ffffffffa1248d72 [mdt]
#6 [ffff883eb683fa48] mdt_object_lock_internal at ffffffffa12493be [mdt]
#7 [ffff883eb683fa98] mdt_getattr_name_lock at ffffffffa124a164 [mdt]
#8 [ffff883eb683fb20] mdt_intent_getattr at ffffffffa124b480 [mdt]
#9 [ffff883eb683fb60] mdt_intent_opc at ffffffffa124746b [mdt]
#10 [ffff883eb683fbc0] mdt_intent_policy at ffffffffa124fd08 [mdt]
#11 [ffff883eb683fbf8] ldlm_lock_enqueue at ffffffffa0bd12cd [ptlrpc]
#12 [ffff883eb683fc50] ldlm_handle_enqueue0 at ffffffffa0bfac13 [ptlrpc]
#13 [ffff883eb683fce0] tgt_enqueue at ffffffffa0c80622 [ptlrpc]
#14 [ffff883eb683fd00] tgt_request_handle at ffffffffa0c8428a [ptlrpc]
#15 [ffff883eb683fd48] ptlrpc_server_handle_request at ffffffffa0c2c6cb [ptlrpc]
#16 [ffff883eb683fde8] ptlrpc_main at ffffffffa0c306b2 [ptlrpc]
#17 [ffff883eb683fec8] kthread at ffffffff810b1131
#18 [ffff883eb683ff50] ret_from_fork at ffffffff816a14f7
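The two backtraces above are in crash(8) format (PID/TASK/CPU header plus numbered frames). As a sketch of how such traces can be pulled from a vmcore, assuming the matching debuginfo vmlinux for 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 is installed (all paths here are illustrative, not from this report):

  # open the dump against the kernel it was taken from (paths are assumptions)
  crash /usr/lib/debug/lib/modules/3.10.0-693.21.1.el7.20180508.x86_64.lustre2106/vmlinux /var/crash/vmcore
  # backtrace of a single task by PID, e.g. the stuck mdt01_073 thread
  crash> bt 33442
  # backtrace every task in uninterruptible sleep (typical for stuck ldlm waiters)
  crash> foreach UN bt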
|
| Comments |
| Comment by Peter Jones [ 30/May/19 ] |
|
Hongchao, could you please investigate? Thanks, Peter |
| Comment by Patrick Farrell (Inactive) [ 30/May/19 ] |
|
This is not related to
It would be good to have dmesg from the MDS and also the dumped Lustre log(s), like the one listed here:
[2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
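Such a dumped /tmp/lustre-log.* file is a binary Lustre debug log; as a sketch, it can be converted to readable text with lctl (the output file name is illustrative):

  # convert the binary Lustre debug dump to plain text for inspection
  lctl debug_file /tmp/lustre-log.1557763688.34570 /tmp/lustre-log.1557763688.34570.txt
|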
| Comment by Mahmoud Hanafi [ 31/May/19 ] |
|
Attaching dmesg. We don't have /tmp/lustre-log.1557763688.34570; it was deleted. vmcore-dmesg.txt.gz |
| Comment by Aurelien Degremont (Inactive) [ 04/Jun/19 ] |
|
@mhanafi, there are plenty of stack traces in your vmcore logs related to stuck read I/O, like:
[2415376.231996] Pid: 35506, comm: mdt_rdpg00_051 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
[2415376.231998] Call Trace:
[2415376.232023] [<ffffffff8123499a>] __wait_on_buffer+0x2a/0x30
[2415376.249546] [<ffffffffa0ee553c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
[2415376.249554] [<ffffffffa0ed3e7a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
[2415376.249560] [<ffffffffa0ed4270>] htree_dirblock_to_tree+0x40/0x190 [ldiskfs]
[2415376.249566] [<ffffffffa0ed6451>] ldiskfs_htree_fill_tree+0x201/0x2f0 [ldiskfs]
Are you sure you do not have any disk issue? This could be a side effect, but you should first verify that there is no disk issue preventing Lustre threads from doing I/O.
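A quick way to check for that, as a sketch (the device path is an assumption, not taken from this report):

  # per-device latency and utilization on the MDT block devices
  iostat -x 5
  # drive health via SMART (device path is illustrative)
  smartctl -a /dev/sda
  # block-layer and SCSI errors in the kernel log
  dmesg | grep -iE 'i/o error|scsi|blk'
|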
| Comment by Hongchao Zhang [ 05/Jun/19 ] |
|
As per the logs, there is an IBITS lock on the root directory held by the client at 10.151.10.140, but there are network errors:
[2418411.342868] LNetError: 21360:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 5 seconds
[2418411.373829] LNetError: 21360:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.10.140@o2ib (207): c: 54, oc: 0, rc: 63
[2418411.411904] Lustre: 33336:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1557763288/real 1557763393] req@ffff880139dd6600 x1630930117996992/t0(0) o104->nbp2-MDT0000@10.151.10.140@o2ib:15/16 lens 296/224 e 0 to 1 dl 1557763743 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
After 407 seconds the lock waiting timer expired and the lock was released by evicting the client; until then, all other lock enqueue requests were blocked:
[2418713.129522] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 407s: evicting client at 10.151.10.140@o2ib ns: mdt-nbp2-MDT0000_UUID lock: ffff8810b458a200/0x843e968fdb2cd3cc lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 2189 type: IBT flags: 0x60200400000020 nid: 10.151.10.140@o2ib remote: 0x7e3f2695c7cccb4e expref: 160144 pid: 33420 timeout: 6713595267 lvb_type: 0
What is the output of "lctl get_param at_max" and "lctl get_param timeout"?
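For reference, a sketch of collecting the requested values on the MDS (at_max is the ceiling for adaptive timeouts and timeout is the base obd timeout, both in seconds):

  # ceiling for adaptive timeouts, in seconds
  lctl get_param at_max
  # base obd/RPC timeout, in seconds
  lctl get_param timeout
|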
| Comment by Mahmoud Hanafi [ 06/Jun/19 ] |
|
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
We can close this for now. |