[LU-12354] MDT threads stuck at ldlm_completion_ast Created: 29/May/19  Updated: 06/Mar/23  Resolved: 06/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Hongchao Zhang
Resolution: Incomplete Votes: 0
Labels: None

Attachments: File vmcore-dmesg.txt.gz    
Issue Links:
Related
is related to LU-14611 racer test 1 hangs in ls/locking Open
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

MDT under very high load with lots of threads stuck in ldlm_completion_ast and
osp_sync_process_queues.

May be a dup of LU-11359

[2418705.962173] LNet: Service thread pid 34570 was inactive for 400.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[2418706.013418] Pid: 34570, comm: mdt01_236 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
[2418706.013419] Call Trace:
[2418706.013433]  [<ffffffffa0beba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[2418706.034654]  [<ffffffffa0becb53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[2418706.034673]  [<ffffffffa1248d72>] mdt_object_local_lock+0x512/0xb00 [mdt]
[2418706.034680]  [<ffffffffa12493be>] mdt_object_lock_internal+0x5e/0x300 [mdt]
[2418706.034687]  [<ffffffffa124a164>] mdt_getattr_name_lock+0x8a4/0x1910 [mdt]
[2418706.034694]  [<ffffffffa124b480>] mdt_intent_getattr+0x2b0/0x480 [mdt]
[2418706.034701]  [<ffffffffa124746b>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
[2418706.034708]  [<ffffffffa124fd08>] mdt_intent_policy+0x138/0x320 [mdt]
[2418706.034727]  [<ffffffffa0bd12cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
[2418706.034751]  [<ffffffffa0bfac13>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
[2418706.034787]  [<ffffffffa0c80622>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2418706.034818]  [<ffffffffa0c8428a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[2418706.034845]  [<ffffffffa0c2c6cb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[2418706.034870]  [<ffffffffa0c306b2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[2418706.034873]  [<ffffffff810b1131>] kthread+0xd1/0xe0
[2418706.034875]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[2418706.034894]  [<ffffffffffffffff>] 0xffffffffffffffff
[2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
PID: 32153  TASK: ffff883ed28d5ee0  CPU: 10  COMMAND: "osp-syn-153-0"
 #0 [ffff883ed292f908] __schedule at ffffffff816946af
 #1 [ffff883ed292f990] schedule at ffffffff81694cd9
 #2 [ffff883ed292f9a0] osp_sync_process_queues at ffffffffa13f8197 [osp]
 #3 [ffff883ed292fab0] llog_process_thread at ffffffffa09ad8bb [obdclass]
 #4 [ffff883ed292fb78] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
 #5 [ffff883ed292fbc0] llog_cat_process_cb at ffffffffa09b3dad [obdclass]
 #6 [ffff883ed292fc30] llog_process_thread at ffffffffa09ad8bb [obdclass]
 #7 [ffff883ed292fcf8] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
 #8 [ffff883ed292fd40] llog_cat_process_or_fork at ffffffffa09b2e49 [obdclass]
 #9 [ffff883ed292fdb8] llog_cat_process at ffffffffa09b2f7e [obdclass]
#10 [ffff883ed292fdd8] osp_sync_thread at ffffffffa13f63af [osp]
#11 [ffff883ed292fec8] kthread at ffffffff810b1131
#12 [ffff883ed292ff50] ret_from_fork at ffffffff816a14f7

And

PID: 33442  TASK: ffff881fcfa38000  CPU: 21  COMMAND: "mdt01_073"
 #0 [ffff883eb683f718] __schedule at ffffffff816946af
 #1 [ffff883eb683f7a8] schedule at ffffffff81694cd9
 #2 [ffff883eb683f7b8] schedule_timeout at ffffffff81692574
 #3 [ffff883eb683f860] ldlm_completion_ast at ffffffffa0beba11 [ptlrpc]
 #4 [ffff883eb683f900] ldlm_cli_enqueue_local at ffffffffa0becb53 [ptlrpc]
 #5 [ffff883eb683f998] mdt_object_local_lock at ffffffffa1248d72 [mdt]
 #6 [ffff883eb683fa48] mdt_object_lock_internal at ffffffffa12493be [mdt]
 #7 [ffff883eb683fa98] mdt_getattr_name_lock at ffffffffa124a164 [mdt]
 #8 [ffff883eb683fb20] mdt_intent_getattr at ffffffffa124b480 [mdt]
 #9 [ffff883eb683fb60] mdt_intent_opc at ffffffffa124746b [mdt]
#10 [ffff883eb683fbc0] mdt_intent_policy at ffffffffa124fd08 [mdt]
#11 [ffff883eb683fbf8] ldlm_lock_enqueue at ffffffffa0bd12cd [ptlrpc]
#12 [ffff883eb683fc50] ldlm_handle_enqueue0 at ffffffffa0bfac13 [ptlrpc]
#13 [ffff883eb683fce0] tgt_enqueue at ffffffffa0c80622 [ptlrpc]
#14 [ffff883eb683fd00] tgt_request_handle at ffffffffa0c8428a [ptlrpc]
#15 [ffff883eb683fd48] ptlrpc_server_handle_request at ffffffffa0c2c6cb [ptlrpc]
#16 [ffff883eb683fde8] ptlrpc_main at ffffffffa0c306b2 [ptlrpc]
#17 [ffff883eb683fec8] kthread at ffffffff810b1131
#18 [ffff883eb683ff50] ret_from_fork at ffffffff816a14f7
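
For reference, backtraces like the two above can be pulled out of a crash dump with the crash utility; a minimal sketch, assuming a matching debuginfo vmlinux and purely illustrative paths:

# open the vmcore with the matching debug kernel (paths are illustrative)
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
# list uninterruptible (stuck) tasks, then dump the backtrace of one of them
crash> ps | grep UN
crash> bt 33442
# or capture backtraces of every task at once
crash> foreach bt > all-backtraces.txt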


 Comments   
Comment by Peter Jones [ 30/May/19 ]

Hongchao

Could you please investigate?

Thanks

Peter

Comment by Patrick Farrell (Inactive) [ 30/May/19 ]

This is not related to LU-11359, as that is a DOM specific issue.

It would be good to have dmesg from the MDS and also the dumped lustre log(s), like the one listed here:

[2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
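
For reference, the dumped lustre-log files are binary Lustre debug logs and can be converted to text with lctl; a minimal sketch, with the input path taken from the message above and output names purely illustrative:

# convert the binary debug dump to readable text (lctl debug_file / df)
lctl df /tmp/lustre-log.1557763688.34570 /tmp/lustre-log.1557763688.34570.txt
# dump the current in-memory Lustre debug buffer on the MDS (lctl debug_kernel / dk)
lctl dk /tmp/lustre-debug.txt
# and save the kernel ring buffer
dmesg > /tmp/mds-dmesg.txt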
Comment by Mahmoud Hanafi [ 31/May/19 ]

Attaching dmesg. We don't have /tmp/lustre-log.1557763688.34570; it was deleted. vmcore-dmesg.txt.gz

Comment by Aurelien Degremont (Inactive) [ 04/Jun/19 ]

@mhanafi, there are plenty of stack traces in your vmcore logs related to stuck read I/O, like:

[2415376.231996] Pid: 35506, comm: mdt_rdpg00_051 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
[2415376.231998] Call Trace:
[2415376.232023] [<ffffffff8123499a>] __wait_on_buffer+0x2a/0x30
[2415376.249546] [<ffffffffa0ee553c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
[2415376.249554] [<ffffffffa0ed3e7a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
[2415376.249560] [<ffffffffa0ed4270>] htree_dirblock_to_tree+0x40/0x190 [ldiskfs]
[2415376.249566] [<ffffffffa0ed6451>] ldiskfs_htree_fill_tree+0x201/0x2f0 [ldiskfs]

Are you sure you do not have any disk issues? This could be a side effect, but you should first verify there is no disk issue preventing Lustre threads from doing I/O.
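
A minimal sketch of the kind of checks meant here, with device names purely illustrative:

# look for I/O errors or hung-task messages on the MDS
dmesg | grep -iE 'i/o error|blk_update_request|hung task'
# watch per-device latency/utilization; persistently high await/%util on the MDT device points at the storage
iostat -xm 5 /dev/sdX
# check drive health behind the MDT (device name is illustrative)
smartctl -H /dev/sdX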

Comment by Hongchao Zhang [ 05/Jun/19 ]

As per the logs, there is an IBITS lock on the root directory held by the client at 10.151.10.140, but there are network errors
when sending the blocking callback to that client:

[2418411.342868] LNetError: 21360:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 5 seconds
[2418411.373829] LNetError: 21360:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.10.140@o2ib (207): c: 54, oc: 0, rc: 63
[2418411.411904] Lustre: 33336:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1557763288/real 1557763393]  req@ffff880139dd6600 x1630930117996992/t0(0) o104->nbp2-MDT0000@10.151.10.140@o2ib:15/16 lens 296/224 e 0 to 1 dl 1557763743 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1

After 407 seconds, the lock waiting timer expired and the lock was released (the client was evicted); until then, all other lock enqueue
requests were blocked in ldlm_completion_ast waiting for their locks to be granted.

[2418713.129522] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 407s: evicting client at 10.151.10.140@o2ib  ns: mdt-nbp2-MDT0000_UUID lock: ffff8810b458a200/0x843e968fdb2cd3cc lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 2189 type: IBT flags: 0x60200400000020 nid: 10.151.10.140@o2ib remote: 0x7e3f2695c7cccb4e expref: 160144 pid: 33420 timeout: 6713595267 lvb_type: 0
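
A minimal sketch of inspecting the lock and export state on the MDS for this case, using the namespace and NID from the log above (exact parameter paths may vary by release):

# lock and resource counts for the MDT namespace
lctl get_param ldlm.namespaces.mdt-nbp2-MDT0000_UUID.lock_count
lctl get_param ldlm.namespaces.mdt-nbp2-MDT0000_UUID.resource_count
# per-export state of the client that is not answering the blocking callback
lctl get_param mdt.nbp2-MDT0000.exports.10.151.10.140@o2ib.*
# if that client must be cleared before the timer fires, it can be evicted manually
# (syntax varies across releases; check 'lctl list_param mdt.*.evict_client' first)
lctl set_param mdt.nbp2-MDT0000.evict_client=10.151.10.140@o2ib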

What is the output of "lctl get_param at_max" and "lctl get_param timeout"?
This issue can be mitigated by reducing the timeout, but that could cause some operations to fail if they take longer
to process, especially under high load.
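
A minimal sketch of checking and changing these parameters, with the values purely illustrative and not a recommendation:

# current adaptive-timeout ceiling and base RPC timeout on the MDS
lctl get_param at_max timeout
# temporary change on the running node
lctl set_param timeout=50
# persistent, filesystem-wide change, run on the MGS
lctl conf_param nbp2.sys.timeout=50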

Comment by Mahmoud Hanafi [ 06/Jun/19 ]
# lctl get_param at_max
at_max=600
# lctl get_param timeout
timeout=100
Comment by Mahmoud Hanafi [ 06/Jan/21 ]

We can close this for now
