[LU-12354] MDT threads stuck at ldlm_completion_ast

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major
    • Affects Version: Lustre 2.10.7
    Description

      MDT under very high load, with many threads stuck in ldlm_completion_ast and
      osp_sync_process_queues.

      May be a duplicate of LU-11359.

      [2418705.962173] LNet: Service thread pid 34570 was inactive for 400.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [2418706.013418] Pid: 34570, comm: mdt01_236 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
      [2418706.013419] Call Trace:
      [2418706.013433]  [<ffffffffa0beba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [2418706.034654]  [<ffffffffa0becb53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [2418706.034673]  [<ffffffffa1248d72>] mdt_object_local_lock+0x512/0xb00 [mdt]
      [2418706.034680]  [<ffffffffa12493be>] mdt_object_lock_internal+0x5e/0x300 [mdt]
      [2418706.034687]  [<ffffffffa124a164>] mdt_getattr_name_lock+0x8a4/0x1910 [mdt]
      [2418706.034694]  [<ffffffffa124b480>] mdt_intent_getattr+0x2b0/0x480 [mdt]
      [2418706.034701]  [<ffffffffa124746b>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
      [2418706.034708]  [<ffffffffa124fd08>] mdt_intent_policy+0x138/0x320 [mdt]
      [2418706.034727]  [<ffffffffa0bd12cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
      [2418706.034751]  [<ffffffffa0bfac13>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
      [2418706.034787]  [<ffffffffa0c80622>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [2418706.034818]  [<ffffffffa0c8428a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [2418706.034845]  [<ffffffffa0c2c6cb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [2418706.034870]  [<ffffffffa0c306b2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [2418706.034873]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      [2418706.034875]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      [2418706.034894]  [<ffffffffffffffff>] 0xffffffffffffffff
      [2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
      
      PID: 32153  TASK: ffff883ed28d5ee0  CPU: 10  COMMAND: "osp-syn-153-0"
       #0 [ffff883ed292f908] __schedule at ffffffff816946af
       #1 [ffff883ed292f990] schedule at ffffffff81694cd9
       #2 [ffff883ed292f9a0] osp_sync_process_queues at ffffffffa13f8197 [osp]
       #3 [ffff883ed292fab0] llog_process_thread at ffffffffa09ad8bb [obdclass]
       #4 [ffff883ed292fb78] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
       #5 [ffff883ed292fbc0] llog_cat_process_cb at ffffffffa09b3dad [obdclass]
       #6 [ffff883ed292fc30] llog_process_thread at ffffffffa09ad8bb [obdclass]
       #7 [ffff883ed292fcf8] llog_process_or_fork at ffffffffa09ae5cc [obdclass]
       #8 [ffff883ed292fd40] llog_cat_process_or_fork at ffffffffa09b2e49 [obdclass]
       #9 [ffff883ed292fdb8] llog_cat_process at ffffffffa09b2f7e [obdclass]
      #10 [ffff883ed292fdd8] osp_sync_thread at ffffffffa13f63af [osp]
      #11 [ffff883ed292fec8] kthread at ffffffff810b1131
      #12 [ffff883ed292ff50] ret_from_fork at ffffffff816a14f7
      

      And

      PID: 33442  TASK: ffff881fcfa38000  CPU: 21  COMMAND: "mdt01_073"
       #0 [ffff883eb683f718] __schedule at ffffffff816946af
       #1 [ffff883eb683f7a8] schedule at ffffffff81694cd9
       #2 [ffff883eb683f7b8] schedule_timeout at ffffffff81692574
       #3 [ffff883eb683f860] ldlm_completion_ast at ffffffffa0beba11 [ptlrpc]
       #4 [ffff883eb683f900] ldlm_cli_enqueue_local at ffffffffa0becb53 [ptlrpc]
       #5 [ffff883eb683f998] mdt_object_local_lock at ffffffffa1248d72 [mdt]
       #6 [ffff883eb683fa48] mdt_object_lock_internal at ffffffffa12493be [mdt]
       #7 [ffff883eb683fa98] mdt_getattr_name_lock at ffffffffa124a164 [mdt]
       #8 [ffff883eb683fb20] mdt_intent_getattr at ffffffffa124b480 [mdt]
       #9 [ffff883eb683fb60] mdt_intent_opc at ffffffffa124746b [mdt]
      #10 [ffff883eb683fbc0] mdt_intent_policy at ffffffffa124fd08 [mdt]
      #11 [ffff883eb683fbf8] ldlm_lock_enqueue at ffffffffa0bd12cd [ptlrpc]
      #12 [ffff883eb683fc50] ldlm_handle_enqueue0 at ffffffffa0bfac13 [ptlrpc]
      #13 [ffff883eb683fce0] tgt_enqueue at ffffffffa0c80622 [ptlrpc]
      #14 [ffff883eb683fd00] tgt_request_handle at ffffffffa0c8428a [ptlrpc]
      #15 [ffff883eb683fd48] ptlrpc_server_handle_request at ffffffffa0c2c6cb [ptlrpc]
      #16 [ffff883eb683fde8] ptlrpc_main at ffffffffa0c306b2 [ptlrpc]
      #17 [ffff883eb683fec8] kthread at ffffffff810b1131
      #18 [ffff883eb683ff50] ret_from_fork at ffffffff816a14f7
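
      For reference, backtraces like the ones above can usually be gathered either on a live system
      (SysRq dumps blocked tasks to the kernel log) or from a crash dump with the crash utility; a
      minimal sketch, with assumed paths:

      # live: dump all blocked (D-state) tasks to dmesg
      echo w > /proc/sysrq-trigger
      dmesg | tail -n 300

      # post-mortem: backtraces of every task from a vmcore
      crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
      crash> foreach bt > /tmp/foreach-bt.txt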
      

      Attachments

      Issue Links

      Activity

            mhanafi Mahmoud Hanafi added a comment -

            We can close this for now.
            mhanafi Mahmoud Hanafi added a comment -

            # lctl get_param at_max
            at_max=600
            # lctl get_param timeout
            timeout=100
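
            For reference, these tunables can usually be adjusted at runtime with lctl set_param and made
            persistent with lctl conf_param on the MGS; a minimal sketch (values and the nbp2 fsname are
            taken from this ticket, the commands are illustrative, not a recommendation from the ticket):

            # runtime change, on each server
            lctl set_param timeout=100 at_max=600

            # persistent change, run on the MGS node
            lctl conf_param nbp2.sys.timeout=100
            lctl conf_param nbp2.sys.at_max=600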
            hongchao.zhang Hongchao Zhang added a comment - - edited

            According to the logs, an IBITS lock on the root directory is held by the client at 10.151.10.140, but a network
            error occurred when sending the blocking callback to that client:

            [2418411.342868] LNetError: 21360:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 5 seconds
            [2418411.373829] LNetError: 21360:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 10.151.10.140@o2ib (207): c: 54, oc: 0, rc: 63
            [2418411.411904] Lustre: 33336:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1557763288/real 1557763393]  req@ffff880139dd6600 x1630930117996992/t0(0) o104->nbp2-MDT0000@10.151.10.140@o2ib:15/16 lens 296/224 e 0 to 1 dl 1557763743 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
            

            After 407 seconds the lock waiting timer expired and the lock was released; until then, all other lock enqueue
            requests were blocked in ldlm_completion_ast waiting for their locks to be granted.

            [2418713.129522] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 407s: evicting client at 10.151.10.140@o2ib  ns: mdt-nbp2-MDT0000_UUID lock: ffff8810b458a200/0x843e968fdb2cd3cc lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 2189 type: IBT flags: 0x60200400000020 nid: 10.151.10.140@o2ib remote: 0x7e3f2695c7cccb4e expref: 160144 pid: 33420 timeout: 6713595267 lvb_type: 0
            

            What is the output of "lctl get_param at_max" and "lctl get_param timeout"?
            This issue can be mitigated by reducing the timeout, but that could cause some operations to fail if they take
            longer to be processed, especially under high load.
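
            For reference, when a single unresponsive client is holding a lock like this, an administrator can
            usually clear it sooner by evicting that client from the MDT by hand; a minimal sketch (target name
            and NID taken from the logs above, exact syntax assumed for this Lustre version):

            # on the MDS: evict the client that holds the root-directory lock
            lctl set_param mdt.nbp2-MDT0000.evict_client=nid:10.151.10.140@o2ib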


            degremoa Aurelien Degremont (Inactive) added a comment -

            @mhanafi, there are plenty of stack traces in your vmcore logs related to stuck read I/O, like:

            [2415376.231996] Pid: 35506, comm: mdt_rdpg00_051 3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 #1 SMP Wed Jan 30 00:30:34 UTC 2019
            [2415376.231998] Call Trace:
            [2415376.232023] [<ffffffff8123499a>] __wait_on_buffer+0x2a/0x30
            [2415376.249546] [<ffffffffa0ee553c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
            [2415376.249554] [<ffffffffa0ed3e7a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
            [2415376.249560] [<ffffffffa0ed4270>] htree_dirblock_to_tree+0x40/0x190 [ldiskfs]
            [2415376.249566] [<ffffffffa0ed6451>] ldiskfs_htree_fill_tree+0x201/0x2f0 [ldiskfs]

            Are you sure you do not have any disk issues? This could be a side effect, but you should first verify that there is no disk issue preventing Lustre threads from doing I/O.
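
            For reference, a few quick checks for backend storage trouble on the MDS; a minimal sketch (the
            device name is a placeholder, adapt it to the actual MDT block device):

            # kernel-level I/O errors
            dmesg | grep -iE 'i/o error|blk_update_request|scsi' | tail

            # latency and utilization of the MDT device
            iostat -x 5 3

            # drive health, if the MDT sits on a plain SCSI/SATA disk
            smartctl -a /dev/sdX | grep -iE 'error|reallocated|pending'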


            mhanafi Mahmoud Hanafi added a comment -

            Attaching dmesg. We don't have /tmp/lustre-log.1557763688.34570; it was deleted. vmcore-dmesg.txt.gz


            pfarrell Patrick Farrell (Inactive) added a comment -

            This is not related to LU-11359, as that is a DOM-specific issue.

            It would be good to have dmesg from the MDS and also the dumped lustre log(s), like the one listed here:

            [2418706.034897] LustreError: dumping log to /tmp/lustre-log.1557763688.34570
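
            For reference, those dump files are binary Lustre debug logs and can normally be converted to text
            with lctl before being attached; a minimal sketch (paths assumed):

            lctl debug_file /tmp/lustre-log.1557763688.34570 /tmp/lustre-log.1557763688.34570.txt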
            
            pjones Peter Jones added a comment -

            Hongchao

            Could you please investigate?

            Thanks

            Peter


            People

              Assignee:
              hongchao.zhang Hongchao Zhang
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              0
              Watchers:
              8
