[LU-5368] errors in/from ldlm_run_ast_work() ignored Created: 18/Jul/14  Updated: 20/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: ldlm

Issue Links:
Related
is related to LU-13984 Failing to send lock callbacks keeps ... Resolved
Severity: 3
Rank (Obsolete): 14972

 Description   

In ldlm_run_ast_work() errors from ptlrpc_set_wait() are not returned to the caller. In ldlm_process_inodebits_lock() error returns (other than -ERESTART) from ldlm_run_ast_work() are ignored.

Oleg and I discussed this a bit:

[14:20:47] John Hammond: In ldlm_run_ast_work() we ignore errors from ptlrpc. Is this intentional, unintentional, or other?
[14:21:32] John Hammond: Also the callers of ldlm_run_ast_work() often do not propagate its errors.
[14:21:46] Oleg Drokin: the idea is that we cannot do anything about it.
[14:21:53] Oleg Drokin: there was some patch somewhere to do resends
[14:22:06] John Hammond: Yes. Looking at that now.
[14:22:20] John Hammond: You mean http://review.whamcloud.com/#/c/9335/ right?
[14:22:45] Oleg Drokin: yes, that would be part of this
[14:24:59] John Hammond: My thought was: If there is an error in ptlrpc then currently the handler just gets stuck in ldlm_completion_ast(). Wouldn't it be better to return an error back to the client in this situation?
[14:48:33] Oleg Drokin: it gets stuck?
[14:49:31] John Hammond: Sure. Waiting for the lock to be granted.
[14:53:07] Oleg Drokin: we probably should just call failed_ast on the spot for that particular lock to evict the entire client. returning an error does not tell us much because there might be more than one lock blocking granting of this one and such
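
The per-lock handling suggested at the end of the discussion above can be sketched roughly as follows. This is only an illustration under stated assumptions: every type and function here (sketch_client, sketch_ast_item, sketch_failed_ast, sketch_run_ast_work) is a hypothetical stand-in, not the real ldlm or ptlrpc code, and the send result is simulated by a plain integer field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real Lustre types. */
struct sketch_client {
    int evicted;                  /* set once the client has been evicted */
};

struct sketch_ast_item {
    struct sketch_client *client; /* holder of the conflicting lock */
    int send_rc;                  /* simulated result of sending the AST */
};

/* Stand-in for the "failed AST" path: evict the client holding the lock. */
static void sketch_failed_ast(struct sketch_ast_item *item)
{
    item->client->evicted = 1;
}

/*
 * Walk the work list. A failed send is handled for that one lock on the
 * spot, because a single return code cannot say which of several blocking
 * locks failed. The number of failed items is returned so the caller can
 * still see that something went wrong.
 */
static int sketch_run_ast_work(struct sketch_ast_item *items, size_t count)
{
    int failed = 0;

    assert(items != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (items[i].send_rc != 0) {
            sketch_failed_ast(&items[i]);
            failed++;
        }
    }
    return failed;
}
```

With two items, one of which has send_rc set to -ETIMEDOUT, only the client behind the failed item ends up marked as evicted, and the return value is 1.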



 Comments   
Comment by Mikhail Pershin [ 14/Dec/17 ]

It is not quite clear whether we should do anything about this. On error the lock will time out anyway and the client will be evicted, so in that sense there is nothing to do. What could change here: if ptlrpc_set_wait() fails, we can either keep retrying the send until it succeeds or the timeout expires, or evict the client immediately on such a failure. It is probably worth implementing both actions, chosen by error code.
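
A minimal sketch of the "both actions depending on error code" idea from this comment, under explicit assumptions: the names (sketch_classify, sketch_send_with_policy, sketch_flaky_send) are hypothetical, the choice of which error codes count as transient is a guess made only for illustration, and time is modeled as a simple tick counter rather than a real lock timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_action { SKETCH_DONE, SKETCH_RETRY, SKETCH_EVICT };

/*
 * Decide what to do with a send result. Which errors are treated as
 * transient is an assumption of this sketch, not Lustre policy.
 */
static enum sketch_action sketch_classify(int rc, long now, long deadline)
{
    if (rc == 0)
        return SKETCH_DONE;
    if (now >= deadline)
        return SKETCH_EVICT;      /* out of time: give up on the client */

    switch (rc) {
    case -EAGAIN:
    case -ENOMEM:
    case -ETIMEDOUT:
        return SKETCH_RETRY;      /* assumed transient: try again */
    default:
        return SKETCH_EVICT;      /* assumed permanent: evict at once */
    }
}

typedef int (*sketch_send_fn)(void *arg);

/* Retry the send until it succeeds, the deadline passes, or the error is fatal. */
static enum sketch_action sketch_send_with_policy(sketch_send_fn send, void *arg,
                                                  long start, long deadline)
{
    long now = start;

    assert(send != NULL);

    for (;;) {
        enum sketch_action act = sketch_classify(send(arg), now, deadline);

        if (act != SKETCH_RETRY)
            return act;
        now++;                    /* simulated clock: one tick per attempt */
    }
}

/* Demo sender: fails with -EAGAIN until the counter behind arg reaches zero. */
static int sketch_flaky_send(void *arg)
{
    int *failures_left = arg;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -EAGAIN;
    }
    return 0;
}
```

A sender that fails twice and then succeeds ends in SKETCH_DONE when the deadline is generous; the same sender with many failures queued and a short deadline ends in SKETCH_EVICT, as does any error outside the assumed transient set.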

Comment by Mahmoud Hanafi [ 26/Feb/19 ]

In LU-11644 we are getting task traces like this. Could this be related?

Jan 28 00:10:58 nbp2-oss17 kernel: [205777.962494] LNet: Skipped 3 previous similar messages
Jan 28 00:10:58 nbp2-oss17 kernel: [205777.977865] Pid: 19032, comm: ll_ost00_146 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Jan 28 00:10:58 nbp2-oss17 kernel: [205777.977867] Call Trace:
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.977876]  [<ffffffffa0bcd0b0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982409]  [<ffffffffa0b8ae53>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982438]  [<ffffffffa0baba8b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982448]  [<ffffffffa10f98a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982474]  [<ffffffffa0b8a2cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982505]  [<ffffffffa0bb3b23>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982547]  [<ffffffffa0c39232>] tgt_enqueue+0x62/0x210 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982583]  [<ffffffffa0c3ce9a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982615]  [<ffffffffa0be548b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982647]  [<ffffffffa0be9472>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982651]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982653]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982670]  [<ffffffffffffffff>] 0xffffffffffffffff
Jan 28 00:11:00 nbp2-oss17 kernel: [205777.982672] LustreError: dumping log to /tmp/lustre-log.1548663058.19032
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.215909] Pid: 19019, comm: ll_ost00_138 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.215909] Call Trace:
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.215974]  [<ffffffffa0bcd0b0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216000]  [<ffffffffa0b8ae53>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216028]  [<ffffffffa0baba8b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216039]  [<ffffffffa10f98a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216064]  [<ffffffffa0b8a2cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216093]  [<ffffffffa0bb3b23>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216135]  [<ffffffffa0c39232>] tgt_enqueue+0x62/0x210 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216171]  [<ffffffffa0c3ce9a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216204]  [<ffffffffa0be548b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216236]  [<ffffffffa0be9472>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216239]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216242]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216259]  [<ffffffffffffffff>] 0xffffffffffffffff
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216262] Pid: 18757, comm: ll_ost00_035 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216262] Call Trace:
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216297]  [<ffffffffa0bcd0b0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216322]  [<ffffffffa0b8ae53>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216350]  [<ffffffffa0baba8b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216356]  [<ffffffffa10f98a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216382]  [<ffffffffa0b8a2cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216410]  [<ffffffffa0bb3b23>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216445]  [<ffffffffa0c39232>] tgt_enqueue+0x62/0x210 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216480]  [<ffffffffa0c3ce9a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216513]  [<ffffffffa0be548b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216539]  [<ffffffffa0be9472>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216541]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216542]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216546]  [<ffffffffffffffff>] 0xffffffffffffffff
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216548] Pid: 19228, comm: ll_ost00_238 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216549] Call Trace:
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216578]  [<ffffffffa0bcd0b0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216598]  [<ffffffffa0b8ae53>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216620]  [<ffffffffa0baba8b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216625]  [<ffffffffa10f98a4>] ofd_intent_policy+0x444/0xa40 [ofd]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216644]  [<ffffffffa0b8a2cd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216667]  [<ffffffffa0bb3b23>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216697]  [<ffffffffa0c39232>] tgt_enqueue+0x62/0x210 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216727]  [<ffffffffa0c3ce9a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216752]  [<ffffffffa0be548b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216777]  [<ffffffffa0be9472>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216779]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216780]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Jan 28 00:11:00 nbp2-oss17 kernel: [205778.216784]  [<ffffffffffffffff>] 0xffffffffffffffff
