[LU-15800] Fallocate causes transaction deadlock Created: 28/Apr/22 Updated: 02/Aug/23 Resolved: 06/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andriy Skulysh | Assignee: | Arshad Hussain |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ost |
| Severity: | 3 |
| Description |
PID: 74368 TASK: ffff9600eaeac740 CPU: 9 COMMAND: "ll_ost_io02_069"
#0 [ffffa3f1a7a57830] __schedule at ffffffff9034e1d4
#1 [ffffa3f1a7a578c8] schedule at ffffffff9034e648
#2 [ffffa3f1a7a578d8] rwsem_down_read_slowpath at ffffffff903511d0
#3 [ffffa3f1a7a57978] osd_read_lock at ffffffffc1a3379d [osd_ldiskfs]
<-- rc = dt_trans_start_local(env, ofd->ofd_osd , th);
ofd_read_lock(env, ofd_obj);
#4 [ffffa3f1a7a57998] ofd_write_attr_set at ffffffffc186b6cc [ofd]
#5 [ffffa3f1a7a57a00] ofd_commitrw_write at ffffffffc186c812 [ofd]
#6 [ffffa3f1a7a57aa0] ofd_commitrw at ffffffffc18721f1 [ofd]
#7 [ffffa3f1a7a57b60] finish_wait at ffffffff8fb2e5ac
#8 [ffffa3f1a7a57bd8] tgt_brw_write at ffffffffc1255544 [ptlrpc]
PID: 73559 TASK: ffff9601653a97c0 CPU: 11 COMMAND: "ll_ost02_046"
#0 [ffffa3f1a0817970] __schedule at ffffffff9034e1d4
#1 [ffffa3f1a0817a08] schedule at ffffffff9034e648
#2 [ffffa3f1a0817a18] wait_transaction_locked at ffffffffc0ad2089 [jbd2]
#3 [ffffa3f1a0817a68] add_transaction_credits at ffffffffc0ad21c4 [jbd2]
#4 [ffffa3f1a0817ac0] start_this_handle at ffffffffc0ad250a [jbd2]
#5 [ffffa3f1a0817b40] jbd2__journal_restart at ffffffffc0ad2ad0 [jbd2]
#6 [ffffa3f1a0817b80] osd_fallocate_preallocate at ffffffffc1a5b6d2 [osd_ldiskfs]
#7 [ffffa3f1a0817c18] osd_fallocate at ffffffffc1a5b98d [osd_ldiskfs]
<-- ofd_trans_start(env, ofd, fo, th);
ofd_write_lock(env, fo);
#8 [ffffa3f1a0817c50] ofd_object_fallocate at ffffffffc18682f9 [ofd]
#9 [ffffa3f1a0817cb8] ofd_fallocate_hdl at ffffffffc185912f [ofd]
#10 [ffffa3f1a0817d50] tgt_request_handle at ffffffffc1256a53 [ptlrpc]
The deadlock was introduced by:
Commit: 93f700ca241a98630fc5ff19a041e35fbdbf0385
Author: Arshad Hussain <arshad.super@gmail.com>
Committer: Oleg Drokin <green@whamcloud.com>
Author Date: Thu 10 Sep 2020 02:18:13 AM EEST
Committer Date: Thu 29 Oct 2020 06:28:42 AM EET
LU-13765 osd-ldiskfs: Extend credit correctly for fallocate |
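For readers following the two traces above, the wait cycle can be sketched as a small wait-for graph. This is only an illustrative model, not Lustre or jbd2 code: the actor names, `struct wait_graph`, and all three functions are hypothetical. It assumes the usual jbd2 behaviour that a locked transaction cannot commit until every open handle is stopped, and that the write thread already holds a handle from `dt_trans_start_local()` when it blocks in `osd_read_lock()`. The second graph models the idea quoted later in this ticket (not restarting the handle under `ofd_write_lock()`); it is not a description of the patch that landed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors; labels only, not real kernel objects. */
enum actor {
    WRITE_THREAD,      /* ll_ost_io: ofd_commitrw_write path */
    FALLOCATE_THREAD,  /* ll_ost: ofd_object_fallocate path */
    JOURNAL_COMMIT,    /* locked jbd2 transaction waiting to commit */
    NR_ACTORS
};

#define NOT_WAITING (-1)

/* waits_for[a] is the actor that a is blocked on, or NOT_WAITING. */
struct wait_graph {
    int waits_for[NR_ACTORS];
};

/* Each actor waits on at most one other, so follow the chain from every
 * start node; coming back to the start node means a wait cycle. */
bool has_deadlock(const struct wait_graph *g)
{
    for (int start = 0; start < NR_ACTORS; start++) {
        int cur = g->waits_for[start];

        for (int step = 0; step < NR_ACTORS && cur != NOT_WAITING; step++) {
            assert(cur >= 0 && cur < NR_ACTORS);
            if (cur == start)
                return true;
            cur = g->waits_for[cur];
        }
    }
    return false;
}

/* State suggested by the two backtraces in the description. */
struct wait_graph reported_state(void)
{
    struct wait_graph g;

    /* Blocked in osd_read_lock(); fallocate holds the write lock. */
    g.waits_for[WRITE_THREAD] = FALLOCATE_THREAD;
    /* Blocked in wait_transaction_locked() via jbd2__journal_restart(). */
    g.waits_for[FALLOCATE_THREAD] = JOURNAL_COMMIT;
    /* Assumed: commit waits for the write thread's open handle. */
    g.waits_for[JOURNAL_COMMIT] = WRITE_THREAD;
    return g;
}

/* Hypothetical variant: the handle restart happens while the object
 * lock is not held, so the write thread is never blocked on it. */
struct wait_graph restart_without_object_lock(void)
{
    struct wait_graph g;

    g.waits_for[WRITE_THREAD] = NOT_WAITING;
    g.waits_for[FALLOCATE_THREAD] = JOURNAL_COMMIT;
    g.waits_for[JOURNAL_COMMIT] = WRITE_THREAD;
    return g;
}
```

Calling `has_deadlock()` on `reported_state()` finds the cycle write thread -> fallocate thread -> journal commit -> write thread, while the variant without the object lock has no cycle because the write thread can finish and stop its handle.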
| Comments |
| Comment by Andriy Skulysh [ 29/Apr/22 ] |
|
jhammond, it isn't a duplicate of |
| Comment by Peter Jones [ 03/May/22 ] |
|
Arshad, is this something that you are able to look into? Peter |
| Comment by Arshad Hussain [ 03/May/22 ] |
|
Hi Peter, I am looking into it. Thanks |
| Comment by Arshad Hussain [ 04/May/22 ] |
|
Andriy, is there a test case (or manual steps) that can trigger this issue? Environment details would also help (how large was the fallocate?). In my case at least, it would greatly help to have such details or a reproducer. I tried to reproduce the bug by running the standard sanity/sanityn test cases over loop but failed to reproduce the deadlock, so the standard test cases do not catch or trigger this. From the stack traces you provided, it looks like there are two threads: one doing a fallocate (standard prealloc), the other doing a write (e.g. dd)?
> ...in the code but it is violated by jbd2__journal_restart(). It shouldn't be called under ofd_write_lock()
Sorry, I thought this was fine under the ofd lock. Can you please explain in more detail?
osd_extend_restart_trans() -> ldiskfs_journal_restart() -> jbd2_journal_restart() -> jbd2__journal_restart()
Thanks |
| Comment by Peter Jones [ 05/May/22 ] |
|
Given that this issue existed in 2.14, I think that it should be ok to descope it from 2.15.0 and include it in a future 2.15.x maintenance release. |
| Comment by Gerrit Updater [ 10/May/22 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47268 |
| Comment by Gerrit Updater [ 06/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47268/ |
| Comment by Peter Jones [ 06/Jun/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 18/Jul/23 ] |
|
"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51702 |
| Comment by Stephane Thiell [ 01/Aug/23 ] |
|
We also hit this OSS deadlock with Lustre 2.15.3 yesterday. The backtraces seem to match:
PID: 61557 TASK: ffff919ef006a100 CPU: 39 COMMAND: "ll_ost_io00_078"
#0 [ffff919cea263778] __schedule at ffffffff92db78d8
#1 [ffff919cea2637e0] schedule at ffffffff92db7ca9
#2 [ffff919cea2637f0] rwsem_down_read_failed at ffffffff92db9705
#3 [ffff919cea263878] call_rwsem_down_read_failed at ffffffff929ae568
#4 [ffff919cea2638c8] down_read at ffffffff92db7120
#5 [ffff919cea2638e0] osd_read_lock at ffffffffc16d4e7c [osd_ldiskfs]
#6 [ffff919cea263908] ofd_write_attr_set at ffffffffc1863129 [ofd]
#7 [ffff919cea263978] ofd_commitrw_write at ffffffffc1863fd2 [ofd]
#8 [ffff919cea263a30] ofd_commitrw at ffffffffc18698e0 [ofd]
#9 [ffff919cea263ac0] tgt_brw_write at ffffffffc140c695 [ptlrpc]
#10 [ffff919cea263ca8] tgt_request_handle at ffffffffc140f25f [ptlrpc]
#11 [ffff919cea263d38] ptlrpc_server_handle_request at ffffffffc13b8aa3 [ptlrpc]
#12 [ffff919cea263df0] ptlrpc_main at ffffffffc13ba734 [ptlrpc]
#13 [ffff919cea263ec8] kthread at ffffffff926cb621
#14 [ffff919cea263f50] ret_from_fork_nospec_begin at ffffffff92dc51dd
PID: 40363 TASK: ffff915f2d945280 CPU: 10 COMMAND: "ll_ost00_123"
#0 [ffff9159d062f8f0] __schedule at ffffffff92db78d8
#1 [ffff9159d062f958] schedule at ffffffff92db7ca9
#2 [ffff9159d062f968] wait_transaction_locked at ffffffffc03ca085 [jbd2]
#3 [ffff9159d062f9c0] add_transaction_credits at ffffffffc03ca378 [jbd2]
#4 [ffff9159d062fa20] start_this_handle at ffffffffc03ca601 [jbd2]
#5 [ffff9159d062fab8] jbd2__journal_restart at ffffffffc03cacf2 [jbd2]
#6 [ffff9159d062faf8] jbd2_journal_restart at ffffffffc03cad63 [jbd2]
#7 [ffff9159d062fb08] osd_extend_restart_trans at ffffffffc1700d8c [osd_ldiskfs]
#8 [ffff9159d062fb28] osd_fallocate at ffffffffc1702dc4 [osd_ldiskfs]
#9 [ffff9159d062fbb0] ofd_object_fallocate at ffffffffc185fb4f [ofd]
#10 [ffff9159d062fc18] ofd_fallocate_hdl at ffffffffc1848835 [ofd]
#11 [ffff9159d062fca8] tgt_request_handle at ffffffffc140f25f [ptlrpc]
#12 [ffff9159d062fd38] ptlrpc_server_handle_request at ffffffffc13b8aa3 [ptlrpc]
#13 [ffff9159d062fdf0] ptlrpc_main at ffffffffc13ba734 [ptlrpc]
#14 [ffff9159d062fec8] kthread at ffffffff926cb621
#15 [ffff9159d062ff50] ret_from_fork_nospec_begin at ffffffff92dc51dd
We will try the proposed patch (thanks!). |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51702/ |