[LU-16966] ofd_object_fallocate dead lock? Created: 15/Jul/23 Updated: 17/Nov/23 Resolved: 16/Oct/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.2 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre version: kernel: 4.18.0-425.3.1.el8_lustre.x86_64 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We have had multiple servers get dead lock with this stack trace. (attached longer console output) Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds. Jul 15 05:46:28 nbp11-srv3 kernel: Tainted: G OE --------- - - 4.18.0-425.3.1.el8_lustre.x86_64 #1 Jul 15 05:46:28 nbp11-srv3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 15 05:46:28 nbp11-srv3 kernel: task:ll_ost07_000 state:D stack: 0 pid: 9230 ppid: 2 flags:0x80004080 Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace: Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860 Jul 15 05:46:28 nbp11-srv3 kernel: schedule+0x35/0xa0 Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2] Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80 Jul 15 05:46:28 nbp11-srv3 kernel: add_transaction_credits+0xd4/0x290 [jbd2] Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_do_update_inode+0x604/0x800 [ldiskfs] Jul 15 05:46:28 nbp11-srv3 kernel: start_this_handle+0x10a/0x520 [jbd2] Jul 15 05:46:28 nbp11-srv3 kernel: ? osd_fallocate_preallocate.isra.38+0x275/0x760 [osd_ldiskfs] Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_mark_iloc_dirty+0x32/0x90 [ldiskfs] Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2] Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs] Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate+0xfd/0x370 [osd_ldiskfs] Jul 15 05:46:28 nbp11-srv3 kernel: ofd_object_fallocate+0x5dd/0xa30 [ofd] Jul 15 05:46:28 nbp11-srv3 kernel: ofd_fallocate_hdl+0x467/0x730 [ofd] Jul 15 05:46:28 nbp11-srv3 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc] Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc] Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc] Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc] Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc] Jul 15 05:46:28 nbp11-srv3 kernel: kthread+0x10a/0x120 Jul 15 05:46:28 nbp11-srv3 kernel: ? set_kthread_struct+0x50/0x50 Jul 15 05:46:28 nbp11-srv3 kernel: ret_from_fork+0x1f/0x40 |
| Comments |
| Comment by Alex Zhuravlev [ 17/Jul/23 ] |
|
looks like a duplicate of |
| Comment by Peter Jones [ 17/Jul/23 ] |
|
Thanks Alex! Can you please port the fox from Thanks Peter |
| Comment by Xing Huang [ 18/Jul/23 ] |
|
Yes, Peter, I'll today. |
| Comment by Mahmoud Hanafi [ 03/Aug/23 ] |
|
We applied Aug 2 23:09:33 nbp15-srv1 kernel: INFO: task ll_ost02_011:14622 blocked for more than 120 seconds. Aug 2 23:09:33 nbp15-srv1 kernel: Tainted: G OE --------- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1 Aug 2 23:09:33 nbp15-srv1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 2 23:09:33 nbp15-srv1 kernel: task:ll_ost02_011 state:D stack: 0 pid:14622 ppid: 2 flags:0x80004080 Aug 2 23:09:33 nbp15-srv1 kernel: Call Trace: Aug 2 23:09:33 nbp15-srv1 kernel: __schedule+0x2d1/0x870 Aug 2 23:09:33 nbp15-srv1 kernel: schedule+0x55/0xf0 Aug 2 23:09:33 nbp15-srv1 kernel: wait_transaction_locked+0x89/0xd0 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? finish_wait+0x80/0x80 Aug 2 23:09:33 nbp15-srv1 kernel: add_transaction_credits+0xd4/0x290 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? osd_declare_qid+0x398/0x4c0 [osd_ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: start_this_handle+0x10a/0x520 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? jbd2__journal_start+0x8f/0x1f0 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? kmem_cache_alloc+0x13f/0x280 Aug 2 23:09:33 nbp15-srv1 kernel: jbd2__journal_start+0xee/0x1f0 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? osd_trans_start+0x13b/0x500 [osd_ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: osd_trans_start+0x13b/0x500 [osd_ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: ofd_attr_set+0x546/0x1090 [ofd] Aug 2 23:09:33 nbp15-srv1 kernel: ofd_setattr_hdl+0x458/0x8e0 [ofd] Aug 2 23:09:33 nbp15-srv1 kernel: tgt_request_handle+0xccd/0x1b10 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: kthread+0x134/0x150 Aug 2 23:09:33 nbp15-srv1 kernel: ? set_kthread_struct+0x50/0x50 Aug 2 23:09:33 nbp15-srv1 kernel: ret_from_fork+0x1f/0x40 Aug 2 23:09:33 nbp15-srv1 kernel: INFO: task ll_ost03_006:14631 blocked for more than 120 seconds. Aug 2 23:09:33 nbp15-srv1 kernel: Tainted: G OE --------- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1 Aug 2 23:09:33 nbp15-srv1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 2 23:09:33 nbp15-srv1 kernel: task:ll_ost03_006 state:D stack: 0 pid:14631 ppid: 2 flags:0x80004080 Aug 2 23:09:33 nbp15-srv1 kernel: Call Trace: Aug 2 23:09:33 nbp15-srv1 kernel: __schedule+0x2d1/0x870 Aug 2 23:09:33 nbp15-srv1 kernel: schedule+0x55/0xf0 Aug 2 23:09:33 nbp15-srv1 kernel: wait_transaction_locked+0x89/0xd0 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? finish_wait+0x80/0x80 Aug 2 23:09:33 nbp15-srv1 kernel: add_transaction_credits+0xd4/0x290 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? ldiskfs_do_update_inode+0x604/0x800 [ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: start_this_handle+0x10a/0x520 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: ? osd_fallocate_preallocate.isra.37+0x275/0x760 [osd_ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: ? ldiskfs_mark_iloc_dirty+0x32/0x90 [ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2] Aug 2 23:09:33 nbp15-srv1 kernel: osd_fallocate_preallocate.isra.37+0x5a6/0x760 [osd_ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: osd_fallocate+0xfd/0x370 [osd_ldiskfs] Aug 2 23:09:33 nbp15-srv1 kernel: ofd_object_fallocate+0x5dc/0xa30 [ofd] Aug 2 23:09:33 nbp15-srv1 kernel: ofd_fallocate_hdl+0x467/0x730 [ofd] Aug 2 23:09:33 nbp15-srv1 kernel: tgt_request_handle+0xccd/0x1b10 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc] Aug 2 23:09:33 nbp15-srv1 kernel: kthread+0x134/0x150 Aug 2 23:09:33 nbp15-srv1 kernel: ? set_kthread_struct+0x50/0x50 Aug 2 23:09:33 nbp15-srv1 kernel: ret_from_fork+0x1f/0x40 |
| Comment by Mahmoud Hanafi [ 04/Aug/23 ] |
|
This looks more like
|
| Comment by Mahmoud Hanafi [ 04/Aug/23 ] |
|
Filesystem hung again. attaching stack output.
|
| Comment by Alex Zhuravlev [ 04/Aug/23 ] |
thanks, need some time to analyze the traces. |
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
I noticed OOM:
Aug 2 20:55:33 nbp15-srv1 kernel: mdt_rdpg02_001: page allocation failure: order:0, mode:0x40(__GFP_IO), nodemask=(null),cpuset=/,mems_allowed=0
|
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
|
can you please show output for lctl get_param osd*.OST.brw_stats from that OST? |
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
|
could you please clarify what exact version you're running? the ticket says 2.15.2, but you also siad |
| Comment by Peter Jones [ 07/Aug/23 ] |
|
Alex I would expect that the answer will be the b2_15 branch from https://github.com/champios/lustre-nas Peter |
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
|
thanks Peter |
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
|
b2_15-nas is the only "15" one, right? |
| Comment by Mahmoud Hanafi [ 07/Aug/23 ] |
|
Initially we were running wc 2.15.2. After we got the patch we were running 2.15.3 + https://github.com/champios/lustre-nas) btw_stats attached. |
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
|
looking at the code in that tree I noticed couple things: |
| Comment by Alex Zhuravlev [ 07/Aug/23 ] |
unfortunately, |
| Comment by Mahmoud Hanafi [ 08/Aug/23 ] |
|
Is there some additional debugging we can do to help with this issue. |
| Comment by Alex Zhuravlev [ 09/Aug/23 ] |
would it be possible to apply |
| Comment by Mahmoud Hanafi [ 09/Aug/23 ] |
|
We are going to do a build with We would need a backport for |
| Comment by Mahmoud Hanafi [ 16/Aug/23 ] |
|
Still waiting for answer to above question. |
| Comment by Colin Faber [ 17/Aug/23 ] |
|
Hi mhanafi |
| Comment by Mahmoud Hanafi [ 28/Aug/23 ] |
|
We were able to get some of our servers patched with
|
| Comment by Mahmoud Hanafi [ 03/Sep/23 ] |
|
Can we get an update please. We have filesystem hanging multiple times a day. |
| Comment by Alex Zhuravlev [ 03/Sep/23 ] |
mhanafisorry for the delay, I'm trying to reconstruct the problem using brw_stats you provided. |
| Comment by Alex Zhuravlev [ 03/Sep/23 ] |
|
mhanafi could you please apply the patch just added? the patch reverts range locking. |
| Comment by Mahmoud Hanafi [ 03/Sep/23 ] |
|
Thanks we'll get the patch applied this week and let you know the results. |
| Comment by Alex Zhuravlev [ 04/Sep/23 ] |
|
mhanafi I'm very very sorry, but ... |
| Comment by Gerrit Updater [ 04/Sep/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52264 |
| Comment by Mahmoud Hanafi [ 25/Sep/23 ] |
|
We applied the patch provided and we have not seeing the issue. |
| Comment by Gerrit Updater [ 16/Oct/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52264/ |
| Comment by Peter Jones [ 16/Oct/23 ] |
|
Landed for 2.16 |
| Comment by Alex Zhuravlev [ 16/Oct/23 ] |
yes, b_es6_0 needs that for sure, will check b2_15 |
| Comment by Gerrit Updater [ 16/Oct/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52710 |
| Comment by Gerrit Updater [ 13/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52710/ |
| Comment by Peter Jones [ 17/Nov/23 ] |
|
Will be included in 2.15.4 |