Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.15.2
- Fix Version/s: None
- Environment:
Lustre version:
lustre-iokit-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
kmod-lustre-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
pcp-lustre-0.4.16-2.noarch
lustre-devel-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
lustre-osd-ldiskfs-mount-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
lustre-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
lustre-tests-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
kmod-lustre-osd-ldiskfs-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
kmod-lustre-tests-2.15.2-1nas_mofed496el8_lustre_20230111v1.x86_64
kernel: 4.18.0-425.3.1.el8_lustre.x86_64
mofed: mlnx-ofa_kernel-4.9-mofed496.x86_64
Description
We have had multiple servers deadlock with this stack trace.
(A longer console output is attached.)
Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds.
Jul 15 05:46:28 nbp11-srv3 kernel: Tainted: G OE --------- - - 4.18.0-425.3.1.el8_lustre.x86_64 #1
Jul 15 05:46:28 nbp11-srv3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 15 05:46:28 nbp11-srv3 kernel: task:ll_ost07_000 state:D stack: 0 pid: 9230 ppid: 2 flags:0x80004080
Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace:
Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860
Jul 15 05:46:28 nbp11-srv3 kernel: schedule+0x35/0xa0
Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80
Jul 15 05:46:28 nbp11-srv3 kernel: add_transaction_credits+0xd4/0x290 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_do_update_inode+0x604/0x800 [ldiskfs]
Jul 15 05:46:28 nbp11-srv3 kernel: start_this_handle+0x10a/0x520 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: ? osd_fallocate_preallocate.isra.38+0x275/0x760 [osd_ldiskfs]
Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_mark_iloc_dirty+0x32/0x90 [ldiskfs]
Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate+0xfd/0x370 [osd_ldiskfs]
Jul 15 05:46:28 nbp11-srv3 kernel: ofd_object_fallocate+0x5dd/0xa30 [ofd]
Jul 15 05:46:28 nbp11-srv3 kernel: ofd_fallocate_hdl+0x467/0x730 [ofd]
Jul 15 05:46:28 nbp11-srv3 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc]
Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
Jul 15 05:46:28 nbp11-srv3 kernel: kthread+0x10a/0x120
Jul 15 05:46:28 nbp11-srv3 kernel: ? set_kthread_struct+0x50/0x50
Jul 15 05:46:28 nbp11-srv3 kernel: ret_from_fork+0x1f/0x40
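For context, the blocked ll_ost thread is servicing an OST-side fallocate request (ofd_fallocate_hdl -> osd_fallocate), which is driven by a client calling fallocate(2) on a Lustre file. Below is a minimal client-side sketch of such a call; the mount point, file name, and sizes are placeholders for illustration only, not values taken from this report.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        /* hypothetical Lustre mount point and file name */
        const char *path = "/mnt/lustre/prealloc-test";
        int fd = open(path, O_CREAT | O_RDWR, 0644);

        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }
        /* mode 0 = allocate and extend the file; the 1 GiB size is arbitrary */
        if (fallocate(fd, 0, 0, 1024L * 1024 * 1024) < 0) {
                perror("fallocate");
                close(fd);
                return EXIT_FAILURE;
        }
        close(fd);
        return EXIT_SUCCESS;
}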
Attachments
Issue Links
- duplicates LU-15800 Fallocate causes transaction deadlock (Resolved)
mhanafi I'm very very sorry, but ...
could you please instead apply the patch I'm attaching? I've gone through a few code paths and now think there is another problem with fallocate; we would be better off reusing the range locking you still have in your tree to fix the problem.
The patch is fallocate-range-locking.patch.
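The attached patch is not reproduced here. Purely as an illustration of the range-locking idea it refers to (this is NOT the patch and NOT Lustre's range-lock API; the type and helper names are hypothetical), serializing fallocate against other I/O on the same byte range looks roughly like this:

/* Hypothetical exclusive byte-range lock, for illustration only. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct range_lock {
        uint64_t start, end;            /* [start, end) byte range */
        struct range_lock *next;
};

static pthread_mutex_t rl_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rl_cond  = PTHREAD_COND_INITIALIZER;
static struct range_lock *rl_list;      /* currently held ranges */

static bool ranges_overlap(const struct range_lock *a, const struct range_lock *b)
{
        return a->start < b->end && b->start < a->end;
}

/* Block until no held lock overlaps [start, end), then record our range. */
static void range_lock(struct range_lock *lck, uint64_t start, uint64_t end)
{
        lck->start = start;
        lck->end = end;
        pthread_mutex_lock(&rl_mutex);
again:
        for (struct range_lock *cur = rl_list; cur; cur = cur->next) {
                if (ranges_overlap(cur, lck)) {
                        pthread_cond_wait(&rl_cond, &rl_mutex);
                        goto again;     /* rescan after being woken up */
                }
        }
        lck->next = rl_list;
        rl_list = lck;
        pthread_mutex_unlock(&rl_mutex);
}

static void range_unlock(struct range_lock *lck)
{
        pthread_mutex_lock(&rl_mutex);
        for (struct range_lock **pp = &rl_list; *pp; pp = &(*pp)->next) {
                if (*pp == lck) {
                        *pp = lck->next;
                        break;
                }
        }
        pthread_cond_broadcast(&rl_cond);
        pthread_mutex_unlock(&rl_mutex);
}

The fallocate-style operation would take such a lock over the whole extent it is preallocating before starting (and restarting) its journal transactions, so concurrent I/O to the same range cannot interleave with it and wedge on journal credits.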