Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0, Lustre 2.15.4
    • Lustre 2.15.2
    • None
    • 2

    Description

      We have had multiple servers deadlock with this stack trace.

      (attached longer console output)

      Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds.
      Jul 15 05:46:28 nbp11-srv3 kernel:      Tainted: G           OE    --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
      Jul 15 05:46:28 nbp11-srv3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 15 05:46:28 nbp11-srv3 kernel: task:ll_ost07_000    state:D stack:    0 pid: 9230 ppid:     2 flags:0x80004080
      Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace:
      Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860
      Jul 15 05:46:28 nbp11-srv3 kernel: schedule+0x35/0xa0
      Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80
      Jul 15 05:46:28 nbp11-srv3 kernel: add_transaction_credits+0xd4/0x290 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_do_update_inode+0x604/0x800 [ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: start_this_handle+0x10a/0x520 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? osd_fallocate_preallocate.isra.38+0x275/0x760 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_mark_iloc_dirty+0x32/0x90 [ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate+0xfd/0x370 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: ofd_object_fallocate+0x5dd/0xa30 [ofd]
      Jul 15 05:46:28 nbp11-srv3 kernel: ofd_fallocate_hdl+0x467/0x730 [ofd]
      Jul 15 05:46:28 nbp11-srv3 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: kthread+0x10a/0x120
      Jul 15 05:46:28 nbp11-srv3 kernel: ? set_kthread_struct+0x50/0x50
      Jul 15 05:46:28 nbp11-srv3 kernel: ret_from_fork+0x1f/0x40
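
      When comparing hung-task reports from several servers, it can help to reduce each report to the blocked task and the ordered list of frames. The following is a minimal sketch, assuming the syslog layout shown in the trace above ("<timestamp> <host> kernel: <message>"). The function names `parse_hung_task` and `reliable_path` are hypothetical helpers written for this illustration and are not part of Lustre or the kernel. Frames prefixed with "? " are ones the kernel unwinder marks as unreliable, so the sketch keeps them but flags them.

```python
import re

# Assumed line layout: "<timestamp> <host> kernel: <message>".
KERNEL_LINE = re.compile(r"kernel:\s*(?P<payload>.*)$")
BLOCKED = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# A frame looks like "symbol+0xOFF/0xSIZE [module]"; a leading "? " marks an
# unreliable frame, and the module part is absent for core kernel symbols.
FRAME = re.compile(
    r"^(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?$"
)


def parse_hung_task(lines):
    """Return the first hung-task report found in `lines`, or None."""
    report = None
    in_trace = False
    for line in lines:
        m = KERNEL_LINE.search(line)
        if not m:
            continue
        payload = m.group("payload").strip()
        blocked = BLOCKED.search(payload)
        if blocked:
            if report is not None:
                break  # a second report starts here; stop at the first one
            report = {
                "task": blocked.group("task"),
                "pid": int(blocked.group("pid")),
                "timeout": int(blocked.group("secs")),
                "frames": [],  # (symbol, module or None, reliable?)
            }
            continue
        if report is None:
            continue
        if payload == "Call Trace:":
            in_trace = True
            continue
        if in_trace:
            frame = FRAME.match(payload)
            if frame:
                report["frames"].append(
                    (frame.group("sym"), frame.group("mod"), frame.group("guess") is None)
                )
    return report


def reliable_path(report):
    """Symbols of the reliable frames only, innermost first."""
    return [sym for sym, _mod, reliable in report["frames"] if reliable]


# A few lines copied from the trace in this report.
SAMPLE = """\
Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds.
Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace:
Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860
Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80
Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
"""

if __name__ == "__main__":
    rep = parse_hung_task(SAMPLE.splitlines())
    print(rep["task"], rep["pid"], " <- ".join(reliable_path(rep)))
```

      Running the same function over the attached console logs (for example dmesg.out) would make it easier to check whether every hung thread is waiting in the same jbd2 path; that use is a suggestion and was not done as part of this ticket.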
      

      Attachments

        1. dmesg.out
          119 kB
        2. nbp15.hang
          45 kB
        3. stack.out
          51 kB
        4. brw_stats
          8 kB
        5. brw_stats.save.1693236421
          89 kB
        6. stack1.out
          55 kB
        7. stack1-1.out
          55 kB
        8. fallocate-range-locking.patch
          1 kB


          Activity

            [LU-16966] ofd_object_fallocate dead lock?
            pjones Peter Jones added a comment -

            Will be included in 2.15.4

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52710/
            Subject: LU-16966 osd: take trunc_lock for fallocate
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 9c97d1969e2298fdfe5daa616e36cbe17a9b3d5e

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52710
            Subject: LU-16966 osd: take trunc_lock for fallocate
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: d6e549e9ea2eb6e7b141203dad0130cc8da5f1db

            bzzz Alex Zhuravlev added a comment -

            Does this issue also affect b_es6_0 and b2_15?

            yes, b_es6_0 needs that for sure, will check b2_15

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52264/
            Subject: LU-16966 osd: take trunc_lock for fallocate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 51529fb57f85210e292a15c882cf25a4689ea77d

            mhanafi Mahmoud Hanafi added a comment -

            We applied the patch provided and have not seen the issue.

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52264
            Subject: LU-16966 osd: take trunc_lock for fallocate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 48a6d52640fe716760d1a369f4bc53ebdba25e6d

            bzzz Alex Zhuravlev added a comment - - edited

            mhanafi, I'm very sorry, but could you please apply the patch I'm attaching instead? I've gone through a few code paths and now think there is another problem with fallocate; it would be better to reuse the range locking you still have in your tree to fix the problem.
            The patch is fallocate-range-locking.patch.


            mhanafi Mahmoud Hanafi added a comment -

            Thanks, we'll get the patch applied this week and let you know the results.

            People

              bzzz Alex Zhuravlev
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 11
