Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.15.2
    • Labels: None
    • Severity: 2

    Description

      We have had multiple servers deadlock with this stack trace.

      (longer console output attached)

      Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds.
      Jul 15 05:46:28 nbp11-srv3 kernel:      Tainted: G           OE    --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
      Jul 15 05:46:28 nbp11-srv3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 15 05:46:28 nbp11-srv3 kernel: task:ll_ost07_000    state:D stack:    0 pid: 9230 ppid:     2 flags:0x80004080
      Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace:
      Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860
      Jul 15 05:46:28 nbp11-srv3 kernel: schedule+0x35/0xa0
      Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80
      Jul 15 05:46:28 nbp11-srv3 kernel: add_transaction_credits+0xd4/0x290 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_do_update_inode+0x604/0x800 [ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: start_this_handle+0x10a/0x520 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? osd_fallocate_preallocate.isra.38+0x275/0x760 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_mark_iloc_dirty+0x32/0x90 [ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate+0xfd/0x370 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: ofd_object_fallocate+0x5dd/0xa30 [ofd]
      Jul 15 05:46:28 nbp11-srv3 kernel: ofd_fallocate_hdl+0x467/0x730 [ofd]
      Jul 15 05:46:28 nbp11-srv3 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: kthread+0x10a/0x120
      Jul 15 05:46:28 nbp11-srv3 kernel: ? set_kthread_struct+0x50/0x50
      Jul 15 05:46:28 nbp11-srv3 kernel: ret_from_fork+0x1f/0x40
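
      The blocked thread is handling a client fallocate request (ofd_fallocate_hdl -> osd_fallocate_preallocate) and is stuck waiting for journal credits. For reference, a minimal userspace sketch of the kind of client call that ends up in this server-side path is shown below; the file path and allocation size are illustrative only, not taken from this ticket:

      /* Illustrative only: not from the ticket. A plain fallocate(2) on a
       * Lustre file is the kind of operation that reaches ofd_fallocate_hdl
       * and osd_fallocate_preallocate on the OST. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
          const char *path = "/mnt/lustre/testfile";   /* hypothetical mount/file */
          int fd = open(path, O_RDWR | O_CREAT, 0644);

          if (fd < 0) {
              perror("open");
              return 1;
          }
          /* mode 0: preallocate blocks for the byte range [0, 1 GiB) */
          if (fallocate(fd, 0, 0, (off_t)1 << 30) < 0)
              perror("fallocate");
          close(fd);
          return 0;
      }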
      

      Attachments

        1. brw_stats
          8 kB
        2. brw_stats.save.1693236421
          89 kB
        3. dmesg.out
          119 kB
        4. fallocate-range-locking.patch
          1 kB
        5. nbp15.hang
          45 kB
        6. stack.out
          51 kB
        7. stack1.out
          55 kB
        8. stack1-1.out
          55 kB


          Activity

            [LU-16966] ofd_object_fallocate dead lock?

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52264/
            Subject: LU-16966 osd: take trunc_lock for fallocate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 51529fb57f85210e292a15c882cf25a4689ea77d

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52264/ Subject: LU-16966 osd: take trunc_lock for fallocate Project: fs/lustre-release Branch: master Current Patch Set: Commit: 51529fb57f85210e292a15c882cf25a4689ea77d

            mhanafi Mahmoud Hanafi added a comment -

            We applied the patch provided and we have not seen the issue since.

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52264
            Subject: LU-16966 osd: take trunc_lock for fallocate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 48a6d52640fe716760d1a369f4bc53ebdba25e6d

            gerrit Gerrit Updater added a comment - "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52264 Subject: LU-16966 osd: take trunc_lock for fallocate Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 48a6d52640fe716760d1a369f4bc53ebdba25e6d
            bzzz Alex Zhuravlev added a comment - edited

            mhanafi I'm very sorry, but could you please apply the patch I'm attaching instead? I've gone through a few code paths and now think there is another problem with fallocate, and we'd be better off reusing the range locking you still have in your tree to fix it.
            The patch is fallocate-range-locking.patch.
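
            For reference, both the attached range-locking patch and the trunc_lock change that later landed appear to serialize fallocate against concurrent truncate/punch on the same object. A rough userspace analogue of that serialization, using a pthread rwlock as a stand-in for the server-side lock (the file path, sizes, and the lock itself are illustrative, not Lustre code):

            /* Conceptual userspace analogue only: a single rwlock stands in for
             * the per-object serialization (range lock / trunc_lock) discussed
             * above. Build with: cc -pthread demo.c */
            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_rwlock_t object_lock = PTHREAD_RWLOCK_INITIALIZER;
            static int fd;

            static void *do_fallocate(void *arg)
            {
                (void)arg;
                /* take the lock before allocating, so a concurrent truncate
                 * cannot interleave with the multi-transaction allocation */
                pthread_rwlock_wrlock(&object_lock);
                if (fallocate(fd, 0, 0, 64 * 1024 * 1024) < 0)
                    perror("fallocate");
                pthread_rwlock_unlock(&object_lock);
                return NULL;
            }

            static void *do_truncate(void *arg)
            {
                (void)arg;
                pthread_rwlock_wrlock(&object_lock);
                if (ftruncate(fd, 0) < 0)
                    perror("ftruncate");
                pthread_rwlock_unlock(&object_lock);
                return NULL;
            }

            int main(void)
            {
                pthread_t t1, t2;

                fd = open("/tmp/fallocate-demo", O_RDWR | O_CREAT, 0644);
                if (fd < 0) {
                    perror("open");
                    return 1;
                }
                pthread_create(&t1, NULL, do_fallocate, NULL);
                pthread_create(&t2, NULL, do_truncate, NULL);
                pthread_join(t1, NULL);
                pthread_join(t2, NULL);
                close(fd);
                return 0;
            }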

            mhanafi Mahmoud Hanafi added a comment -

            Thanks, we'll get the patch applied this week and let you know the results.

            bzzz Alex Zhuravlev added a comment -

            mhanafi could you please apply the patch just added? The patch reverts range locking.

            bzzz Alex Zhuravlev added a comment -

            Can we get an update please. We have filesystem hanging multiple times a day.

            mhanafi sorry for the delay, I'm trying to reconstruct the problem using the brw_stats you provided.

            mhanafi Mahmoud Hanafi added a comment -

            Can we get an update please. We have filesystem hanging multiple times a day.

            mhanafi Mahmoud Hanafi added a comment -

            We were able to get some of our servers patched with LU-15564 and got brw_stats after a hang, see below. btw, if we set debug=+trace it reduces the chance of hitting this bug.

            brw_stats.save.1693236421

            stack1.out
            cfaber Colin Faber added a comment -

            Hi mhanafi,
            Sorry for the delay. You can try LU-15117 (there is a port already for b2_15). LU-15894 may also help, but there is no port available. The patch bzzz is asking you to try will provide additional statistics to help us better understand the problem. I would suggest that you start there: provide the requested statistics and allow us to better understand the problem before attempting other patches.

            mhanafi Mahmoud Hanafi added a comment -

            Still waiting for an answer to the above question.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 11
