Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.15.2
    • None
    • 2

    Description

      Multiple servers have deadlocked with the stack trace below.

      (A longer console output is attached.)

      Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds.
      Jul 15 05:46:28 nbp11-srv3 kernel:      Tainted: G           OE    --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
      Jul 15 05:46:28 nbp11-srv3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 15 05:46:28 nbp11-srv3 kernel: task:ll_ost07_000    state:D stack:    0 pid: 9230 ppid:     2 flags:0x80004080
      Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace:
      Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860
      Jul 15 05:46:28 nbp11-srv3 kernel: schedule+0x35/0xa0
      Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80
      Jul 15 05:46:28 nbp11-srv3 kernel: add_transaction_credits+0xd4/0x290 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_do_update_inode+0x604/0x800 [ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: start_this_handle+0x10a/0x520 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? osd_fallocate_preallocate.isra.38+0x275/0x760 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ldiskfs_mark_iloc_dirty+0x32/0x90 [ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2]
      Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate+0xfd/0x370 [osd_ldiskfs]
      Jul 15 05:46:28 nbp11-srv3 kernel: ofd_object_fallocate+0x5dd/0xa30 [ofd]
      Jul 15 05:46:28 nbp11-srv3 kernel: ofd_fallocate_hdl+0x467/0x730 [ofd]
      Jul 15 05:46:28 nbp11-srv3 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
      Jul 15 05:46:28 nbp11-srv3 kernel: kthread+0x10a/0x120
      Jul 15 05:46:28 nbp11-srv3 kernel: ? set_kthread_struct+0x50/0x50
      Jul 15 05:46:28 nbp11-srv3 kernel: ret_from_fork+0x1f/0x40
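
When comparing hung-task reports across several servers, it can help to reduce each trace to its task name and its list of reliable frames. The following is only a minimal sketch, assuming syslog-formatted lines like the ones above; the helper name `parse_hung_task` and both regular expressions are illustrative and not part of Lustre or the kernel. Frames prefixed with `?` are ones the kernel unwinder could not verify, so the sketch skips them.

```python
import re

# Matches a frame such as "kernel: wait_transaction_locked+0x89/0xd0 [jbd2]".
# A leading "? " marks a frame the kernel unwinder could not verify.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unreliable>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)
# Matches the hung-task header line emitted by the kernel.
TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_task(lines):
    """Return ((task, pid, seconds), [(function, module), ...]) for one report."""
    task = None
    frames = []
    for line in lines:
        m = TASK_RE.search(line)
        if m:
            task = (m["name"], int(m["pid"]), int(m["secs"]))
            continue
        m = FRAME_RE.search(line)
        if m and not m["unreliable"]:
            # Frames without a "[module]" suffix belong to the core kernel.
            frames.append((m["func"], m["module"] or "kernel"))
    return task, frames


# A few lines copied from the trace in this ticket.
SAMPLE = """\
Jul 15 05:46:28 nbp11-srv3 kernel: INFO: task ll_ost07_000:9230 blocked for more than 120 seconds.
Jul 15 05:46:28 nbp11-srv3 kernel: Call Trace:
Jul 15 05:46:28 nbp11-srv3 kernel: __schedule+0x2d1/0x860
Jul 15 05:46:28 nbp11-srv3 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: ? finish_wait+0x80/0x80
Jul 15 05:46:28 nbp11-srv3 kernel: jbd2__journal_restart+0xb4/0x160 [jbd2]
Jul 15 05:46:28 nbp11-srv3 kernel: osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
"""

task, frames = parse_hung_task(SAMPLE.splitlines())
print(task)
for func, module in frames:
    print(f"{module:12s} {func}")
```

Running the same function over each attached console log makes it easier to see whether every hang goes through `jbd2__journal_restart` from `osd_fallocate_preallocate`, as it does here.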
      

      Attachments

        1. brw_stats
          8 kB
        2. brw_stats.save.1693236421
          89 kB
        3. dmesg.out
          119 kB
        4. fallocate-range-locking.patch
          1 kB
        5. nbp15.hang
          45 kB
        6. stack.out
          51 kB
        7. stack1.out
          55 kB
        8. stack1-1.out
          55 kB

          Activity

            [LU-16966] ofd_object_fallocate dead lock?

            mhanafi Mahmoud Hanafi added a comment -

            We were able to patch some of our servers with LU-15564 and collected brw_stats after a hang; see below. Incidentally, setting debug=+trace reduces the chance of hitting this bug.

            brw_stats.save.1693236421

            stack1.out

            cfaber Colin Faber added a comment -

            Hi mhanafi,
            Sorry for the delay. You can try LU-15117 (there is already a port for b2_15). LU-15894 may likely help, but no port is available. The patch bzzz is asking you to try will provide additional statistics to help us better understand the problem. I suggest you start there: provide the requested statistics so that we can better understand the problem before attempting other patches.

            mhanafi Mahmoud Hanafi added a comment -

            Still waiting for an answer to the above question.

            mhanafi Mahmoud Hanafi added a comment - edited

            We are going to do a build with LU-15564. Should we also pick up LU-15117 and revert LU-15894?

            We would need a backport for LU-15117.

            bzzz Alex Zhuravlev added a comment -

            "Is there some additional debugging we can do to help with this issue?"

            Would it be possible to apply LU-15564 so we can track allocation time?

            mhanafi Mahmoud Hanafi added a comment -

            Is there some additional debugging we can do to help with this issue?

            bzzz Alex Zhuravlev added a comment -

            "brw_stats attached"

            Unfortunately, LU-15564 has not landed in that tree yet. It could help determine whether this is related to block allocation.

            bzzz Alex Zhuravlev added a comment -

            Looking at the code in that tree, I noticed a couple of things:

            • LU-15117 is missing (not sure it's really related, but I saw a few RPC timeouts, and that could cause similar symptoms)
            • LU-15894 is not reverted (and I see one trace stuck at the range lock)

            mhanafi Mahmoud Hanafi added a comment -

            Initially we were running wc 2.15.2. After we got the patch, we ran 2.15.3 + LU-15800 (2.15.3-2nas, https://github.com/champios/lustre-nas).

            brw_stats attached.

            brw_stats

            bzzz Alex Zhuravlev added a comment -

            b2_15-nas is the only "15" one, right?

            bzzz Alex Zhuravlev added a comment -

            Thanks, Peter.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 11
