Fallocate causes transaction deadlock

    Description

      PID: 74368  TASK: ffff9600eaeac740  CPU: 9   COMMAND: "ll_ost_io02_069"
       #0 [ffffa3f1a7a57830] __schedule at ffffffff9034e1d4
       #1 [ffffa3f1a7a578c8] schedule at ffffffff9034e648
       #2 [ffffa3f1a7a578d8] rwsem_down_read_slowpath at ffffffff903511d0
       #3 [ffffa3f1a7a57978] osd_read_lock at ffffffffc1a3379d [osd_ldiskfs]
                                      <--     rc = dt_trans_start_local(env, ofd->ofd_osd , th);
                                              ofd_read_lock(env, ofd_obj);
       #4 [ffffa3f1a7a57998] ofd_write_attr_set at ffffffffc186b6cc [ofd]
       #5 [ffffa3f1a7a57a00] ofd_commitrw_write at ffffffffc186c812 [ofd]
       #6 [ffffa3f1a7a57aa0] ofd_commitrw at ffffffffc18721f1 [ofd]
       #7 [ffffa3f1a7a57b60] finish_wait at ffffffff8fb2e5ac
       #8 [ffffa3f1a7a57bd8] tgt_brw_write at ffffffffc1255544 [ptlrpc]
      
      PID: 73559  TASK: ffff9601653a97c0  CPU: 11  COMMAND: "ll_ost02_046"
       #0 [ffffa3f1a0817970] __schedule at ffffffff9034e1d4
       #1 [ffffa3f1a0817a08] schedule at ffffffff9034e648
       #2 [ffffa3f1a0817a18] wait_transaction_locked at ffffffffc0ad2089 [jbd2]
       #3 [ffffa3f1a0817a68] add_transaction_credits at ffffffffc0ad21c4 [jbd2]
       #4 [ffffa3f1a0817ac0] start_this_handle at ffffffffc0ad250a [jbd2]
       #5 [ffffa3f1a0817b40] jbd2__journal_restart at ffffffffc0ad2ad0 [jbd2]
       #6 [ffffa3f1a0817b80] osd_fallocate_preallocate at ffffffffc1a5b6d2 [osd_ldiskfs]
       #7 [ffffa3f1a0817c18] osd_fallocate at ffffffffc1a5b98d [osd_ldiskfs]
                              <--     ofd_trans_start(env, ofd, fo, th);
                                      ofd_write_lock(env, fo);
       #8 [ffffa3f1a0817c50] ofd_object_fallocate at ffffffffc18682f9 [ofd]
       #9 [ffffa3f1a0817cb8] ofd_fallocate_hdl at ffffffffc185912f [ofd]
      #10 [ffffa3f1a0817d50] tgt_request_handle at ffffffffc1256a53 [ptlrpc]

      The deadlock was introduced by:

       Commit:         93f700ca241a98630fc5ff19a041e35fbdbf0385
       Author:         Arshad Hussain <arshad.super@gmail.com>
       Committer:      Oleg Drokin <green@whamcloud.com>
       Author Date:    Thu 10 Sep 2020 02:18:13 AM EEST
       Committer Date: Thu 29 Oct 2020 06:28:42 AM EET
      
       LU-13765 osd-ldiskfs: Extend credit correctly for fallocate
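
One way to read the two traces above, offered as a sketch rather than a statement of what the code does internally: the ll_ost_io thread has started a transaction and is waiting for the object read lock; the ll_ost thread holds the object write lock and, after jbd2__journal_restart() dropped its handle, is waiting for the journal to admit a new one; the journal commit would in turn be waiting for open handles such as the writer's. The small C model below encodes that wait-for graph and checks it for a cycle. Every name in it (the actor enum, build_wait_graph, fallocate_deadlocks) is hypothetical and is not a Lustre or jbd2 API, and the assumption that the commit waits on the writer's handle is inferred from the traces.

```c
#include <assert.h>
#include <stdbool.h>

/* Actors in the simplified model (illustrative names only). */
enum actor { WRITER, FALLOCATE, JOURNAL_COMMIT, NR_ACTORS };

enum lock_mode { LOCK_READ, LOCK_WRITE };

/* Two object-lock modes conflict unless both are read locks. */
static bool modes_conflict(enum lock_mode held, enum lock_mode wanted)
{
    return held == LOCK_WRITE || wanted == LOCK_WRITE;
}

/*
 * Fill waits_for[a] with the actor that 'a' is blocked on, or -1 if 'a'
 * can make progress. 'fallocate_mode' is the object-lock mode held by the
 * fallocate thread while it restarts its journal handle.
 */
static void build_wait_graph(enum lock_mode fallocate_mode,
                             int waits_for[NR_ACTORS])
{
    /* Writer: holds a journal handle, wants the object read lock. */
    waits_for[WRITER] =
        modes_conflict(fallocate_mode, LOCK_READ) ? FALLOCATE : -1;
    /* Fallocate: dropped its handle on restart, waits for the journal. */
    waits_for[FALLOCATE] = JOURNAL_COMMIT;
    /* Commit (assumed): waits for every open handle, here the writer's. */
    waits_for[JOURNAL_COMMIT] = WRITER;
}

/* Follow wait-for edges from 'start'; coming back to 'start' is a cycle. */
static bool is_deadlocked(const int waits_for[NR_ACTORS], int start)
{
    assert(start >= 0 && start < NR_ACTORS);

    int cur = waits_for[start];
    for (int steps = 0; cur != -1 && steps < NR_ACTORS; steps++) {
        if (cur == start)
            return true;
        cur = waits_for[cur];
    }
    return false;
}

/* True if the modelled fallocate thread can never be woken up. */
static bool fallocate_deadlocks(enum lock_mode fallocate_mode)
{
    int waits_for[NR_ACTORS];

    build_wait_graph(fallocate_mode, waits_for);
    return is_deadlocked(waits_for, FALLOCATE);
}
```

Under these assumptions, fallocate_deadlocks(LOCK_WRITE) finds a cycle and fallocate_deadlocks(LOCK_READ) does not, because a shared read lock no longer blocks the writer from finishing and releasing its handle.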
      

          Activity

            [LU-15800] Fallocate causes transaction deadlock
            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47268/
            Subject: LU-15800 ofd: take a read lock for fallocate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5fae80066162ea637c8649f6439fc14e1d9a7cf8

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47268
            Subject: LU-15800 ofd: take a read lock for fallocate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ce637a38fd863f07c1e9a35f9a7c0731d858c23e

            pjones Peter Jones added a comment -

            Given that this issue existed in 2.14, I think that it should be ok to descope it from 2.15.0 and include in a future 2.15.x maintenance release.

            arshad512 Arshad Hussain added a comment -

            Andriy,

            Is there a test case (or manual steps) that can trigger this issue? Environment details would also help (how large was the fallocate?). A reproducer would help me a great deal. I tried to reproduce the bug by running the standard sanity/sanityn test cases over loop, but could not reproduce the deadlock, so the standard test cases do not trigger it. From the stack trace you provided, it looks like there are two threads, one doing a fallocate (standard prealloc) and the other doing a write (e.g. dd). Is that right?

            >...in the code but it is violated by jbd2__journal_restart(). It shouldn't be called under ofd_write_lock()

            Sorry, I thought this was fine under the ofd lock. Can you please explain in more detail?

            osd_extend_restart_trans()
                ->ldiskfs_journal_restart()
                    -> jbd2_journal_restart()
                        -> jbd2__journal_restart()

            Thanks

            arshad512 Arshad Hussain added a comment -

            Hi Peter, I am looking into it.

            Thanks

            pjones Peter Jones added a comment -

            Arshad

            Is this something that you are able to look into?

            Peter

            askulysh Andriy Skulysh added a comment -

            jhammond, this isn't a duplicate of LU-14214. The locking is correct in the code, but it is violated by jbd2__journal_restart(), which shouldn't be called under ofd_write_lock().

            People

              arshad512 Arshad Hussain
              askulysh Andriy Skulysh