Deadlock on transaction with iget()/clear_inode()

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.8.0
    • None
    • 3

    Description

      Thread 1:

      schedule
      start_this_handle
      jbd2_journal_start
      ldiskfs_journal_start_sb
      ldiskfs_dquot_drop
      vfs_dq_drop
      clear_inode
      dispose_list
      shrink_icache_memory
      shrink_slab
      zone_reclaim
      get_page_from_freelist
      __alloc_pages_nodemask
      alloc_pages_vma
      do_huge_pmd_anonymous_page
      handle_mm_fault
      __do_page_fault
      do_page_fault
      page_fault
      

      Thread 2:

      __wait_on_freeing_inode
      find_inode_fast
      ifind_fast
      iget_locked
      ldiskfs_iget
      osd_iget
      osd_index_ea_delete
      out_obj_index_delete
      out_tx_index_delete_exec
      out_tx_end
      out_handle
      tgt_request_handle
      ptlrpc_main
      kthread
      kernel_thread
      

          Activity

            [LU-6918] Deadlock on transaction with iget()/clear_inode()
            pjones Peter Jones added a comment -

            Duplicate of LU-6969

            bzzz Alex Zhuravlev added a comment - http://review.whamcloud.com/#/c/15924/

            askulysh Andriy Skulysh added a comment - We have several reports from different sites. It isn't easily reproducible.

            adilger Andreas Dilger added a comment - Andriy, how easily can this deadlock be hit, and what is the workload to trigger it?

            bzzz Alex Zhuravlev added a comment -

            No, we shouldn't do this in the target code: the target has no idea of agent inodes, and it can't address such an inode because an agent inode has no FID assigned. I think the only solution is to postpone the inode destroy. This can be done in different ways. The most trivial is to keep a list of inode numbers in memory. This can lead to an orphan, but the number of agent inodes is very small, they don't occupy much space, and at some point they will be discovered by LFSCK, so it is probably good enough. If not, we can do something similar to ext4_orphan_add().
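
The "list of inode numbers in memory" idea can be illustrated with a small userspace sketch. This is not the Lustre or ldiskfs implementation: `deferred_list`, `deferred_add()`, `deferred_drain()` and `example_destroy()` are hypothetical names made up for illustration, and a kernel version would additionally need locking and a safe context (outside any open transaction) from which to drain the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One postponed destroy: only the inode number is remembered. */
struct deferred_ino {
    unsigned long ino;
    struct deferred_ino *next;
};

/* In-memory list; its contents are lost on restart, which is how an
 * orphan could appear in this scheme. */
struct deferred_list {
    struct deferred_ino *head;
    size_t count;
};

/* Callback type standing in for the real destroy operation (hypothetical). */
typedef int (*destroy_fn)(unsigned long ino, void *ctx);

/* Record an inode number instead of destroying the inode right away.
 * Returns 0 on success, -1 if memory could not be allocated. */
int deferred_add(struct deferred_list *l, unsigned long ino)
{
    struct deferred_ino *e = malloc(sizeof(*e));

    if (e == NULL)
        return -1;
    e->ino = ino;
    e->next = l->head;
    l->head = e;
    l->count++;
    return 0;
}

/* Run the postponed destroys. Returns how many callbacks reported success.
 * Entries are freed either way, so a failed destroy is simply dropped. */
size_t deferred_drain(struct deferred_list *l, destroy_fn destroy, void *ctx)
{
    struct deferred_ino *e = l->head;
    size_t done = 0;

    l->head = NULL;
    l->count = 0;
    while (e != NULL) {
        struct deferred_ino *next = e->next;

        if (destroy(e->ino, ctx) == 0)
            done++;
        free(e);
        e = next;
    }
    return done;
}

/* Example callback: adds each inode number to a running sum in ctx.
 * A stand-in for illustration only; it destroys nothing. */
int example_destroy(unsigned long ino, void *ctx)
{
    *(unsigned long *)ctx += ino;
    return 0;
}
```

A caller would use `deferred_add()` at the point where the destroy cannot safely run, and `deferred_drain()` later from a context that holds no transaction. Anything still on the list when the server stops is never destroyed, which matches the orphan caveat in the comment above.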

            askulysh Andriy Skulysh added a comment -

            iget() waits for the I_FREEING flag to be cleared, but clear_inode()/ldiskfs_dquot_drop() first wants to start a transaction and clears the flag only after that.

            This behavior is common to most of the target code: it starts a transaction first and only then locates the object by means of iget().
            A possible solution would be to pin the inode before starting the transaction in the tgt code.
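
The ordering problem described in this comment can be sketched as a toy wait-for model. It is a deliberate simplification and not kernel code: the journal is modelled as a single exclusive resource (real jbd2 handles are not exclusive), and every name here (`model_acquire()`, `scenario_transaction_first()`, and so on) is hypothetical. The sketch only shows why "transaction first, then iget()" can form a cycle with the shrinker path, while "pin the inode first" cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Resources and actors of the toy model (illustrative names only). */
enum resource { RES_JOURNAL, RES_INODE, RES_COUNT };
enum actor { ACT_NONE = -1, ACT_SHRINKER, ACT_TARGET, ACT_COUNT };

struct model {
    enum actor holder[RES_COUNT]; /* who holds each resource */
    int waits_for[ACT_COUNT];     /* resource an actor is blocked on, or -1 */
};

void model_init(struct model *m)
{
    for (int r = 0; r < RES_COUNT; r++)
        m->holder[r] = ACT_NONE;
    for (int a = 0; a < ACT_COUNT; a++)
        m->waits_for[a] = -1;
}

/* Take a resource if it is free; otherwise record that the actor waits. */
bool model_acquire(struct model *m, enum actor a, enum resource r)
{
    assert(m->waits_for[a] < 0); /* a blocked actor cannot ask again */
    if (m->holder[r] == ACT_NONE || m->holder[r] == a) {
        m->holder[r] = a;
        return true;
    }
    m->waits_for[a] = r;
    return false;
}

/* Follow actor -> awaited resource -> holder; a return to the start is a cycle. */
bool model_deadlocked(const struct model *m)
{
    for (int start = 0; start < ACT_COUNT; start++) {
        int cur = start;

        for (int step = 0; step < ACT_COUNT; step++) {
            int res = m->waits_for[cur];

            if (res < 0)
                break;
            cur = m->holder[res];
            if (cur == ACT_NONE)
                break;
            if (cur == start)
                return true;
        }
    }
    return false;
}

/* Order from the traces: transaction opened first, iget() afterwards. */
bool scenario_transaction_first(void)
{
    struct model m;

    model_init(&m);
    model_acquire(&m, ACT_TARGET, RES_JOURNAL);   /* target opens a transaction */
    model_acquire(&m, ACT_SHRINKER, RES_INODE);   /* inode enters I_FREEING */
    model_acquire(&m, ACT_TARGET, RES_INODE);     /* iget() waits on the inode */
    model_acquire(&m, ACT_SHRINKER, RES_JOURNAL); /* dquot drop waits to start */
    return model_deadlocked(&m);
}

/* Proposed order: pin the inode, then open the transaction. */
bool scenario_pin_first(void)
{
    struct model m;

    model_init(&m);
    model_acquire(&m, ACT_TARGET, RES_INODE);     /* inode pinned up front */
    model_acquire(&m, ACT_TARGET, RES_JOURNAL);   /* transaction opened after */
    model_acquire(&m, ACT_SHRINKER, RES_INODE);   /* shrinker cannot free it */
    return model_deadlocked(&m);
}
```

In this model `scenario_transaction_first()` reports a cycle and `scenario_pin_first()` does not. In a real kernel the shrinker would skip a pinned inode rather than wait for it; the model records a wait only to keep the bookkeeping uniform.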

            People

              wc-triage WC Triage
              askulysh Andriy Skulysh
              Votes: 0
              Watchers: 9
