Lustre / LU-4090

OST unavailable due to possible deadlock

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Lustre 2.7.0, Lustre 2.5.3
    • Lustre 1.8.8
    • None
    • 3
    • 10994

    Description

      One OST became unavailable and kept dumping stack traces until its service was taken over by another OSS. This issue has occurred a couple of times on different servers.

      After some investigation, we found that many service threads were hung at different places. Here is a list of where they are stuck.

      ll_ost_01:10226, ll_ost_07:10232, ll_ost_09:10234, ll_ost_11:10236, ll_ost_13:10238, ll_ost_15:10240, ll_ost_18:10243
      filter_lvbo_init
      --filter_fid2dentry
      ----filter_parent_lock
      ------filter_lock_dentry
      --------LOCK_INODE_MUTEX(dparent->d_inode);

      ll_ost_06:10231, ll_ost_16:10241, ll_ost_484, ll_ost_io_129, ll_ost_io_123, ll_ost_383
      fsfilt_ext3_start
      --ext3_journal_start
      ----journal_start
      ------start_this_handle
      --------__jbd2_log_wait_for_space
      ----------mutex_lock(&journal->j_checkpoint_mutex);

      ll_ost_17:10242
      filter_lvbo_init
      --filter_fid2dentry
      ----filter_parent_lock
      ----lookup_one_len
      ------__lookup_hash
      --------inode->i_op->lookup = ext4_lookup
      ----------ext4_iget
      ------------iget_locked
      --------------ifind_fast
      ----------------find_inode_fast
      ------------------__wait_on_freeing_inode
      --------------------? ldiskfs_bread... Child dentry's inode__I_LOCK

      ll_ost_io_15
      ost_brw_write
      --filter_commitrw_write
      ----fsfilt_ext3_commit_wait
      ------autoremove_wake_function
      --------fsfilt_log_wait_commit = jbd2_log_wait_commit
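
The four groups above can be tallied by their innermost blocking call to see where most threads pile up. The following is a minimal sketch: the thread names and call names are transcribed from the traces, while the dictionary layout and the `summarize` helper are illustrative assumptions, not part of any Lustre tool.

```python
from collections import Counter

# Innermost blocking call for each hung thread, transcribed from the traces.
# The data layout is an illustrative assumption.
BLOCKED_AT = {
    "LOCK_INODE_MUTEX(dparent->d_inode)": [
        "ll_ost_01", "ll_ost_07", "ll_ost_09", "ll_ost_11",
        "ll_ost_13", "ll_ost_15", "ll_ost_18",
    ],
    "mutex_lock(&journal->j_checkpoint_mutex)": [
        "ll_ost_06", "ll_ost_16", "ll_ost_484",
        "ll_ost_io_129", "ll_ost_io_123", "ll_ost_383",
    ],
    "__wait_on_freeing_inode": ["ll_ost_17"],
    "jbd2_log_wait_commit": ["ll_ost_io_15"],
}


def summarize(blocked_at):
    """Return (blocking call, thread count) pairs, most threads first."""
    counts = Counter({call: len(threads) for call, threads in blocked_at.items()})
    return counts.most_common()


if __name__ == "__main__":
    for call, n in summarize(BLOCKED_AT):
        print(f"{n:2d} thread(s) blocked in {call}")
```

Most of the listed threads wait either on the parent directory's inode mutex or on the journal checkpoint mutex; the tally by itself does not show which thread holds either lock.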

      We suspect this is not necessarily a problem in the Lustre code. We found a recently merged patch that fixes a similar deadlock in __jbd2_log_wait_for_space(). Could that be the root cause?

      https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/fs/jbd2/checkpoint.c?id=0ef54180e0187117062939202b96faf04c8673bc
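
For readers unfamiliar with this class of hang, here is a generic user-space sketch of one way a thread that sleeps while holding a mutex can stall everything behind it. It is not a reproduction of the linked commit or of the jbd2 code: `checkpoint_mutex`, `commit_done`, and the assumption that the committing side needs the same mutex are hypothetical stand-ins chosen only to illustrate the pattern.

```python
import threading


def both_finish(drop_mutex_while_waiting, timeout=0.5):
    """Run one waiter and one committer; return True if both finish in time."""
    checkpoint_mutex = threading.Lock()   # hypothetical stand-in for a checkpoint mutex
    commit_done = threading.Event()       # hypothetical stand-in for "commit finished"
    waiter_holds_mutex = threading.Event()

    def waiter():
        checkpoint_mutex.acquire()
        waiter_holds_mutex.set()
        if drop_mutex_while_waiting:
            checkpoint_mutex.release()    # let other threads make progress
            commit_done.wait()
            checkpoint_mutex.acquire()
        else:
            commit_done.wait()            # sleeps while still holding the mutex
        checkpoint_mutex.release()

    def committer():
        waiter_holds_mutex.wait()
        with checkpoint_mutex:            # assumption: the commit path needs the mutex
            pass
        commit_done.set()

    threads = [
        threading.Thread(target=waiter, daemon=True),
        threading.Thread(target=committer, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)
    finished = not any(t.is_alive() for t in threads)

    commit_done.set()                     # unblock a stuck waiter so both threads exit
    for t in threads:
        t.join()
    return finished


if __name__ == "__main__":
    print("hold mutex while waiting:", both_finish(False))
    print("drop mutex while waiting:", both_finish(True))
```

In this sketch the pair only completes when the waiter releases the mutex before sleeping. Whether the OSS hangs follow the same pattern is not established by the traces above.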

      Attachments

        1. 0001-LU-4090-fsfilt-don-t-wait-forever-for-stale-tid.patch
          6 kB
          Zhenyu Xu
        2. ALPL202.messages_20150518.txt
          307 kB
          Wang Shilong
        3. messages.ALPL401.txt
          144 kB
          Wang Shilong
        4. messages.ALPL402.txt
          1.12 MB
          Wang Shilong

          People

            bobijam Zhenyu Xu
            lixi Li Xi (Inactive)