Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11685

removing file names and freeing inodes are not atomic on the OST

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Upstream
    • None
    • 3
    • 9223372036854775807

    Description

      Destroying objects on OSTs from ldiskfs layout POV includes removing them from their directories and freeing their inodes. Since in ext4 they are usually performed in different jbd2 transactions, the first part in ext4 also links the inode to the orphan list. However, in Lustre 2.11 and master there are two different transactions and no orphan list linkage between them.

                 <...>-112262 [000] .... 37386.078026: ldiskfs_free_inode <-ldiskfs_evict_inode
                 <...>-112262 [000] .... 37386.078050: <stack trace>
       => evict
       => iput
       => osd_object_delete
       => lu_object_free.isra.30
       => lu_object_put
       => ofd_destroy_by_fid
       => ofd_destroy_hdl
       => tgt_request_handle
       => ptlrpc_server_handle_request
       => ptlrpc_main
       => kthread
       => ret_from_fork

      The issue can be easily reproduced by forcing transaction commit between those two and crashing the system:

      diff --git a/lustre/ofd/ofd_objects.c b/lustre/ofd/ofd_objects.c
      index 7605de4..f973923 100644
      --- a/lustre/ofd/ofd_objects.c
      +++ b/lustre/ofd/ofd_objects.c
      @@ -876,6 +876,11 @@ stop:
                      rc = rc2;
       unlock:
              ofd_write_unlock(env, fo);
      +
      +       set_current_state(TASK_UNINTERRUPTIBLE);
      +       schedule_timeout(msecs_to_jiffies(MSEC_PER_SEC * 6));
      +       BUG();
      +
              RETURN(rc);
       }
      

      e2fsck finds inconsistency on the OST image:

      [root@panda-vbox lustre-release]# e2fsck -f /dev/sdd
      e2fsck 1.42.13.wc6 (05-Feb-2017)
      lustre-OST0000: recovering journal
      Pass 1: Checking inodes, blocks, and sizes
      Deleted inode 134 has zero dtime.  Fix<y>? yes <==============
      Pass 2: Checking directory structure
      Pass 3: Checking directory connectivity
      Pass 4: Checking reference counts
      Pass 5: Checking group summary information
      Free blocks count wrong (80310, counted=79979).
      Fix<y>? yes
      Inode bitmap differences:  -134
      Fix<y>? yes
      Free inodes count wrong for group #0 (24835, counted=24836).
      Fix<y>? yes
      Free inodes count wrong (99987, counted=99668).
      Fix<y>? yes
      [QUOTA WARNING] Usage inconsistent for ID 0:actual (1392640, 323) != expected (1392640, 324)
      Update quota info for quota type 0<y>? yes
      [QUOTA WARNING] Usage inconsistent for ID 0:actual (1392640, 323) != expected (1392640, 324)
      Update quota info for quota type 1<y>? yes
      
      lustre-OST0000: ***** FILE SYSTEM WAS MODIFIED *****
      lustre-OST0000: 332/100000 files (0.3% non-contiguous), 20021/100000 blocks
      [root@panda-vbox lustre-release]#
      

      Before suggesting a patch, I'd like to discuss possible solutions. Should we try extending the transaction so as to include the final iput? Should we use any other approach?

      Attachments

        Issue Links

          Activity

            [LU-11685] removing file names and freeing inodes are not atomic on the OST

            I have a patch calling ext4_orphan_add() as ext4 does, but wasn't sure about performance impact as well.
            my thought was that on OST we don't care that much about destroy rate while on MDT most of regular files are empty and we could call ldiskfs_delete_inode() directly.

            bzzz Alex Zhuravlev added a comment - I have a patch calling ext4_orphan_add() as ext4 does, but wasn't sure about performance impact as well. my thought was that on OST we don't care that much about destroy rate while on MDT most of regular files are empty and we could call ldiskfs_delete_inode() directly.

            Thank you!

            panda Andrew Perepechko added a comment - Thank you!

            I meant to link LU-10048 here.

            Alex exported ext4_orphan_add() as part of that patch, which could be used here to avoid this issue. However, that may have a performance impact that we don't want.

            An initial patch was proposed to ext4 to multi-thread the orphan list along with further discussion on how to make that patch better at:
            https://patchwork.ozlabs.org/patch/461805

            adilger Andreas Dilger added a comment - I meant to link LU-10048 here. Alex exported ext4_orphan_add() as part of that patch, which could be used here to avoid this issue. However, that may have a performance impact that we don't want. An initial patch was proposed to ext4 to multi-thread the orphan list along with further discussion on how to make that patch better at: https://patchwork.ozlabs.org/patch/461805

            adilger, possibly, a typo? The nodemap issue looks unrelated.

            panda Andrew Perepechko added a comment - adilger , possibly, a typo? The nodemap issue looks unrelated.

            People

              wc-triage WC Triage
              panda Andrew Perepechko
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: