Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13389

aborted file create may leave unattached inodes on MDS.

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      aborted mdtest job leaves unattached inodes:

      Job Script: command started at Mon Mar 2 16:40:31 CST 2020
      
      -- started at 03/02/2020 16:40:32 --
      
       
      
      mdtest-1.9.3 was launched with 32 total task(s) on 1 node(s)
      
      Command line used: /cray/css/ostest/binaries/xt/rel.70.aries.cray/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/mdtest/mdtest -f 32 -l 32 -n10000 -i1 -d /lus/snx11281/disk/ostest.vers/alsorun.20200302130005.752.saturn-p4/CL_mdtest_4s_fo.4.tjgw4U.1583188829/CL_mdtest_4s_fo -tuv
      
      V-1: main: Setting create/stat/read/remove_only to True
      
      V-1: Entering valid_tests...
      
      barriers                : True
      
      collective_creates      : False
      
      create_only             : True
      
      dirpath(s):
      
          /lus/snx11281/disk/ostest.vers/alsorun.20200302130005.752.saturn-p4/CL_mdtest_4s_fo.4.tjgw4U.1583188829/CL_mdtest_4s_fo
      
      dirs_only               : True
      
      read_bytes              : 0
      
      read_only               : True
      
      first                   : 32
      
      files_only              : True
      
      iterations              : 1
      
      items_per_dir           : 0
      
      last                    : 32
      
      leaf_only               : False
      
      items                   : 10000
      
      nstride                 : 0
      
      pre_delay               : 0
      
      remove_only             : False
      
      random_seed             : 0
      
      stride                  : 1
      
      shared_file             : False
      
      time_unique_dir_overhead: True
      
      stat_only               : True
      
      unique_dir_per_task     : True
      
      write_bytes             : 0
      
      sync_file               : False
      
      depth                   : 0
      
      V-1: Entering display_freespace...
      
      V-1: Entering show_file_system_size...
      
      Path: /lus/snx11281/disk/ostest.vers/alsorun.20200302130005.752.saturn-p4/CL_mdtest_4s_fo.4.tjgw4U.1583188829
      
      FS: 483.4 TiB   Used FS: 1.5%   Inodes: 487.0 Mi   Used Inodes: 0.1%
      
       
      
      32 tasks, 320000 files/directories
      
       
      
         Operation               Duration              Rate
      
         ---------               --------              ----
      
      V-1: main: * iteration 1 *
      
      V-1: Entering create_remove_directory_tree, currDepth = 0...
      
      V-1: Entering create_remove_directory_tree, currDepth = 1...
      
      V-1: main:   Tree creation     :          0.011 sec,         88.658 ops/sec
      
      V-1: Entering directory_test...
      
      V-1: Entering unique_dir_access...
      
      V-1: Entering create_remove_items, currDepth = 0...
      
      V-1: Entering create_remove_items_helper...
      
      V-1: Entering unique_dir_access...
      
      V-1: Entering mdtest_stat...
      
      V-1: Entering unique_dir_access...
      
      V-1: Entering unique_dir_access...
      
      V-1: Entering create_remove_items, currDepth = 0...
      
      V-1: Entering create_remove_items_helper...
      
      V-1: Entering unique_dir_access...
      
      V-1:   Directory creation:         25.644 sec,      12478.741 ops/sec
      
      V-1:   Directory stat    :          4.194 sec,      76299.371 ops/sec
      
      V-1:   Directory removal :          8.918 sec,      35883.016 ops/sec
      
      V-1: Entering file_test...
      
      V-1: Entering unique_dir_access...
      
      V-1: Entering create_remove_items, currDepth = 0...
      
      V-1: Entering create_remove_items_helper...
      
      aprun: Apid 5441901: Caught signal Terminated, sending to application
      
      _pmiu_daemon(SIGCHLD): [NID 00545] [c0-1c2s8n1] [Mon Mar  2 18:15:52 2020] PE RANK 25 exit signal Terminated
      
      Application 5441901 exit codes: 143
      
      Application 5441901 resources: utime ~0s, stime ~6s, Rss ~11768, inblocks ~0, outblocks ~0
      
      Job Script: command stopped at Mon Mar 2 18:16:38 CST 2020
      
      Job Script: command runtime was 5767 seconds
      
      

      e2fsck logs:

      [root@snx11281n000 ~]# grep Unattached /home/admin/e2fsck.*
      /home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056499
      /home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056500
      /home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056501
      /home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056507
      /home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056509
      /home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056510
      [root@snx11281n000 ~]#
      

      unattached inode stat:

      debugfs:  stat <3367056499>
      
      Inode: 3367056499   Type: regular    Mode:  0644   Flags: 0x0
      
      Generation: 3271840411    Version: 0x000001ee:00bad136
      
      User:  1356   Group: 11121   Project:     0   Size: 0
      
      File ACL: 0
      
      Links: 1   Blockcount: 0
      
      Fragment:  Address: 0    Number: 0    Size: 0
      
       ctime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
       atime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
       mtime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
      crtime: 0x5e5da1e3:c1b98cf8 -- Mon Mar  2 18:16:35 2020
      
      Size of extra inode fields: 32
      
      Extended attributes:
      
        trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 84 46 01 00 00 00 00 00
      
        lma: fid=[0x20010116e:0x14684:0x0] compat=0 incompat=0
      
        trusted.lov (144)
      
        trusted.link (60)
      
      BLOCKS:
      
      
      
      debugfs:  stat <3367056500>
      
      Inode: 3367056500   Type: regular    Mode:  0644   Flags: 0x0
      
      Generation: 3271840414    Version: 0x000001ee:00bad13a
      
      User:  1356   Group: 11121   Project:     0   Size: 0
      
      File ACL: 0
      
      Links: 1   Blockcount: 0
      
      Fragment:  Address: 0    Number: 0    Size: 0
      
       ctime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
       atime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
       mtime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
      crtime: 0x5e5da1e3:c1b98cf8 -- Mon Mar  2 18:16:35 2020
      
      Size of extra inode fields: 32
      
      Extended attributes:
      
        trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 85 46 01 00 00 00 00 00
      
        lma: fid=[0x20010116e:0x14685:0x0] compat=0 incompat=0
      
        trusted.lov (144)
      
        trusted.link (61)
      
      BLOCKS:
      
       
      
      
      debugfs:  stat <3367056501>
      
      Inode: 3367056501   Type: regular    Mode:  0644   Flags: 0x0
      
      Generation: 3271840413    Version: 0x000001ee:00bad143
      
      User:  1356   Group: 11121   Project:     0   Size: 0
      
      File ACL: 0
      
      Links: 1   Blockcount: 0
      
      Fragment:  Address: 0    Number: 0    Size: 0
      
       ctime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
       atime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
       mtime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020
      
      crtime: 0x5e5da1e3:c1b98cf8 -- Mon Mar  2 18:16:35 2020
      
      Size of extra inode fields: 32
      
      Extended attributes:
      
        trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 83 46 01 00 00 00 00 00
      
        lma: fid=[0x20010116e:0x14683:0x0] compat=0 incompat=0
      
        trusted.lov (96)
      
        trusted.link (61)
      
      BLOCKS:
      
       
      
      
      debugfs:  stat <3367056507>
      
      Inode: 3367056507   Type: regular    Mode:  0644   Flags: 0x0
      
      Generation: 3271840428    Version: 0x000001ee:00bad146
      
      User:  1356   Group: 11121   Project:     0   Size: 0
      
      File ACL: 0
      
      Links: 1   Blockcount: 0
      
      Fragment:  Address: 0    Number: 0    Size: 0
      
       ctime: 0x5e5d9f96:00000000 -- Mon Mar  2 18:06:46 2020
      
       atime: 0x5e5d9f96:00000000 -- Mon Mar  2 18:06:46 2020
      
       mtime: 0x5e5d9f96:00000000 -- Mon Mar  2 18:06:46 2020
      
      crtime: 0x5e5da1e3:c1f69600 -- Mon Mar  2 18:16:35 2020
      
      Size of extra inode fields: 32
      
      Extended attributes:
      
        trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 98 46 01 00 00 00 00 00
      
        lma: fid=[0x20010116e:0x14698:0x0] compat=0 incompat=0
      
        trusted.lov (144)
      
        trusted.link (61)
      
      BLOCKS:
      
       
      

      dmesg contains the following Lustre error message at the moment of inode creation:

      Mar  2 18:16:35 snx11281n002 kernel: LustreError: 2640:0:(osd_handler.c:2009:osd_trans_stop()) snx11281-MDT0000: failed in transaction hook: rc = -114
      

      LFSCK was able to successfully re-attach the inodes:

      00100000:10000000:1.0:1583880053.552740:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14682:0x0] with the name file.mdtest.17.4708 and type 100000 to the parent [0x200101166:0x1731:0x0]: rc = 1
      00100000:10000000:1.0:1583880053.626588:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14683:0x0] with the name file.mdtest.11.4748 and type 100000 to the parent [0x200101166:0x1727:0x0]: rc = 1
      00100000:10000000:9.0:1583880053.699071:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14684:0x0] with the name file.mdtest.0.4749 and type 100000 to the parent [0x200101166:0x1730:0x0]: rc = 1
      00100000:10000000:6.0F:1583880053.764322:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14685:0x0] with the name file.mdtest.25.4668 and type 100000 to the parent [0x200101166:0x1721:0x0]: rc = 1
      00100000:10000000:7.0:1583880053.830815:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14686:0x0] with the name file.mdtest.26.4699 and type 100000 to the parent [0x200101166:0x1725:0x0]: rc = 1
      00100000:10000000:9.0:1583880053.963978:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14698:0x0] with the name file.mdtest.22.4696 and type 100000 to the parent [0x200101166:0x1716:0x0]: rc = 1
      
      

      Attachments

        Activity

          [LU-13389] aborted file create may leave unattached inodes on MDS.
          pjones Peter Jones added a comment -

          Landed for 2.14

          pjones Peter Jones added a comment - Landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38221/
          Subject: LU-13389 tgt: not rollback obsolete rq
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: e87e30460b18e73d1dbf2627e59485ae3d670e60

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38221/ Subject: LU-13389 tgt: not rollback obsolete rq Project: fs/lustre-release Branch: master Current Patch Set: Commit: e87e30460b18e73d1dbf2627e59485ae3d670e60

          Alexander Zarochentsev (alexander.zarochentsev@hpe.com) uploaded a new patch: https://review.whamcloud.com/38221
          Subject: LU-13389 tgt: not rollback obsolete rq
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 5c494fb023b934c24863dd76f3d2dc316c9fc687

          gerrit Gerrit Updater added a comment - Alexander Zarochentsev (alexander.zarochentsev@hpe.com) uploaded a new patch: https://review.whamcloud.com/38221 Subject: LU-13389 tgt: not rollback obsolete rq Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 5c494fb023b934c24863dd76f3d2dc316c9fc687

          The issue is a regression from LU-11444:

          A check for "obsolete" requests introduced in LU-11444 .
          server code detects such requests by searching for in-progress requests for the same multimod rpc slot. The ones with lower XIDs occupying the same slot get marked "obsolete" b/c the application sent the request was interrupted and the slot was reused.
          Processing of "obsoleted" requests is done as usual, but last_rcvd file is not updated.

          The "rq_obsolete" flag causes osd_txn_stop() to return error

          tgt_add_reply_data():
          ...
                          if (req->rq_obsolete) {
                                  mutex_unlock(&ted->ted_lcd_lock);
                                  RETURN(-EALREADY);
                          }
          
          

          full call stack looks as

          • osd_trans_stop()
          • dt_txn_hook_stop()
          • tgt_txn_stop_cb()
          • tgt_last_rcvd_update()
          • tgt_mk_reply_data()
          • tgt_add_reply_data()
          • RETURN(-EALREADY);

          and mdd_create() attempts to rollback the transaction:

          out_stop:
                  rc2 = mdd_trans_stop(env, mdd, rc, handle);
                  if (rc == 0) {
                          /* If creation fails, it is most likely due to the remote update
                           * failure, because local transaction will mostly succeed at
                           * this stage. There is no easy way to rollback all of previous
                           * updates, so let's remove the object from namespace, and
                           * LFSCK should handle the orphan object. */
                          if (rc2 < 0 && !mdd_object_remote(mdd_pobj))
                                  mdd_index_delete(env, mdd_pobj, attr, lname);
                          rc = rc2;
                  }
          
          

          Unfortunately it is done partially, only name is removed. There are two bad side effects from that: fs inconsistency with unattached inodes and LFSCK is able to re-attach the inodes back to the namespace which is a bit unexpected.

          I should note those "obsolete" requests may not survive a failover as the client already dropped the request from all import's queues. It means the request cannot participate in recovery and can be lost after a failover. So transaction rollback doesn't look as a wrong approach, it reduces chances of seeing fs object loss after failover. Also nobody depends on successful completion of the file create, the application already aborted.

          I see several possible solutions to the problem:

          1. rollback the transaction fully, that includes inode deletion and changelog record update or deletion.
          2. leaving the file in the namespace accepting that it can be lost after a failover, it worked that way before LUS-7339 / LUS-6272 and caused no problems, but I might be wrong here.
          3. leaving the inode unattached and somehow avoiding re-attaching them by LFSCK (by removal of LinkEA record ?).

          zam Alexander Zarochentsev added a comment - The issue is a regression from LU-11444 : A check for "obsolete" requests introduced in LU-11444 . server code detects such requests by searching for in-progress requests for the same multimod rpc slot. The ones with lower XIDs occupying the same slot get marked "obsolete" b/c the application sent the request was interrupted and the slot was reused. Processing of "obsoleted" requests is done as usual, but last_rcvd file is not updated. The "rq_obsolete" flag causes osd_txn_stop() to return error tgt_add_reply_data(): ... if (req->rq_obsolete) { mutex_unlock(&ted->ted_lcd_lock); RETURN(-EALREADY); } full call stack looks as osd_trans_stop() dt_txn_hook_stop() tgt_txn_stop_cb() tgt_last_rcvd_update() tgt_mk_reply_data() tgt_add_reply_data() RETURN(-EALREADY); and mdd_create() attempts to rollback the transaction: out_stop: rc2 = mdd_trans_stop(env, mdd, rc, handle); if (rc == 0) { /* If creation fails, it is most likely due to the remote update * failure, because local transaction will mostly succeed at * this stage. There is no easy way to rollback all of previous * updates, so let's remove the object from namespace, and * LFSCK should handle the orphan object. */ if (rc2 < 0 && !mdd_object_remote(mdd_pobj)) mdd_index_delete(env, mdd_pobj, attr, lname); rc = rc2; } Unfortunately it is done partially, only name is removed. There are two bad side effects from that: fs inconsistency with unattached inodes and LFSCK is able to re-attach the inodes back to the namespace which is a bit unexpected. I should note those "obsolete" requests may not survive a failover as the client already dropped the request from all import's queues. It means the request cannot participate in recovery and can be lost after a failover. So transaction rollback doesn't look as a wrong approach, it reduces chances of seeing fs object loss after failover. Also nobody depends on successful completion of the file create, the application already aborted. I see several possible solutions to the problem: 1. rollback the transaction fully, that includes inode deletion and changelog record update or deletion. 2. leaving the file in the namespace accepting that it can be lost after a failover, it worked that way before LUS-7339 / LUS-6272 and caused no problems, but I might be wrong here. 3. leaving the inode unattached and somehow avoiding re-attaching them by LFSCK (by removal of LinkEA record ?).

          People

            zam Alexander Zarochentsev
            zam Alexander Zarochentsev
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: