[LU-13389] aborted file create may leave unattached inodes on MDS. Created: 25/Mar/20  Updated: 07/May/20  Resolved: 07/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

aborted mdtest job leaves unattached inodes:

Job Script: command started at Mon Mar 2 16:40:31 CST 2020

-- started at 03/02/2020 16:40:32 --

 

mdtest-1.9.3 was launched with 32 total task(s) on 1 node(s)

Command line used: /cray/css/ostest/binaries/xt/rel.70.aries.cray/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/mdtest/mdtest -f 32 -l 32 -n10000 -i1 -d /lus/snx11281/disk/ostest.vers/alsorun.20200302130005.752.saturn-p4/CL_mdtest_4s_fo.4.tjgw4U.1583188829/CL_mdtest_4s_fo -tuv

V-1: main: Setting create/stat/read/remove_only to True

V-1: Entering valid_tests...

barriers                : True

collective_creates      : False

create_only             : True

dirpath(s):

    /lus/snx11281/disk/ostest.vers/alsorun.20200302130005.752.saturn-p4/CL_mdtest_4s_fo.4.tjgw4U.1583188829/CL_mdtest_4s_fo

dirs_only               : True

read_bytes              : 0

read_only               : True

first                   : 32

files_only              : True

iterations              : 1

items_per_dir           : 0

last                    : 32

leaf_only               : False

items                   : 10000

nstride                 : 0

pre_delay               : 0

remove_only             : False

random_seed             : 0

stride                  : 1

shared_file             : False

time_unique_dir_overhead: True

stat_only               : True

unique_dir_per_task     : True

write_bytes             : 0

sync_file               : False

depth                   : 0

V-1: Entering display_freespace...

V-1: Entering show_file_system_size...

Path: /lus/snx11281/disk/ostest.vers/alsorun.20200302130005.752.saturn-p4/CL_mdtest_4s_fo.4.tjgw4U.1583188829

FS: 483.4 TiB   Used FS: 1.5%   Inodes: 487.0 Mi   Used Inodes: 0.1%

 

32 tasks, 320000 files/directories

 

   Operation               Duration              Rate

   ---------               --------              ----

V-1: main: * iteration 1 *

V-1: Entering create_remove_directory_tree, currDepth = 0...

V-1: Entering create_remove_directory_tree, currDepth = 1...

V-1: main:   Tree creation     :          0.011 sec,         88.658 ops/sec

V-1: Entering directory_test...

V-1: Entering unique_dir_access...

V-1: Entering create_remove_items, currDepth = 0...

V-1: Entering create_remove_items_helper...

V-1: Entering unique_dir_access...

V-1: Entering mdtest_stat...

V-1: Entering unique_dir_access...

V-1: Entering unique_dir_access...

V-1: Entering create_remove_items, currDepth = 0...

V-1: Entering create_remove_items_helper...

V-1: Entering unique_dir_access...

V-1:   Directory creation:         25.644 sec,      12478.741 ops/sec

V-1:   Directory stat    :          4.194 sec,      76299.371 ops/sec

V-1:   Directory removal :          8.918 sec,      35883.016 ops/sec

V-1: Entering file_test...

V-1: Entering unique_dir_access...

V-1: Entering create_remove_items, currDepth = 0...

V-1: Entering create_remove_items_helper...

aprun: Apid 5441901: Caught signal Terminated, sending to application

_pmiu_daemon(SIGCHLD): [NID 00545] [c0-1c2s8n1] [Mon Mar  2 18:15:52 2020] PE RANK 25 exit signal Terminated

Application 5441901 exit codes: 143

Application 5441901 resources: utime ~0s, stime ~6s, Rss ~11768, inblocks ~0, outblocks ~0

Job Script: command stopped at Mon Mar 2 18:16:38 CST 2020

Job Script: command runtime was 5767 seconds

e2fsck logs:

[root@snx11281n000 ~]# grep Unattached /home/admin/e2fsck.*
/home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056499
/home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056500
/home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056501
/home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056507
/home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056509
/home/admin/e2fsck.snx11281n002.3.4-010.81.202003091157.out:Unattached inode 3367056510
[root@snx11281n000 ~]#

unattached inode stat:

debugfs:  stat <3367056499>

Inode: 3367056499   Type: regular    Mode:  0644   Flags: 0x0

Generation: 3271840411    Version: 0x000001ee:00bad136

User:  1356   Group: 11121   Project:     0   Size: 0

File ACL: 0

Links: 1   Blockcount: 0

Fragment:  Address: 0    Number: 0    Size: 0

 ctime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

 atime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

 mtime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

crtime: 0x5e5da1e3:c1b98cf8 -- Mon Mar  2 18:16:35 2020

Size of extra inode fields: 32

Extended attributes:

  trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 84 46 01 00 00 00 00 00

  lma: fid=[0x20010116e:0x14684:0x0] compat=0 incompat=0

  trusted.lov (144)

  trusted.link (60)

BLOCKS:



debugfs:  stat <3367056500>

Inode: 3367056500   Type: regular    Mode:  0644   Flags: 0x0

Generation: 3271840414    Version: 0x000001ee:00bad13a

User:  1356   Group: 11121   Project:     0   Size: 0

File ACL: 0

Links: 1   Blockcount: 0

Fragment:  Address: 0    Number: 0    Size: 0

 ctime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

 atime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

 mtime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

crtime: 0x5e5da1e3:c1b98cf8 -- Mon Mar  2 18:16:35 2020

Size of extra inode fields: 32

Extended attributes:

  trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 85 46 01 00 00 00 00 00

  lma: fid=[0x20010116e:0x14685:0x0] compat=0 incompat=0

  trusted.lov (144)

  trusted.link (61)

BLOCKS:

 


debugfs:  stat <3367056501>

Inode: 3367056501   Type: regular    Mode:  0644   Flags: 0x0

Generation: 3271840413    Version: 0x000001ee:00bad143

User:  1356   Group: 11121   Project:     0   Size: 0

File ACL: 0

Links: 1   Blockcount: 0

Fragment:  Address: 0    Number: 0    Size: 0

 ctime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

 atime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

 mtime: 0x5e5d9f91:00000000 -- Mon Mar  2 18:06:41 2020

crtime: 0x5e5da1e3:c1b98cf8 -- Mon Mar  2 18:16:35 2020

Size of extra inode fields: 32

Extended attributes:

  trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 83 46 01 00 00 00 00 00

  lma: fid=[0x20010116e:0x14683:0x0] compat=0 incompat=0

  trusted.lov (96)

  trusted.link (61)

BLOCKS:

 


debugfs:  stat <3367056507>

Inode: 3367056507   Type: regular    Mode:  0644   Flags: 0x0

Generation: 3271840428    Version: 0x000001ee:00bad146

User:  1356   Group: 11121   Project:     0   Size: 0

File ACL: 0

Links: 1   Blockcount: 0

Fragment:  Address: 0    Number: 0    Size: 0

 ctime: 0x5e5d9f96:00000000 -- Mon Mar  2 18:06:46 2020

 atime: 0x5e5d9f96:00000000 -- Mon Mar  2 18:06:46 2020

 mtime: 0x5e5d9f96:00000000 -- Mon Mar  2 18:06:46 2020

crtime: 0x5e5da1e3:c1f69600 -- Mon Mar  2 18:16:35 2020

Size of extra inode fields: 32

Extended attributes:

  trusted.lma (24) = 00 00 00 00 00 00 00 00 6e 11 10 00 02 00 00 00 98 46 01 00 00 00 00 00

  lma: fid=[0x20010116e:0x14698:0x0] compat=0 incompat=0

  trusted.lov (144)

  trusted.link (61)

BLOCKS:

 

dmesg contains the following Lustre error message at the moment of inode creation:

Mar  2 18:16:35 snx11281n002 kernel: LustreError: 2640:0:(osd_handler.c:2009:osd_trans_stop()) snx11281-MDT0000: failed in transaction hook: rc = -114

LFSCK was able to successfully re-attach the inodes:

00100000:10000000:1.0:1583880053.552740:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14682:0x0] with the name file.mdtest.17.4708 and type 100000 to the parent [0x200101166:0x1731:0x0]: rc = 1
00100000:10000000:1.0:1583880053.626588:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14683:0x0] with the name file.mdtest.11.4748 and type 100000 to the parent [0x200101166:0x1727:0x0]: rc = 1
00100000:10000000:9.0:1583880053.699071:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14684:0x0] with the name file.mdtest.0.4749 and type 100000 to the parent [0x200101166:0x1730:0x0]: rc = 1
00100000:10000000:6.0F:1583880053.764322:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14685:0x0] with the name file.mdtest.25.4668 and type 100000 to the parent [0x200101166:0x1721:0x0]: rc = 1
00100000:10000000:7.0:1583880053.830815:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14686:0x0] with the name file.mdtest.26.4699 and type 100000 to the parent [0x200101166:0x1725:0x0]: rc = 1
00100000:10000000:9.0:1583880053.963978:0:31921:0:(lfsck_namespace.c:1243:lfsck_namespace_insert_normal()) snx11281-MDT0000-osd: namespace LFSCK insert object [0x20010116e:0x14698:0x0] with the name file.mdtest.22.4696 and type 100000 to the parent [0x200101166:0x1716:0x0]: rc = 1



 Comments   
Comment by Alexander Zarochentsev [ 25/Mar/20 ]

The issue is a regression from LU-11444:

A check for "obsolete" requests introduced in LU-11444 .
server code detects such requests by searching for in-progress requests for the same multimod rpc slot. The ones with lower XIDs occupying the same slot get marked "obsolete" b/c the application sent the request was interrupted and the slot was reused.
Processing of "obsoleted" requests is done as usual, but last_rcvd file is not updated.

The "rq_obsolete" flag causes osd_txn_stop() to return error

tgt_add_reply_data():
...
                if (req->rq_obsolete) {
                        mutex_unlock(&ted->ted_lcd_lock);
                        RETURN(-EALREADY);
                }

full call stack looks as

  • osd_trans_stop()
  • dt_txn_hook_stop()
  • tgt_txn_stop_cb()
  • tgt_last_rcvd_update()
  • tgt_mk_reply_data()
  • tgt_add_reply_data()
  • RETURN(-EALREADY);

and mdd_create() attempts to rollback the transaction:

out_stop:
        rc2 = mdd_trans_stop(env, mdd, rc, handle);
        if (rc == 0) {
                /* If creation fails, it is most likely due to the remote update
                 * failure, because local transaction will mostly succeed at
                 * this stage. There is no easy way to rollback all of previous
                 * updates, so let's remove the object from namespace, and
                 * LFSCK should handle the orphan object. */
                if (rc2 < 0 && !mdd_object_remote(mdd_pobj))
                        mdd_index_delete(env, mdd_pobj, attr, lname);
                rc = rc2;
        }

Unfortunately it is done partially, only name is removed. There are two bad side effects from that: fs inconsistency with unattached inodes and LFSCK is able to re-attach the inodes back to the namespace which is a bit unexpected.

I should note those "obsolete" requests may not survive a failover as the client already dropped the request from all import's queues. It means the request cannot participate in recovery and can be lost after a failover. So transaction rollback doesn't look as a wrong approach, it reduces chances of seeing fs object loss after failover. Also nobody depends on successful completion of the file create, the application already aborted.

I see several possible solutions to the problem:

1. rollback the transaction fully, that includes inode deletion and changelog record update or deletion.
2. leaving the file in the namespace accepting that it can be lost after a failover, it worked that way before LUS-7339 / LUS-6272 and caused no problems, but I might be wrong here.
3. leaving the inode unattached and somehow avoiding re-attaching them by LFSCK (by removal of LinkEA record ?).

Comment by Gerrit Updater [ 14/Apr/20 ]

Alexander Zarochentsev (alexander.zarochentsev@hpe.com) uploaded a new patch: https://review.whamcloud.com/38221
Subject: LU-13389 tgt: not rollback obsolete rq
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5c494fb023b934c24863dd76f3d2dc316c9fc687

Comment by Gerrit Updater [ 07/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38221/
Subject: LU-13389 tgt: not rollback obsolete rq
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e87e30460b18e73d1dbf2627e59485ae3d670e60

Comment by Peter Jones [ 07/May/20 ]

Landed for 2.14

Generated at Sat Feb 10 03:00:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.