[LU-11668] mdd_parent_fid()) ASSERTION( (((mdd_object_type(obj)) & 00170000) == 0040000) ) failed Created: 14/Nov/18  Updated: 01/Mar/20  Resolved: 29/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0, Lustre 2.14.0

Type: Improvement Priority: Major
Reporter: Oleg Drokin Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related

 Description   

I hit this assertion in current master-next testing, but I don't see anything obvious among the included patches that would lead to it, so perhaps it is a rare race that just happened to trigger?

 [ 6095.328424] Lustre: DEBUG MARKER: == racer test 1: racer on clients: centos-30.localnet DURATION=2700 ================================== 02:51:44 (1542181904)
 [ 6097.825252] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000403:0x5:0x0], use llapi_layout_get_by_path()
 [ 6101.235171] Lustre: DEBUG MARKER: racer test_1: @@@@@@ FAIL: generate lss conf (mds1)
 [ 6106.472165] LustreError: 4856:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 448 for EA 496 (max_mdsize 496): rc = -34
 [ 6108.575804] LustreError: 26511:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0001: small buffer size 448 for EA 472 (max_mdsize 472): rc = -34
 [ 6361.959073] 9[28537]: segfault at 8 ip 00007f20a23dc958 sp 00007fffccffcf80 error 4 in ld-2.17.so[7f20a23d1000+22000]
 [ 6469.162820] LustreError: 26494:0:(mdd_dir.c:222:mdd_parent_fid()) ASSERTION( (((mdd_object_type(obj)) & 00170000) == 0040000) ) failed: 
 [ 6469.214647] LustreError: 26494:0:(mdd_dir.c:222:mdd_parent_fid()) LBUG
 [ 6469.215925] Pid: 26494, comm: mdt00_001 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
 [ 6469.219120] Call Trace:
 [ 6469.222463]  [<ffffffffa02637dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
 [ 6469.250486]  [<ffffffffa026388c>] lbug_with_loc+0x4c/0xa0 [libcfs]
 [ 6469.251895]  [<ffffffffa100ef22>] mdd_is_parent+0x4d2/0x510 [mdd]
 [ 6469.253469]  [<ffffffffa100f164>] mdd_is_subdir+0x204/0x240 [mdd]
 [ 6469.315072]  [<ffffffffa108f8a0>] mdt_reint_rename_internal.isra.47+0x810/0x2750 [mdt]
 [ 6469.318228]  [<ffffffffa109689b>] mdt_reint_rename_or_migrate.isra.51+0x19b/0x860 [mdt]
 [ 6469.340401]  [<ffffffffa1096f93>] mdt_reint_rename+0x13/0x20 [mdt]
 [ 6469.358495]  [<ffffffffa10984f0>] mdt_reint_rec+0x80/0x210 [mdt]
 [ 6469.400446]  [<ffffffffa1075882>] mdt_reint_internal+0x6b2/0xa50 [mdt]
 [ 6469.405016]  [<ffffffffa1080997>] mdt_reint+0x67/0x140 [mdt]
 [ 6469.406310]  [<ffffffffa05c3365>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
 [ 6469.412532]  [<ffffffffa0567436>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
 [ 6469.415111]  [<ffffffffa056b329>] ptlrpc_main+0xa99/0x1f60 [ptlrpc]
 [ 6469.416569]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
 [ 6469.417870]  [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
 [ 6469.419494]  [<ffffffffffffffff>] 0xffffffffffffffff
 [ 6469.420822] Kernel panic - not syncing: LBUG

crashdump: 192.168.123.130-2018-11-14-02:58:09 git source: 46bcdb588e22abf162af9a486107c7b59b438dd2



 Comments   
Comment by Oleg Drokin [ 19/Nov/18 ]

Hit this twice more so far, so it does appear to be a recent regression.

Comment by Peter Jones [ 20/Nov/18 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 21/Nov/18 ]

This may just be a case of rename being called on a regular file, with the MDS not verifying that the "parent" is a directory before diving into the code. This should probably be verified early in RPC handling, such as in mdt_reint_rename_internal() (or mdt_reint_rename_or_migrate() if we never want to allow migrating regular files), but I couldn't find any checks like that.

Comment by Andreas Dilger [ 21/Nov/18 ]

It would also be useful to improve the assertion here to tell us what the actual file type is:

@@ -219,7 +219,10 @@ static inline int mdd_parent_fid(const struct lu_env *env,
 
        ENTRY;
 
-       LASSERT(S_ISDIR(mdd_object_type(obj)));
+       LASSERTF(S_ISDIR(mdd_object_type(obj)),
+                "%s: FID "DFID" is not a directory type = %o\n",
+                mdd_obj_dev_name(obj), PFID(mdd_object_fid(obj)),
+                mdd_object_type(obj));
 
        buf = lu_buf_check_and_alloc(buf, PATH_MAX);
        if (buf->lb_buf == NULL)

Comment by Gerrit Updater [ 21/Nov/18 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33700
Subject: LU-11668 debug: print object type in mdd_parent_fid
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4c91e284d998de40fd08bb18e046fb23bd0044d

Comment by Lai Siyao [ 21/Nov/18 ]

I can't find any clue in the code, so let's see what type this object is. Since the parent FID is read from disk and the system may be inconsistent, in the future we may turn this assertion into a check that returns an error if the object is not a directory.

Comment by Andreas Dilger [ 22/Nov/18 ]

I think it makes sense to just check the parent type in the MDT code, since there could be all kinds of reasons that it is wrong. In this case, it is likely that racer moved or deleted a directory that the client was going to rename a file in, and another thread created a regular file in its place with the same name. The mdt_reint_rename_internal() code should just check the type after the parent is looked up, and return -ENOTDIR if it isn't a directory.

The best place for that may be mdt_object_find_check(), since it is only called for parent directories, in which case it would be better to rename it to mdt_parent_find_check() or similar.

Could you please work on a patch today, as this is one of the last blockers for 2.12 that doesn't have a patch yet.

Comment by Gerrit Updater [ 22/Nov/18 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33709
Subject: LU-11668 mdt: check parent type in rename/migrate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3ede5bf5d6539ac2a8ca4831722bff367d1aed68

Comment by Gerrit Updater [ 29/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33700/
Subject: LU-11668 debug: print object type in mdd_parent_fid
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9a4a99b81a267a098b92ec10991af27b7f3cae7e

Comment by Gerrit Updater [ 29/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33709/
Subject: LU-11668 mdt: check parent type in rename/migrate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 214b12adc315c4adc3c56deb7e790fdc6f0095c8

Comment by Peter Jones [ 29/Nov/18 ]

Landed for 2.12

Comment by Oleg Drokin [ 18/Jan/19 ]

I hit this in current master-next with the new debug print. Running racer:

[ 5609.701558] LustreError: 29511:0:(mdd_dir.c:225:mdd_parent_fid()) ASSERTION( S_ISDIR(mdd_object_type(obj)) ) failed: lustre-MDD0000: FID [0x200000003:0xa:0x0] is not a directory type = 100000
[ 5609.713377] LustreError: 29511:0:(mdd_dir.c:225:mdd_parent_fid()) LBUG
[ 5609.714440] Pid: 29511, comm: mdt07_012 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[ 5609.716491] Call Trace:
[ 5609.717566]  [<ffffffffa02077dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5609.719016]  [<ffffffffa020788c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5609.722670]  [<ffffffffa0fe5ab4>] mdd_parent_fid+0x374/0x3b0 [mdd]
[ 5609.724305]  [<ffffffffa0fe5bc0>] mdd_is_parent+0xd0/0x1a0 [mdd]
[ 5609.725641]  [<ffffffffa0fe5e94>] mdd_is_subdir+0x204/0x240 [mdd]
[ 5609.726669]  [<ffffffffa10642d0>] mdt_reint_rename_internal.isra.46+0x810/0x2750 [mdt]
[ 5609.728468]  [<ffffffffa106e14b>] mdt_reint_rename_or_migrate.isra.51+0x19b/0x860 [mdt]
[ 5609.730274]  [<ffffffffa106e843>] mdt_reint_rename+0x13/0x20 [mdt]
[ 5609.731149]  [<ffffffffa106e8d0>] mdt_reint_rec+0x80/0x210 [mdt]
[ 5609.732097]  [<ffffffffa104b723>] mdt_reint_internal+0x6e3/0xab0 [mdt]
[ 5609.732988]  [<ffffffffa10568e7>] mdt_reint+0x67/0x140 [mdt]
[ 5609.734283]  [<ffffffffa05f5605>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[ 5609.735808]  [<ffffffffa05993a9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[ 5609.737741]  [<ffffffffa059d36c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc]
[ 5609.738705]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 5609.739563]  [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
[ 5609.740589]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 5609.741650] Kernel panic - not syncing: LBUG

Comment by Gerrit Updater [ 03/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35047
Subject: LU-11668 mdd: use mdd_object_fid() instead of mdo2fid()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cfa0e83b83cca5b47c005d3934ee1abba07313ba

Comment by Gerrit Updater [ 01/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35047/
Subject: LU-11668 mdd: use mdd_object_fid() instead of mdo2fid()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7de9babe6f9af6dfdb20360211f8ecea344b0500
