[LU-11668] mdd_parent_fid()) ASSERTION( (((mdd_object_type(obj)) & 00170000) == 0040000) ) failed Created: 14/Nov/18 Updated: 01/Mar/20 Resolved: 29/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.14.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
I hit this assertion in current master-next testing but I don't see anything obvious included that would lead to it so perhaps it's some rare race that just happened to happen?
[ 6095.328424] Lustre: DEBUG MARKER: == racer test 1: racer on clients: centos-30.localnet DURATION=2700 ================================== 02:51:44 (1542181904)
[ 6097.825252] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000403:0x5:0x0], use llapi_layout_get_by_path()
[ 6101.235171] Lustre: DEBUG MARKER: racer test_1: @@@@@@ FAIL: generate lss conf (mds1)
[ 6106.472165] LustreError: 4856:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 448 for EA 496 (max_mdsize 496): rc = -34
[ 6108.575804] LustreError: 26511:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0001: small buffer size 448 for EA 472 (max_mdsize 472): rc = -34
[ 6361.959073] 9[28537]: segfault at 8 ip 00007f20a23dc958 sp 00007fffccffcf80 error 4 in ld-2.17.so[7f20a23d1000+22000]
[ 6469.162820] LustreError: 26494:0:(mdd_dir.c:222:mdd_parent_fid()) ASSERTION( (((mdd_object_type(obj)) & 00170000) == 0040000) ) failed:
[ 6469.214647] LustreError: 26494:0:(mdd_dir.c:222:mdd_parent_fid()) LBUG
[ 6469.215925] Pid: 26494, comm: mdt00_001 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[ 6469.219120] Call Trace:
[ 6469.222463] [<ffffffffa02637dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 6469.250486] [<ffffffffa026388c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 6469.251895] [<ffffffffa100ef22>] mdd_is_parent+0x4d2/0x510 [mdd]
[ 6469.253469] [<ffffffffa100f164>] mdd_is_subdir+0x204/0x240 [mdd]
[ 6469.315072] [<ffffffffa108f8a0>] mdt_reint_rename_internal.isra.47+0x810/0x2750 [mdt]
[ 6469.318228] [<ffffffffa109689b>] mdt_reint_rename_or_migrate.isra.51+0x19b/0x860 [mdt]
[ 6469.340401] [<ffffffffa1096f93>] mdt_reint_rename+0x13/0x20 [mdt]
[ 6469.358495] [<ffffffffa10984f0>] mdt_reint_rec+0x80/0x210 [mdt]
[ 6469.400446] [<ffffffffa1075882>] mdt_reint_internal+0x6b2/0xa50 [mdt]
[ 6469.405016] [<ffffffffa1080997>] mdt_reint+0x67/0x140 [mdt]
[ 6469.406310] [<ffffffffa05c3365>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[ 6469.412532] [<ffffffffa0567436>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
[ 6469.415111] [<ffffffffa056b329>] ptlrpc_main+0xa99/0x1f60 [ptlrpc]
[ 6469.416569] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 6469.417870] [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
[ 6469.419494] [<ffffffffffffffff>] 0xffffffffffffffff
[ 6469.420822] Kernel panic - not syncing: LBUG
crashdump: 192.168.123.130-2018-11-14-02:58:09
git source: 46bcdb588e22abf162af9a486107c7b59b438dd2 |
| Comments |
| Comment by Oleg Drokin [ 19/Nov/18 ] |
|
Hit this twice more so far, so it does appear to be a recent regression. |
| Comment by Peter Jones [ 20/Nov/18 ] |
|
Lai, could you please advise? Thanks, Peter |
| Comment by Andreas Dilger [ 21/Nov/18 ] |
|
It may just be that this is a case of rename being called on a regular file and the MDS not verifying that the "parent" is a directory before diving into the code? This should probably be verified early on in RPC handling, in mdt_reint_rename_internal() (or mdt_reint_rename_or_migrate() if we don't want to ever allow migrating regular files), but I couldn't find any checks like that. |
| Comment by Andreas Dilger [ 21/Nov/18 ] |
|
It would also be useful to improve the assertion here to tell us what the actual file type is:
@@ -219,7 +219,10 @@ static inline int mdd_parent_fid(const struct lu_env *env,
 	ENTRY;
-	LASSERT(S_ISDIR(mdd_object_type(obj)));
+	LASSERTF(S_ISDIR(mdd_object_type(obj)),
+		 "%s: FID "DFID" is not a directory type = %o\n",
+		 mdd_obj_dev_name(obj), PFID(mdd_object_fid(obj)),
+		 mdd_object_type(obj));
 	buf = lu_buf_check_and_alloc(buf, PATH_MAX);
 	if (buf->lb_buf == NULL) |
| Comment by Gerrit Updater [ 21/Nov/18 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33700 |
| Comment by Lai Siyao [ 21/Nov/18 ] |
|
I can't find any clue in the code, so let's see what type this object is. Since the parent FID is read from disk and the system may be inconsistent, in the future we may turn this assert into a check that returns an error if the object is not a directory. |
| Comment by Andreas Dilger [ 22/Nov/18 ] |
|
I think it makes sense to just check the parent type in the MDT code, since there could be all kinds of reasons that it is wrong. In this case, it is likely that racer moved or deleted a directory that the client was going to rename a file in, and another thread created a regular file in its place with the same name. The mdt_reint_rename_internal() code should just check the type after the parent is looked up, and return -ENOTDIR if it isn't a directory. The best place for that may be mdt_object_find_check(), since that is only called for parent directories, in which case it would be better renamed to mdt_parent_find_check() or similar. Could you please work on a patch today, as this is one of the last blockers for 2.12 that doesn't have a patch yet? |
| Comment by Gerrit Updater [ 22/Nov/18 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33709 |
| Comment by Gerrit Updater [ 29/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33700/ |
| Comment by Gerrit Updater [ 29/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33709/ |
| Comment by Peter Jones [ 29/Nov/18 ] |
|
Landed for 2.12 |
| Comment by Oleg Drokin [ 18/Jan/19 ] |
|
I hit this in current master-next with the new debug print. Running racer:
[ 5609.701558] LustreError: 29511:0:(mdd_dir.c:225:mdd_parent_fid()) ASSERTION( S_ISDIR(mdd_object_type(obj)) ) failed: lustre-MDD0000: FID [0x200000003:0xa:0x0] is not a directory type = 100000
[ 5609.713377] LustreError: 29511:0:(mdd_dir.c:225:mdd_parent_fid()) LBUG
[ 5609.714440] Pid: 29511, comm: mdt07_012 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[ 5609.716491] Call Trace:
[ 5609.717566] [<ffffffffa02077dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5609.719016] [<ffffffffa020788c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5609.722670] [<ffffffffa0fe5ab4>] mdd_parent_fid+0x374/0x3b0 [mdd]
[ 5609.724305] [<ffffffffa0fe5bc0>] mdd_is_parent+0xd0/0x1a0 [mdd]
[ 5609.725641] [<ffffffffa0fe5e94>] mdd_is_subdir+0x204/0x240 [mdd]
[ 5609.726669] [<ffffffffa10642d0>] mdt_reint_rename_internal.isra.46+0x810/0x2750 [mdt]
[ 5609.728468] [<ffffffffa106e14b>] mdt_reint_rename_or_migrate.isra.51+0x19b/0x860 [mdt]
[ 5609.730274] [<ffffffffa106e843>] mdt_reint_rename+0x13/0x20 [mdt]
[ 5609.731149] [<ffffffffa106e8d0>] mdt_reint_rec+0x80/0x210 [mdt]
[ 5609.732097] [<ffffffffa104b723>] mdt_reint_internal+0x6e3/0xab0 [mdt]
[ 5609.732988] [<ffffffffa10568e7>] mdt_reint+0x67/0x140 [mdt]
[ 5609.734283] [<ffffffffa05f5605>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[ 5609.735808] [<ffffffffa05993a9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[ 5609.737741] [<ffffffffa059d36c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc]
[ 5609.738705] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 5609.739563] [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
[ 5609.740589] [<ffffffffffffffff>] 0xffffffffffffffff
[ 5609.741650] Kernel panic - not syncing: LBUG |
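As a reading aid, the `type = 100000` value in the assertion message above is printed with `%o`, so it is an octal mode. The sketch below shows one way to decode such a value with the standard POSIX file-type macros. `file_type_name()` is a hypothetical helper written for this note, not a Lustre function, and the numeric values assume the usual Linux encoding (the same one behind the `00170000` mask and `0040000` directory value in the original assertion).

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: map the file-type bits of a mode to a short name.
 * Only a few common types are distinguished; everything else is "other".
 */
static const char *file_type_name(mode_t mode)
{
	if (S_ISDIR(mode))
		return "directory";
	if (S_ISREG(mode))
		return "regular file";
	if (S_ISLNK(mode))
		return "symlink";

	return "other";
}
```

Under that assumption, `file_type_name(0100000)` reports a regular file (with no permission bits set), which is why the `S_ISDIR()` assertion fails for this FID.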
| Comment by Gerrit Updater [ 03/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35047 |
| Comment by Gerrit Updater [ 01/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35047/ |