Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.15.0
-
None
-
3
Description
It looks like lmv_revalidate_slaves() does not validate the fid it receives in the loop (unlike many other places, which check with fid_is_sane()) and calls mdc_intent_lock() on a zero fid, which triggers the assertion.
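To illustrate the kind of check that seems to be missing, here is a minimal userspace sketch, not the actual Lustre code. Everything in it is an assumption made for illustration: struct fid is a simplified stand-in for struct lu_fid, fid_looks_sane() is a hypothetical and much weaker stand-in for fid_is_sane() (it only rejects NULL and a zero sequence), enqueue_lock() stands in for the lock enqueue path and mirrors the assertion from the trace, and revalidate_stripes() is a hypothetical name for the per-stripe loop. Returning -EINVAL on a bad stripe is just one possible policy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct lu_fid: sequence, object id, version. */
struct fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Hypothetical stand-in for fid_is_sane(); only rejects NULL and a
 * zero sequence, which is enough to catch [0x0:0x0:0x0]. */
static bool fid_looks_sane(const struct fid *fid)
{
	return fid != NULL && fid->seq != 0;
}

/* Stand-in for the enqueue path; mirrors the failing check
 * ASSERTION( name->name[0] != 0 ) from the trace. */
static int enqueue_lock(const struct fid *fid)
{
	assert(fid->seq != 0);
	return 0;
}

/* Hypothetical per-stripe loop: validate each fid before enqueueing,
 * and fail with -EINVAL instead of tripping the assertion. */
static int revalidate_stripes(const struct fid *stripes, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		int rc;

		if (!fid_looks_sane(&stripes[i]))
			return -EINVAL;

		rc = enqueue_lock(&stripes[i]);
		if (rc != 0)
			return rc;
	}
	return 0;
}
```

With a check like this, a stripe array containing [0x0:0x0:0x0] would produce an error return rather than an LBUG. It would not explain how the zero fid got there in the first place.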
Now it looks like the problem is deeper, since a zero fid gets assigned in the first place, though I am not sure where. The trace below is from racer on the latest master:
[ 2289.873955] LustreError: 27285:0:(file.c:5018:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000402:0x1:0x0] error: rc = -5
[ 2289.879247] LustreError: 27285:0:(file.c:5018:ll_inode_revalidate_fini()) Skipped 407 previous similar messages
[ 2290.097986] LustreError: 27109:0:(vvp_io.c:1793:vvp_io_init()) lustre: refresh file layout [0x200000406:0xd650:0x0] error -108.
[ 2293.106030] LustreError: 28772:0:(llite_nfs.c:342:ll_dir_get_parent_fid()) lustre: failure inode [0x240000409:0x4bab:0x0] get parent: rc = -116
[ 2293.108749] LustreError: 28772:0:(llite_nfs.c:342:ll_dir_get_parent_fid()) Skipped 29 previous similar messages
[ 2303.467231] Lustre: 14085:0:(client.c:2285:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1626631396/real 1626631396] req@ffff8800be874280 x1705644635655488/t0(0) o36->lustre-MDT0000-mdc-ffff8800d6af9800@192.168.201.101@tcp:12/10 lens 496/440 e 0 to 1 dl 1626631525 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'setfattr.0'
[ 2303.473349] Lustre: 14085:0:(client.c:2285:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
[ 2377.306136] LustreError: 13927:0:(llite_lib.c:1674:ll_update_lsm_md()) lustre: [0x200000404:0x1ea58:0x0] dir layout mismatch:
[ 2377.308485] LustreError: 13927:0:(lustre_lmv.h:134:lsm_md_dump()) magic 0xcd20cd0 stripe count 1 master mdt 0 hash type crush:0x2000003 max inherit 0 version 1 migrate offset 0 migrate hash 0x0 pool
[ 2377.311901] LustreError: 13927:0:(lustre_lmv.h:141:lsm_md_dump()) stripe[0] [0x200000400:0x529:0x0]
[ 2377.313845] LustreError: 13927:0:(lustre_lmv.h:134:lsm_md_dump()) magic 0xcd20cd0 stripe count 1 master mdt 0 hash type crush:0x2000003 max inherit 0 version 1 migrate offset 0 migrate hash 0x0 pool
[ 2377.317112] LustreError: 13927:0:(lustre_lmv.h:141:lsm_md_dump()) stripe[0] [0x0:0x0:0x0]
[ 2377.318945] LustreError: 13927:0:(llite_lib.c:2955:ll_prep_inode()) new_inode -fatal: rc -22
[ 2377.320628] LustreError: 13927:0:(llite_lib.c:2955:ll_prep_inode()) Skipped 19 previous similar messages
[ 2377.320805] LustreError: 13967:0:(ldlm_resource.c:1498:ldlm_resource_get()) ASSERTION( name->name[0] != 0 ) failed:
[ 2377.320815] LustreError: 13967:0:(ldlm_resource.c:1498:ldlm_resource_get()) LBUG
[ 2377.320820] Pid: 13967, comm: ls 3.10.0-7.9-debug #1 SMP Mon Feb 1 17:33:41 EST 2021
[ 2377.320820] Call Trace:
[ 2377.320914] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[ 2377.320924] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2377.321103] [<0>] ldlm_resource_get+0x7d9/0x940 [ptlrpc]
[ 2377.321137] [<0>] ldlm_lock_create+0x55/0x9e0 [ptlrpc]
[ 2377.321177] [<0>] ldlm_cli_enqueue+0xc8/0x9e0 [ptlrpc]
[ 2377.321186] [<0>] mdc_enqueue_base+0x323/0x14a0 [mdc]
[ 2377.321191] [<0>] mdc_intent_lock+0x135/0x570 [mdc]
[ 2377.321213] [<0>] lmv_revalidate_slaves+0x452/0xbd0 [lmv]
[ 2377.321216] [<0>] lmv_merge_attr+0x45/0x1b0 [lmv]
[ 2377.321285] [<0>] ll_getattr_dentry+0x6c5/0x920 [lustre]
[ 2377.321296] [<0>] ll_getattr+0x1e/0x20 [lustre]
[ 2377.321300] [<0>] vfs_getattr+0x46/0x80
[ 2377.321301] [<0>] vfs_fstat+0x45/0x80
[ 2377.321303] [<0>] SYSC_newfstat+0x24/0x60
[ 2377.321305] [<0>] SyS_newfstat+0xe/0x10
[ 2377.321308] [<0>] system_call_fastpath+0x1f/0x24
[ 2377.321345] [<0>] 0xfffffffffffffffe
[ 2377.321347] Kernel panic - not syncing: LBUG
But when I was searching for similar crashes, I also arrived at this: https://testing.whamcloud.com/test_sets/4fe20054-a07f-11e9-9e3d-52540065bddc which is essentially the same crash, only we arrived there in a bit of a different way. So the "dir layout mismatch" seems to be a bit of a red herring, and itself fallout from an earlier incorrect fid assignment.