[LU-14862] Zero fid assignment post eviction followed by ASSERTION( name->name[0] != 0 ) failed Created: 19/Jul/21  Updated: 02/Sep/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like lmv_revalidate_slaves() does not validate the fid it receives in the loop (unlike many other places that check with fid_is_sane) and calls mdc_intent_lock on a zero fid causing the assertion to hit.

Now it look the problem is deeper somewhere since we even get said zero fid assigned somewhere, though I am not sure where. The trace below is from racer on latest master

 [ 2289.873955] LustreError: 27285:0:(file.c:5018:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000402:0x1:0x0] error: rc = -5
[ 2289.879247] LustreError: 27285:0:(file.c:5018:ll_inode_revalidate_fini()) Skipped 407 previous similar messages
[ 2290.097986] LustreError: 27109:0:(vvp_io.c:1793:vvp_io_init()) lustre: refresh file layout [0x200000406:0xd650:0x0] error -108.
[ 2293.106030] LustreError: 28772:0:(llite_nfs.c:342:ll_dir_get_parent_fid()) lustre: failure inode [0x240000409:0x4bab:0x0] get parent: rc = -116
[ 2293.108749] LustreError: 28772:0:(llite_nfs.c:342:ll_dir_get_parent_fid()) Skipped 29 previous similar messages
[ 2303.467231] Lustre: 14085:0:(client.c:2285:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1626631396/real 1626631396]  req@ffff8800be874280 x1705644635655488/t0(0) o36->lustre-MDT0000-mdc-ffff8800d6af9800@192.168.201.101@tcp:12/10 lens 496/440 e 0 to 1 dl 1626631525 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'setfattr.0'
[ 2303.473349] Lustre: 14085:0:(client.c:2285:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
[ 2377.306136] LustreError: 13927:0:(llite_lib.c:1674:ll_update_lsm_md()) lustre: [0x200000404:0x1ea58:0x0] dir layout mismatch:
[ 2377.308485] LustreError: 13927:0:(lustre_lmv.h:134:lsm_md_dump()) magic 0xcd20cd0 stripe count 1 master mdt 0 hash type crush:0x2000003 max inherit 0 version 1 migrate offset 0 migrate hash 0x0 pool 
[ 2377.311901] LustreError: 13927:0:(lustre_lmv.h:141:lsm_md_dump()) stripe[0] [0x200000400:0x529:0x0]
[ 2377.313845] LustreError: 13927:0:(lustre_lmv.h:134:lsm_md_dump()) magic 0xcd20cd0 stripe count 1 master mdt 0 hash type crush:0x2000003 max inherit 0 version 1 migrate offset 0 migrate hash 0x0 pool 
[ 2377.317112] LustreError: 13927:0:(lustre_lmv.h:141:lsm_md_dump()) stripe[0] [0x0:0x0:0x0]
[ 2377.318945] LustreError: 13927:0:(llite_lib.c:2955:ll_prep_inode()) new_inode -fatal: rc -22
[ 2377.320628] LustreError: 13927:0:(llite_lib.c:2955:ll_prep_inode()) Skipped 19 previous similar messages
[ 2377.320805] LustreError: 13967:0:(ldlm_resource.c:1498:ldlm_resource_get()) ASSERTION( name->name[0] != 0 ) failed: 
[ 2377.320815] LustreError: 13967:0:(ldlm_resource.c:1498:ldlm_resource_get()) LBUG
[ 2377.320820] Pid: 13967, comm: ls 3.10.0-7.9-debug #1 SMP Mon Feb 1 17:33:41 EST 2021
[ 2377.320820] Call Trace:
[ 2377.320914] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[ 2377.320924] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2377.321103] [<0>] ldlm_resource_get+0x7d9/0x940 [ptlrpc]
[ 2377.321137] [<0>] ldlm_lock_create+0x55/0x9e0 [ptlrpc]
[ 2377.321177] [<0>] ldlm_cli_enqueue+0xc8/0x9e0 [ptlrpc]
[ 2377.321186] [<0>] mdc_enqueue_base+0x323/0x14a0 [mdc]
[ 2377.321191] [<0>] mdc_intent_lock+0x135/0x570 [mdc]
[ 2377.321213] [<0>] lmv_revalidate_slaves+0x452/0xbd0 [lmv]
[ 2377.321216] [<0>] lmv_merge_attr+0x45/0x1b0 [lmv]
[ 2377.321285] [<0>] ll_getattr_dentry+0x6c5/0x920 [lustre]
[ 2377.321296] [<0>] ll_getattr+0x1e/0x20 [lustre]
[ 2377.321300] [<0>] vfs_getattr+0x46/0x80
[ 2377.321301] [<0>] vfs_fstat+0x45/0x80
[ 2377.321303] [<0>] SYSC_newfstat+0x24/0x60
[ 2377.321305] [<0>] SyS_newfstat+0xe/0x10
[ 2377.321308] [<0>] system_call_fastpath+0x1f/0x24
[ 2377.321345] [<0>] 0xfffffffffffffffe
[ 2377.321347] Kernel panic - not syncing: LBUG

But when I was searching for similar crashes, I also arrived at this https://testing.whamcloud.com/test_sets/4fe20054-a07f-11e9-9e3d-52540065bddc which is essentially the same crash only we arrived there in a bit of a different way - not because so "dir layout mismatch" seems to be a bit of a red herring and a fallout from earlier incorrect fid assignment too.



 Comments   
Comment by Alex Zhuravlev [ 02/Sep/22 ]

I'm hitting this quite often in racer:

PID: 504341  TASK: ffff891905fb92c0  CPU: 0   COMMAND: "ls"
 #0 [ffff891897e4f790] panic at ffffffff820af786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff891897e4f820] ldlm_resource_get at ffffffffc0589652 [ptlrpc]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 156
 #2 [ffff891897e4f868] ldlm_lock_create at ffffffffc0581e76 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c: 1671
 #3 [ffff891897e4f8a8] ldlm_cli_enqueue at ffffffffc05a1cbd [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c: 1008
 #4 [ffff891897e4f928] mdc_enqueue_base at ffffffffc09061e7 [mdc]
    /home/lustre/master-mine/lustre/mdc/mdc_locks.c: 1006
 #5 [ffff891897e4fa28] mdc_intent_lock at ffffffffc090856a [mdc]
    /home/lustre/master-mine/lustre/mdc/mdc_locks.c: 1360
 #6 [ffff891897e4fac8] lmv_revalidate_slaves at ffffffffc0815ebc [lmv]
    /home/lustre/master-mine/lustre/lmv/lmv_intent.c: 225
 #7 [ffff891897e4fb88] lmv_merge_attr at ffffffffc080925d [lmv]
    /home/lustre/master-mine/lustre/lmv/lmv_obd.c: 3827
 #8 [ffff891897e4fbc0] ll_update_inode at ffffffffc0f7189a [lustre]
    /home/lustre/master-mine/lustre/llite/llite_lib.c: 1804
 #9 [ffff891897e4fc38] ll_prep_inode at ffffffffc0f74b7d [lustre]
    /home/lustre/master-mine/lustre/llite/llite_lib.c: 3245
#10 [ffff891897e4fd00] ll_revalidate_it_finish at ffffffffc0f3a10b [lustre]
    /home/lustre/master-mine/lustre/llite/dcache.c: 271
#11 [ffff891897e4fd48] ll_inode_revalidate at ffffffffc0f49b6e [lustre]
    /home/lustre/master-mine/lustre/llite/file.c: 5170
#12 [ffff891897e4fdc0] ll_getattr_dentry at ffffffffc0f58a28 [lustre]
    /home/lustre/master-mine/lustre/llite/file.c: 5247
#13 [ffff891897e4fdf8] __set_pages_np at ffffffff820a3310
    /tmp/kernel/arch/x86/mm/pageattr.c: 2303
#14 [ffff891897e4fe68] vfs_statx_fd at ffffffff821cab4f
    /tmp/kernel/./include/linux/file.h: 43
#15 [ffff891897e4fea0] __se_sys_newfstat at ffffffff821cae4a
Generated at Sat Feb 10 03:13:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.