Lustre / LU-14862

Zero fid assignment post eviction followed by ASSERTION( name->name[0] != 0 ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.0
    • None
    • 3

    Description

      It looks like lmv_revalidate_slaves() does not validate the FID it receives for each stripe in its loop (unlike many other places, which check with fid_is_sane()) and calls mdc_intent_lock() on a zero FID, which triggers the assertion.
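The kind of per-stripe check that seems to be missing can be sketched in user space as follows. This is only an illustration, not a proposed patch: `struct lu_fid` is reduced to its three fields, `sketch_fid_is_sane()` is a deliberately simplified stand-in for the real `fid_is_sane()`, and `sketch_revalidate_stripes()`, `sketch_count_lock()` and the choice of `-ESTALE` as the error code are all hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for Lustre's struct lu_fid: [seq:oid:ver]. */
struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/*
 * Hypothetical, reduced sanity check. The real fid_is_sane() is stricter
 * and knows about special FID ranges; this sketch only rejects a NULL
 * pointer and a zero sequence, which is enough to catch [0x0:0x0:0x0].
 */
static bool sketch_fid_is_sane(const struct lu_fid *fid)
{
	return fid != NULL && fid->f_seq != 0;
}

/* Hypothetical callback type standing in for the per-stripe lock request. */
typedef int (*stripe_lock_fn)(const struct lu_fid *fid);

/* Example callback: counts how many lock requests would have been sent. */
static int sketch_lock_calls;

static int sketch_count_lock(const struct lu_fid *fid)
{
	(void)fid;
	sketch_lock_calls++;
	return 0;
}

/*
 * Walk the stripe FIDs and refuse to issue a lock request for an insane
 * FID. Returns 0 on success, -ESTALE (an assumed error code) at the first
 * bad stripe, or the callback's own error.
 */
static int sketch_revalidate_stripes(const struct lu_fid *stripes,
				     size_t count, stripe_lock_fn lock)
{
	for (size_t i = 0; i < count; i++) {
		int rc;

		if (!sketch_fid_is_sane(&stripes[i]))
			return -ESTALE;

		rc = lock(&stripes[i]);
		if (rc != 0)
			return rc;
	}
	return 0;
}
```

Calling `sketch_revalidate_stripes()` with an array whose second entry is all zeroes sends one request for the first stripe and then returns the error instead of reaching the lock code. Such a check would only turn the crash into an error return; it would not explain where the zero FID comes from.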

      The problem seems to lie deeper, though, since a zero FID is being assigned somewhere in the first place, and I am not sure where. The trace below is from racer on the latest master:

      [ 2289.873955] LustreError: 27285:0:(file.c:5018:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000402:0x1:0x0] error: rc = -5
      [ 2289.879247] LustreError: 27285:0:(file.c:5018:ll_inode_revalidate_fini()) Skipped 407 previous similar messages
      [ 2290.097986] LustreError: 27109:0:(vvp_io.c:1793:vvp_io_init()) lustre: refresh file layout [0x200000406:0xd650:0x0] error -108.
      [ 2293.106030] LustreError: 28772:0:(llite_nfs.c:342:ll_dir_get_parent_fid()) lustre: failure inode [0x240000409:0x4bab:0x0] get parent: rc = -116
      [ 2293.108749] LustreError: 28772:0:(llite_nfs.c:342:ll_dir_get_parent_fid()) Skipped 29 previous similar messages
      [ 2303.467231] Lustre: 14085:0:(client.c:2285:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1626631396/real 1626631396]  req@ffff8800be874280 x1705644635655488/t0(0) o36->lustre-MDT0000-mdc-ffff8800d6af9800@192.168.201.101@tcp:12/10 lens 496/440 e 0 to 1 dl 1626631525 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'setfattr.0'
      [ 2303.473349] Lustre: 14085:0:(client.c:2285:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
      [ 2377.306136] LustreError: 13927:0:(llite_lib.c:1674:ll_update_lsm_md()) lustre: [0x200000404:0x1ea58:0x0] dir layout mismatch:
      [ 2377.308485] LustreError: 13927:0:(lustre_lmv.h:134:lsm_md_dump()) magic 0xcd20cd0 stripe count 1 master mdt 0 hash type crush:0x2000003 max inherit 0 version 1 migrate offset 0 migrate hash 0x0 pool 
      [ 2377.311901] LustreError: 13927:0:(lustre_lmv.h:141:lsm_md_dump()) stripe[0] [0x200000400:0x529:0x0]
      [ 2377.313845] LustreError: 13927:0:(lustre_lmv.h:134:lsm_md_dump()) magic 0xcd20cd0 stripe count 1 master mdt 0 hash type crush:0x2000003 max inherit 0 version 1 migrate offset 0 migrate hash 0x0 pool 
      [ 2377.317112] LustreError: 13927:0:(lustre_lmv.h:141:lsm_md_dump()) stripe[0] [0x0:0x0:0x0]
      [ 2377.318945] LustreError: 13927:0:(llite_lib.c:2955:ll_prep_inode()) new_inode -fatal: rc -22
      [ 2377.320628] LustreError: 13927:0:(llite_lib.c:2955:ll_prep_inode()) Skipped 19 previous similar messages
      [ 2377.320805] LustreError: 13967:0:(ldlm_resource.c:1498:ldlm_resource_get()) ASSERTION( name->name[0] != 0 ) failed: 
      [ 2377.320815] LustreError: 13967:0:(ldlm_resource.c:1498:ldlm_resource_get()) LBUG
      [ 2377.320820] Pid: 13967, comm: ls 3.10.0-7.9-debug #1 SMP Mon Feb 1 17:33:41 EST 2021
      [ 2377.320820] Call Trace:
      [ 2377.320914] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [ 2377.320924] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 2377.321103] [<0>] ldlm_resource_get+0x7d9/0x940 [ptlrpc]
      [ 2377.321137] [<0>] ldlm_lock_create+0x55/0x9e0 [ptlrpc]
      [ 2377.321177] [<0>] ldlm_cli_enqueue+0xc8/0x9e0 [ptlrpc]
      [ 2377.321186] [<0>] mdc_enqueue_base+0x323/0x14a0 [mdc]
      [ 2377.321191] [<0>] mdc_intent_lock+0x135/0x570 [mdc]
      [ 2377.321213] [<0>] lmv_revalidate_slaves+0x452/0xbd0 [lmv]
      [ 2377.321216] [<0>] lmv_merge_attr+0x45/0x1b0 [lmv]
      [ 2377.321285] [<0>] ll_getattr_dentry+0x6c5/0x920 [lustre]
      [ 2377.321296] [<0>] ll_getattr+0x1e/0x20 [lustre]
      [ 2377.321300] [<0>] vfs_getattr+0x46/0x80
      [ 2377.321301] [<0>] vfs_fstat+0x45/0x80
      [ 2377.321303] [<0>] SYSC_newfstat+0x24/0x60
      [ 2377.321305] [<0>] SyS_newfstat+0xe/0x10
      [ 2377.321308] [<0>] system_call_fastpath+0x1f/0x24
      [ 2377.321345] [<0>] 0xfffffffffffffffe
      [ 2377.321347] Kernel panic - not syncing: LBUG
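To make the link between the zero stripe FID and the failed assertion explicit, here is a small user-space model. It assumes the lock resource name is derived from the FID with the sequence in name[0] and the version and object id packed into name[1]; the types and function names (`sketch_res_id`, `sketch_fid_to_res_id()`, `sketch_res_id_is_usable()`) are hypothetical and only mirror the asserted condition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for Lustre's struct lu_fid: [seq:oid:ver]. */
struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

#define SKETCH_RES_NAME_SIZE 4

/* Hypothetical model of a lock resource name: a small array of 64-bit words. */
struct sketch_res_id {
	uint64_t name[SKETCH_RES_NAME_SIZE];
};

/*
 * Assumed mapping from a regular FID to a resource name:
 *   name[0] = sequence
 *   name[1] = (version << 32) | object id
 * The remaining words stay zero.
 */
static struct sketch_res_id sketch_fid_to_res_id(const struct lu_fid *fid)
{
	struct sketch_res_id res = { { 0 } };

	res.name[0] = fid->f_seq;
	res.name[1] = ((uint64_t)fid->f_ver << 32) | fid->f_oid;
	return res;
}

/* Boolean form of the asserted condition name->name[0] != 0. */
static bool sketch_res_id_is_usable(const struct sketch_res_id *res)
{
	return res->name[0] != 0;
}
```

Under that assumption, the stripe FID [0x0:0x0:0x0] from the second lsm_md_dump() output maps to a resource name whose first word is zero, which is exactly the condition the assertion rejects, while [0x200000400:0x529:0x0] from the first dump maps to a usable name.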

      While searching for similar crashes, I also found https://testing.whamcloud.com/test_sets/4fe20054-a07f-11e9-9e3d-52540065bddc, which is essentially the same crash, only reached in a slightly different way. So the "dir layout mismatch" seems to be a bit of a red herring and is itself fallout from the earlier incorrect FID assignment.

          People

            wc-triage WC Triage
            green Oleg Drokin