Lustre / LU-3727

LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body->valid & OBD_MD_FLID) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.1.5, Lustre 1.8.9, Lustre 2.4.1
    • Severity: 3
    • 9597

    Description

      At GE Global Research, we ran into an LBUG on a 1.8.9 client that is re-exporting a 2.1.5 Lustre filesystem over NFS:

      Jul 31 10:26:46 scinfra3 kernel: Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
      Jul 31 10:26:46 scinfra3 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
      Jul 31 10:26:46 scinfra3 kernel: NFSD: starting 90-second grace period
      Jul 31 10:26:53 scinfra3 ntpd[8318]: synchronized to 3.40.208.30, stratum 2
      Jul 31 10:29:46 scinfra3 kernel: LustreError: 27396:0:(llite_nfs.c:281:ll_get_parent()) ASSERTION(body->valid & OBD_MD_FLID) failed
      Jul 31 10:29:46 scinfra3 kernel: LustreError: 27396:0:(llite_nfs.c:281:ll_get_parent()) LBUG
      Jul 31 10:29:46 scinfra3 kernel: Pid: 27396, comm: nfsd
      Jul 31 10:29:46 scinfra3 kernel:
      Jul 31 10:29:46 scinfra3 kernel: Call Trace:
      Jul 31 10:29:46 scinfra3 kernel: [ ] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] lbug_with_loc+0x7a/0xd0 [libcfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] tracefile_init+0x0/0x110 [libcfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] ll_get_parent+0x1e3/0x2b0 [lustre]
      Jul 31 10:29:46 scinfra3 kernel: [ ] ll_get_dentry+0x6b/0xe0 [lustre]
      Jul 31 10:29:46 scinfra3 kernel: [ ] mutex_lock+0xd/0x1d
      Jul 31 10:29:46 scinfra3 kernel: [ ] find_exported_dentry+0x241/0x486 [exportfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd_acceptable+0x0/0xdc [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] autoremove_wake_function+0x0/0x2e
      Jul 31 10:29:46 scinfra3 kernel: [ ] sunrpc_cache_lookup+0x4b/0x128 [sunrpc]
      Jul 31 10:29:46 scinfra3 kernel: [ ] exp_get_by_name+0x5b/0x71 [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] exp_find_key+0x89/0x9c [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd_acceptable+0x0/0xdc [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] ll_decode_fh+0x197/0x240 [lustre]
      Jul 31 10:29:46 scinfra3 kernel: [ ] set_current_groups+0x116/0x164
      Jul 31 10:29:46 scinfra3 kernel: [ ] fh_verify+0x29c/0x4cf [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd3_proc_getattr+0x8a/0xbe [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd_dispatch+0xd8/0x1d6 [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] svc_process+0x3f8/0x6bf [sunrpc]
      Jul 31 10:29:46 scinfra3 kernel: [ ] __down_read+0x12/0x92
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x0/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x1a5/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] child_rip+0xa/0x11
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x0/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x0/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] child_rip+0x0/0x11
      Jul 31 10:29:46 scinfra3 kernel:
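
      For readers less familiar with this path: roughly speaking, ll_get_parent() on the client asks the MDS to look up ".." and expects the reply body to carry the parent's identifier. The block below is a self-contained illustration of the failure mode only (plain user-space C, not the actual llite or MDS code; the flag value, struct, and function names are stand-ins), showing how a server that reports success while leaving the reply unfilled turns into a client-side assertion instead of an error returned to knfsd:

      #include <assert.h>
      #include <stdio.h>

      /* Stand-in for the OBD_MD_FLID "the ID field in this reply is valid"
       * flag; the real value lives in the Lustre headers. */
      #define OBD_MD_FLID 0x00000001ULL

      /* Minimal stand-in for the getattr reply body. */
      struct fake_mdt_body {
              unsigned long long valid;   /* bitmask of fields the server filled in */
      };

      /* Models the buggy server behaviour discussed in the comments below:
       * an internal error (e.g. -EACCES) is swallowed, so the RPC completes
       * with rc = 0 but the reply body is never filled in. */
      static int fake_getattr_dotdot(struct fake_mdt_body *body)
      {
              body->valid = 0;   /* OBD_MD_FLID never gets set */
              return 0;          /* ...yet "success" is reported */
      }

      int main(void)
      {
              struct fake_mdt_body body;

              if (fake_getattr_dotdot(&body) != 0)
                      return 1;  /* a real error would be handled gracefully */

              /* Mirrors the LASSERT in ll_get_parent(): rc was 0, so the
               * client trusts the reply, and the empty body becomes an LBUG
               * instead of an error returned to knfsd. */
              assert(body.valid & OBD_MD_FLID);
              printf("parent identifier would be extracted here\n");
              return 0;
      }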

      It appears to be easily reproducible. We are going to try to get a core dump, but I was wondering if there is anything obvious from this trace, or any other Jira tickets I might have missed. Also, is there any other information that might be useful?

      Thanks.

      Attachments

        1. log.txt
          44 kB
        2. log.unlink08.lctl.dk.out.gz
          3.52 MB
        3. lustre.log
          3.60 MB
        4. unlink08.c
          10 kB

        Issue Links

          Activity

            [LU-3727] LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body->valid & OBD_MD_FLID) failed

            Hi Alexey,

            I am sorry, maybe because of a lack of background knowledge, I don't understand the question well. Would you please explain it a little bit? And do you have any specific concerns about the patch?

            lixi Li Xi (Inactive) added a comment

            Any chance of an answer?

            shadow Alexey Lyashkov added a comment

            It might be worth noting that we hit this on 2.4.1. The ticket only lists 1.8.9/2.1.5.

            paf Patrick Farrell (Inactive) added a comment (edited)

            Hi Li,

            The main question is whether we need to set the intent disposition in the reply. Could you check how it is sent from the client, via mdc_intent_lock or some other path?

            shadow Alexey Lyashkov added a comment

            Hi Alexey,

            I agree that mdt_raw_lookup() should not return 1 all the time. The following patch tries to fix that too:
            http://review.whamcloud.com/#/c/7327

            lixi Li Xi (Inactive) added a comment

            MDS log during the test. The client LBUGged while running the unlink08 test from LTP, as described earlier.

            paf Patrick Farrell (Inactive) added a comment

            At Alexey's request, we reproduced this.

            Here's the procedure from our test engineer:

            1) Mount lustre on NFS server

            2) Start nfsserver daemon on NFS server

            3) Export the filesystem over NFS (sudo /usr/sbin/exportfs -i -o rw,insecure,no_root_squash,no_subtree_check,fsid=538 *:/extlus)

            4) Mount NFS on client (sudo mount perses-esl3:/extlus /tmp/lus)

            5) Run test using /tmp/lus

            Other than fsid=, the options are just what we usually use when testing NFS internally.

            Attaching logs shortly.

            paf Patrick Farrell (Inactive) added a comment

            Li,

            could you look into the MDT code to verify why that error isn't returned correctly to the client?
            From my point of view, it should be addressed in this block:

            #if 0
                    /* XXX is raw_lookup possible as intent operation? */
                    if (rc != 0) {
                            if (rc == -ENOENT)
                                    mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                            RETURN(rc);
                    } else
                            mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
            
                    repbody = req_capsule_server_get(info->mti_pill, &RMF_MDT_BODY);
            #endif
            

            Or we need to replace the 'RETURN(1);' with 'RETURN(rc);' at the end of the mdt_raw_lookup() function.
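
            For concreteness, a rough sketch of that second option (this is not the landed patch from http://review.whamcloud.com/#/c/7327; the helper names are taken from the #if 0 block quoted above, and the surrounding code in mdt_raw_lookup() may differ):

            /* Sketch only: propagate the lookup result instead of the
             * unconditional RETURN(1), and record the disposition so the
             * client can tell a negative lookup apart from an unfilled reply. */
            if (rc != 0) {
                    if (rc == -ENOENT)
                            mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
            } else {
                    mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
            }
            RETURN(rc);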

            shadow Alexey Lyashkov added a comment

            As I said before, the MDT generates an error internally:

            00000004:00000001:1.0:1382635559.670116:0:15672:0:(mdd_permission.c:309:__mdd_permission_internal()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
            00000004:00000001:1.0:1382635559.670117:0:15672:0:(mdd_dir.c:90:__mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
            00000004:00000001:1.0:1382635559.670117:0:15672:0:(mdd_dir.c:115:mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
            

            But that error isn't returned to the caller:

            00000004:00000001:1.0:1382635559.670119:0:15672:0:(mdt_handler.c:1273:mdt_getattr_name_lock()) Process leaving (rc=0 : 0 : 0)
            

            In that case the client correctly triggers a panic, since processing reported no error but the reply isn't filled in correctly.
            This bug should affect more than just NFS.
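
            As an aside, the three-part rc values in the debug lines above are one number printed in three notations; a quick stand-alone check (ordinary user-space C, not Lustre code) confirms that 18446744073709551603 is just -13, i.e. -EACCES, reinterpreted as an unsigned 64-bit integer:

            #include <stdio.h>
            #include <stdint.h>

            int main(void)
            {
                    int64_t rc = -13;   /* -EACCES, as returned by mdd_lookup() above */

                    /* The same value in the three forms used by the debug log. */
                    printf("rc=%llu : %lld : %llx\n",
                           (unsigned long long)(uint64_t)rc,
                           (long long)rc,
                           (unsigned long long)(uint64_t)rc);
                    /* prints: rc=18446744073709551603 : -13 : fffffffffffffff3 */
                    return 0;
            }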

            shadow Alexey Lyashkov added a comment

            Li,

            Thanks again. The devil is in the details: we need an additional directory created in the exported directory.
            Without it, llite doesn't trigger the bug.

            shadow Alexey Lyashkov added a comment

            Li,

            thanks!

            shadow Alexey Lyashkov added a comment

            People

              Assignee: Emoly Liu (emoly.liu)
              Reporter: Oz Rentas (Inactive) (orentas)
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: