Lustre / LU-3727

LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body->valid & OBD_MD_FLID) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.1.5, Lustre 1.8.9, Lustre 2.4.1
    • Severity: 3
    • 9597

    Description

      At GE Global Research, we ran into an LBUG on a 1.8.9 client that is re-exporting a 2.1.5 Lustre filesystem over NFS:

      Jul 31 10:26:46 scinfra3 kernel: Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
      Jul 31 10:26:46 scinfra3 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
      Jul 31 10:26:46 scinfra3 kernel: NFSD: starting 90-second grace period
      Jul 31 10:26:53 scinfra3 ntpd[8318]: synchronized to 3.40.208.30, stratum 2
      Jul 31 10:29:46 scinfra3 kernel: LustreError: 27396:0:(llite_nfs.c:281:ll_get_parent()) ASSERTION(body->valid & OBD_MD_FLID) failed
      Jul 31 10:29:46 scinfra3 kernel: LustreError: 27396:0:(llite_nfs.c:281:ll_get_parent()) LBUG
      Jul 31 10:29:46 scinfra3 kernel: Pid: 27396, comm: nfsd
      Jul 31 10:29:46 scinfra3 kernel:
      Jul 31 10:29:46 scinfra3 kernel: Call Trace:
      Jul 31 10:29:46 scinfra3 kernel: [ ] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] lbug_with_loc+0x7a/0xd0 [libcfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] tracefile_init+0x0/0x110 [libcfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] ll_get_parent+0x1e3/0x2b0 [lustre]
      Jul 31 10:29:46 scinfra3 kernel: [ ] ll_get_dentry+0x6b/0xe0 [lustre]
      Jul 31 10:29:46 scinfra3 kernel: [ ] mutex_lock+0xd/0x1d
      Jul 31 10:29:46 scinfra3 kernel: [ ] find_exported_dentry+0x241/0x486 [exportfs]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd_acceptable+0x0/0xdc [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] autoremove_wake_function+0x0/0x2e
      Jul 31 10:29:46 scinfra3 kernel: [ ] sunrpc_cache_lookup+0x4b/0x128 [sunrpc]
      Jul 31 10:29:46 scinfra3 kernel: [ ] exp_get_by_name+0x5b/0x71 [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] exp_find_key+0x89/0x9c [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd_acceptable+0x0/0xdc [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] ll_decode_fh+0x197/0x240 [lustre]
      Jul 31 10:29:46 scinfra3 kernel: [ ] set_current_groups+0x116/0x164
      Jul 31 10:29:46 scinfra3 kernel: [ ] fh_verify+0x29c/0x4cf [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd3_proc_getattr+0x8a/0xbe [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd_dispatch+0xd8/0x1d6 [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] svc_process+0x3f8/0x6bf [sunrpc]
      Jul 31 10:29:46 scinfra3 kernel: [ ] __down_read+0x12/0x92
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x0/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x1a5/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] child_rip+0xa/0x11
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x0/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] nfsd+0x0/0x2cb [nfsd]
      Jul 31 10:29:46 scinfra3 kernel: [ ] child_rip+0x0/0x11
      Jul 31 10:29:46 scinfra3 kernel:
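
      For readers less familiar with this path: roughly speaking, ll_get_parent() on the client asks the MDS to look up ".." and expects the reply body to carry the parent's identifier. The block below is a self-contained illustration of the failure mode only (plain user-space C, not the actual llite or MDS code; the flag value, struct, and function names are stand-ins), showing how a server that reports success while leaving the reply unfilled turns into a client-side assertion instead of an error returned to knfsd:

      #include <assert.h>
      #include <stdio.h>

      /* Stand-in for the OBD_MD_FLID "the ID field in this reply is valid"
       * flag; the real value lives in the Lustre headers. */
      #define OBD_MD_FLID 0x00000001ULL

      /* Minimal stand-in for the getattr reply body. */
      struct fake_mdt_body {
              unsigned long long valid;   /* bitmask of fields the server filled in */
      };

      /* Models the buggy server behaviour discussed in the comments below:
       * an internal error (e.g. -EACCES) is swallowed, so the RPC completes
       * with rc = 0 but the reply body is never filled in. */
      static int fake_getattr_dotdot(struct fake_mdt_body *body)
      {
              body->valid = 0;   /* OBD_MD_FLID never gets set */
              return 0;          /* ...yet "success" is reported */
      }

      int main(void)
      {
              struct fake_mdt_body body;

              if (fake_getattr_dotdot(&body) != 0)
                      return 1;  /* a real error would be handled gracefully */

              /* Mirrors the LASSERT in ll_get_parent(): rc was 0, so the
               * client trusts the reply, and the empty body becomes an LBUG
               * instead of an error returned to knfsd. */
              assert(body.valid & OBD_MD_FLID);
              printf("parent identifier would be extracted here\n");
              return 0;
      }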

      It appears to be easily reproducible. We are going to try to get a core dump, but I was wondering if there is anything obvious from this trace, or any other Jira tickets I might have missed. Also, is there any other information that might be useful?

      Thanks.

      Attachments

        1. log.txt
          44 kB
        2. log.unlink08.lctl.dk.out.gz
          3.52 MB
        3. lustre.log
          3.60 MB
        4. unlink08.c
          10 kB

        Issue Links

          Activity

            [LU-3727] LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body->valid & OBD_MD_FLID) failed

            Hi Alexey,

            I am sorry, maybe because of a lack of background knowledge, I don't understand the question well. Would you please explain it a little bit? And do you have any specific concerns about the patch?

            lixi Li Xi (Inactive) added a comment

            Any chance of an answer?

            shadow Alexey Lyashkov added a comment

            It might be worth noting that we hit this on 2.4.1. The ticket only lists 1.8.9/2.1.5.

            paf Patrick Farrell (Inactive) added a comment (edited)

            Hi Li,

            The main question is whether we need to set the intent disposition in the reply. Could you check how it is sent from the client, via mdc_intent_lock or some other path?

            shadow Alexey Lyashkov added a comment

            Hi Alexey,

            I agree that mdt_raw_lookup() should not return 1 all the time. The following patch tries to fix that too:
            http://review.whamcloud.com/#/c/7327

            lixi Li Xi (Inactive) added a comment

            MDS log during the test. The client LBUGged while running the unlink08 test from LTP, as described earlier.

            paf Patrick Farrell (Inactive) added a comment

            At Alexey's request, we reproduced this.

            Here's the procedure from our test engineer:

            1) Mount lustre on NFS server

            2) Start nfsserver daemon on NFS server

            3) Export the filesystem over NFS (sudo /usr/sbin/exportfs -i -o rw,insecure,no_root_squash,no_subtree_check,fsid=538 *:/extlus)

            4) Mount NFS on client (sudo mount perses-esl3:/extlus /tmp/lus)

            5) Run test using /tmp/lus

            Other than fsid=, the options are just what we usually use when testing NFS internally.

            Attaching logs shortly.

            paf Patrick Farrell (Inactive) added a comment

            Li,

            could you look into the MDT code to verify why that error isn't returned correctly to the client?
            From my point of view, it should be addressed in this block:

            #if 0
                    /* XXX is raw_lookup possible as intent operation? */
                    if (rc != 0) {
                            if (rc == -ENOENT)
                                    mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                            RETURN(rc);
                    } else
                            mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
            
                    repbody = req_capsule_server_get(info->mti_pill, &RMF_MDT_BODY);
            #endif
            

            Or we need to replace the 'RETURN(1);' with 'RETURN(rc);' at the end of the mdt_raw_lookup() function.
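
            For concreteness, a rough sketch of that second option (this is not the landed patch from http://review.whamcloud.com/#/c/7327; the helper names are taken from the #if 0 block quoted above, and the surrounding code in mdt_raw_lookup() may differ):

            /* Sketch only: propagate the lookup result instead of the
             * unconditional RETURN(1), and record the disposition so the
             * client can tell a negative lookup apart from an unfilled reply. */
            if (rc != 0) {
                    if (rc == -ENOENT)
                            mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
            } else {
                    mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
            }
            RETURN(rc);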

            shadow Alexey Lyashkov added a comment

            As I said before, the MDT generates an error internally:

            00000004:00000001:1.0:1382635559.670116:0:15672:0:(mdd_permission.c:309:__mdd_permission_internal()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
            00000004:00000001:1.0:1382635559.670117:0:15672:0:(mdd_dir.c:90:__mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
            00000004:00000001:1.0:1382635559.670117:0:15672:0:(mdd_dir.c:115:mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
            

            But that error isn't returned to the caller:

            00000004:00000001:1.0:1382635559.670119:0:15672:0:(mdt_handler.c:1273:mdt_getattr_name_lock()) Process leaving (rc=0 : 0 : 0)
            

            In that case the client correctly triggers a panic, since processing reported no error but the reply isn't filled in correctly.
            This bug should affect more than just NFS.
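
            As an aside, the three-part rc values in the debug lines above are one number printed in three notations; a quick stand-alone check (ordinary user-space C, not Lustre code) confirms that 18446744073709551603 is just -13, i.e. -EACCES, reinterpreted as an unsigned 64-bit integer:

            #include <stdio.h>
            #include <stdint.h>

            int main(void)
            {
                    int64_t rc = -13;   /* -EACCES, as returned by mdd_lookup() above */

                    /* The same value in the three forms used by the debug log. */
                    printf("rc=%llu : %lld : %llx\n",
                           (unsigned long long)(uint64_t)rc,
                           (long long)rc,
                           (unsigned long long)(uint64_t)rc);
                    /* prints: rc=18446744073709551603 : -13 : fffffffffffffff3 */
                    return 0;
            }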

            shadow Alexey Lyashkov added a comment

            Li,

            Thanks again. The devil is in the details: we need an additional directory created in the exported directory.
            Without it, llite doesn't trigger the bug.

            shadow Alexey Lyashkov added a comment

            Li,

            thanks!

            shadow Alexey Lyashkov added a comment

            People

              Assignee: Emoly Liu (emoly.liu)
              Reporter: Oz Rentas (Inactive) (orentas)
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: