[LU-3717] Kernel panic in ll_encode_fh() while testing file handle syscalls on FC18 client Created: 07/Aug/13  Updated: 04/Dec/13  Resolved: 04/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Swapnil Pimpale (Inactive) Assignee: Jian Yu
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-4231 NFS reexport leads to LBUG in mainlin... Resolved
Related
is related to LU-3344 Add open_by_handle() syscall for Lust... Resolved
Severity: 3
Rank (Obsolete): 9574

 Description   

Hit a kernel panic while testing the new file handle syscalls (name_to_handle_at()/open_by_handle_at()).

To reproduce, follow these steps:
1) Apply the patch (http://review.whamcloud.com/#/c/7247/), which adds a new file, lustre/tests/check_fhandle_syscalls.c
2) Compile the Lustre client
3) Set up Lustre (sh lustre/tests/llmount.sh)
4) Create a temporary file in the file system (echo "testing new syscalls" > /mnt/lustre/temp_file)
5) Run the test utility as follows:
cd lustre/tests;
./check_fhandle_syscalls temp_file /mnt/lustre
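
For orientation, the user-space side of such a test can be sketched as below. This is a minimal illustration only, not the code of check_fhandle_syscalls.c: the helper name print_file_handle() and its output layout are assumptions made for this example. It calls name_to_handle_at() with flags set to 0 on a name relative to an open directory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from check_fhandle_syscalls.c):
 * encode a file handle for `name`, looked up relative to directory
 * `dir`, and print it. Returns 0 on success or a negative errno value.
 */
static int print_file_handle(const char *dir, const char *name)
{
        struct file_handle *fh;
        int mount_id, dirfd, rc = 0;
        unsigned int i;

        assert(dir != NULL && name != NULL);

        dirfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
                return -errno;

        /* Reserve room for the largest handle the kernel may return. */
        fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        if (fh == NULL) {
                close(dirfd);
                return -ENOMEM;
        }
        fh->handle_bytes = MAX_HANDLE_SZ;

        if (name_to_handle_at(dirfd, name, fh, &mount_id, 0) < 0) {
                rc = -errno;    /* save errno before free()/close() */
        } else {
                printf("fh_bytes: %u\n", fh->handle_bytes);
                printf("fh_type: %d\n", fh->handle_type);
                printf("fh_data:");
                for (i = 0; i < fh->handle_bytes; i++)
                        printf(" %u", (unsigned int)fh->f_handle[i]);
                printf("\n");
        }

        free(fh);
        close(dirfd);
        return rc;
}
```

Calling print_file_handle("/mnt/lustre", "temp_file") would exercise the same kernel entry point, sys_name_to_handle_at, that appears in the stack trace below. Whether the call succeeds depends on the file system supporting handle export.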

The following is the stack trace of the panic:

crash> bt -l
PID: 2139 TASK: ffff880011495c40 CPU: 0 COMMAND: "check_fhandle_s"
#0 [ffff8800115cbc90] machine_kexec at ffffffff8103e9a5
/usr/src/debug/kernel-3.6.fc18/linux-3.6.10-4.fc18.x86_64/arch/x86/kernel/machine_kexec_64.c: 339
#1 [ffff8800115cbd00] crash_kexec at ffffffff810c4118
/usr/src/debug/kernel-3.6.fc18/linux-3.6.10-4.fc18.x86_64/kernel/kexec.c: 1100
#2 [ffff8800115cbdd0] panic at ffffffff816198e2
/usr/src/debug/kernel-3.6.fc18/linux-3.6.10-4.fc18.x86_64/arch/x86/include/asm/smp.h: 95
#3 [ffff8800115cbe50] lbug_with_loc at ffffffffa0418e5b [libcfs]
#4 [ffff8800115cbe90] ll_encode_fh at ffffffffa0961b75 [lustre]
#5 [ffff8800115cbed0] exportfs_encode_fh at ffffffff81264ce4
/usr/src/debug/kernel-3.6.fc18/linux-3.6.10-4.fc18.x86_64/fs/exportfs/expfs.c: 361
#6 [ffff8800115cbf10] sys_name_to_handle_at at ffffffff811e8f36
/usr/src/debug/kernel-3.6.fc18/linux-3.6.10-4.fc18.x86_64/fs/fhandle.c: 52
#7 [ffff8800115cbf80] system_call_fastpath at ffffffff8162bae9
/usr/src/debug/kernel-3.6.fc18/linux-3.6.10-4.fc18.x86_64/arch/x86/kernel/entry_64.S: 532
RIP: 00000030828f309a RSP: 00007fff2f0d94c8 RFLAGS: 00010202
RAX: 000000000000012f RBX: ffffffff8162bae9 RCX: 00007fff2f0d9578
RDX: 0000000000720010 RSI: 00007fff2f0da853 RDI: 0000000000000003
RBP: 00007fff2f0d95b0 R8: 0000000000000400 R9: 616e20676e696c6c
R10: 00007fff2f0d9578 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fff2f0d9690 R15: 00000000004008f0
ORIG_RAX: 000000000000012f CS: 0033 SS: 002b



 Comments   
Comment by Oleg Drokin [ 08/Aug/13 ]

The LBUG is raised because ll_inode2fid() asserts that the inode is not NULL, but it is somehow passed in as NULL to ll_encode_fh().
So we need to check for that and take appropriate action.

Comment by Jian Yu [ 24/Oct/13 ]

On the FC18 client node, I set panic_on_lbug=0 and captured the following lctl debug log:

00000080:00000001:2.0:1382612125.266070:0:21111:0:(llite_nfs.c:187:ll_encode_fh()) Process entered
00000080:00000040:2.0:1382612125.266071:0:21111:0:(llite_nfs.c:191:ll_encode_fh()) encoding for (144115205255725059,[0x200000400:0x3:0x0]) maxlen=32 minlen=32
00000080:00040000:2.0:1382612125.266073:0:21111:0:(llite_internal.h:1166:ll_inode2fid()) ASSERTION( inode != ((void *)0) ) failed:
00000080:00040000:2.0:1382612125.277251:0:21111:0:(llite_internal.h:1166:ll_inode2fid()) LBUG

In ll_encode_fh():

static int ll_encode_fh(struct inode *inode, __u32 *fh, int *plen,
                        struct inode *parent)
{
        //......
        CDEBUG(D_INFO, "encoding for (%lu,"DFID") maxlen=%d minlen=%d\n",
               inode->i_ino, PFID(ll_inode2fid(inode)), *plen,
               (int)sizeof(struct lustre_nfs_fid));

        //......
        nfs_fid->lnf_child = *ll_inode2fid(inode);
        nfs_fid->lnf_parent = *ll_inode2fid(parent); /* <-- parent was NULL, which caused the ASSERTION failure */
        //......
}

Need to find out why the "parent" passed from exportfs_encode_fh() to ll_encode_fh() was NULL.

Comment by Jian Yu [ 25/Oct/13 ]

In the Linux kernel 3.6.10-4 used by FC18, exportfs_encode_fh() is called from do_sys_name_to_handle() as follows:

static long do_sys_name_to_handle(struct path *path,
                                  struct file_handle __user *ufh,
                                  int __user *mnt_id)
{
        //......
        /* we ask for a non connected handle */
        retval = exportfs_encode_fh(path->dentry,
                                    (struct fid *)handle->f_handle,
                                    &handle_dwords,  0); /* <-- here, 0 was passed to exportfs_encode_fh() */
        //......
}

In exportfs_encode_fh(), the code is:

int exportfs_encode_fh(struct dentry *dentry, struct fid *fid, int *max_len,
                       int connectable)
{
        //......
        struct inode *inode = dentry->d_inode, *parent = NULL;

        if (connectable && !S_ISDIR(inode->i_mode)) { /* <-- here, connectable was 0 */
                p = dget_parent(dentry);
                //......
                parent = p->d_inode;
        }
        if (nop->encode_fh)
                error = nop->encode_fh(inode, fid->raw, max_len, parent); /* <-- here, parent was NULL */
        //......
}

So exportfs_encode_fh() ended up passing the "parent" parameter to ll_encode_fh() as NULL, because "connectable" was 0 and the parent lookup was skipped.
ll_encode_fh() should check the "parent" value before calling ll_inode2fid() on it. I'll upload a patch.
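
The call chain and the proposed guard can be modelled in user space as follows. This is a simplified sketch, not the patch itself: struct mock_inode, struct mock_dentry and both mock_* functions are hypothetical stand-ins, and encoding an all-zero parent FID when "parent" is NULL is an assumed policy chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace stand-ins for the Lustre FID types. */
struct lu_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

struct lustre_nfs_fid {
        struct lu_fid lnf_child;
        struct lu_fid lnf_parent;
};

/* Hypothetical: a real inode keeps its FID in Lustre-private data. */
struct mock_inode {
        struct lu_fid fid;
        int is_dir;
};

/* Hypothetical stand-in for struct dentry. */
struct mock_dentry {
        struct mock_inode *d_inode;
        struct mock_dentry *d_parent;
};

/* Models the guarded encode step: a NULL parent is tolerated. */
static void mock_encode_fh(const struct mock_inode *inode,
                           const struct mock_inode *parent,
                           struct lustre_nfs_fid *nfs_fid)
{
        assert(inode != NULL);  /* the child inode must always exist */

        nfs_fid->lnf_child = inode->fid;
        if (parent != NULL)
                nfs_fid->lnf_parent = parent->fid;
        else    /* assumed policy: encode an all-zero parent FID */
                memset(&nfs_fid->lnf_parent, 0,
                       sizeof(nfs_fid->lnf_parent));
}

/*
 * Models the caller: the parent is looked up only when a connectable
 * handle is requested for a non-directory.
 */
static void mock_exportfs_encode_fh(const struct mock_dentry *dentry,
                                    struct lustre_nfs_fid *nfs_fid,
                                    int connectable)
{
        const struct mock_inode *inode = dentry->d_inode;
        const struct mock_inode *parent = NULL;

        if (connectable && !inode->is_dir)
                parent = dentry->d_parent->d_inode;

        mock_encode_fh(inode, parent, nfs_fid);
}
```

With connectable set to 0, as in the name_to_handle_at() path, the model never dereferences the parent and leaves lnf_parent zeroed. With connectable set to 1 on a regular file, the parent FID is copied as before.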

Comment by Jian Yu [ 25/Oct/13 ]

Patch for master branch is in http://review.whamcloud.com/8072.

Comment by Dmitry Eremin (Inactive) [ 26/Nov/13 ]

This is a similar issue to LU-4231. I have provided another patch, http://review.whamcloud.com/8347, which more accurately distinguishes the cases where parent is NULL from those where it is not.

Comment by James A Simmons [ 03/Dec/13 ]

Just tested the http://review.whamcloud.com/8347 patch and I get this:

[root@spoon46 ~]# /usr/lib64/lustre/tests/check_fhandle_syscalls temp-file /lustre/barry/
fh_bytes: 32
fh_type: 151
fh_data: 0 4 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
check_fhandle_syscalls test Passed!

It appears to work correctly.
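
The 32 bytes of fh_data above can be read back as FIDs with a short sketch like the one below. It assumes that the handle payload is a struct lustre_nfs_fid (a child lu_fid followed by a parent lu_fid, 16 bytes each) stored in little-endian byte order, as it would be on an x86_64 client. The helper names read_le(), decode_fid() and format_fid() are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the Lustre FID type. */
struct lu_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* fh_data as printed in the comment above; the remaining bytes are 0. */
static const unsigned char fh_data[32] = {
        0, 4, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0,
};

/* Read a little-endian unsigned integer of `len` bytes (1 to 8). */
static uint64_t read_le(const unsigned char *p, int len)
{
        uint64_t v = 0;
        int i;

        assert(len >= 1 && len <= 8);
        for (i = len - 1; i >= 0; i--)
                v = (v << 8) | p[i];
        return v;
}

/* Decode one 16-byte FID: 8-byte seq, 4-byte oid, 4-byte ver. */
static struct lu_fid decode_fid(const unsigned char *p)
{
        struct lu_fid fid;

        fid.f_seq = read_le(p, 8);
        fid.f_oid = (uint32_t)read_le(p + 8, 4);
        fid.f_ver = (uint32_t)read_le(p + 12, 4);
        return fid;
}

/* Format a FID as [seq:oid:ver] in hex; returns snprintf()'s result. */
static int format_fid(char *buf, size_t len, struct lu_fid fid)
{
        return snprintf(buf, len, "[0x%" PRIx64 ":0x%" PRIx32 ":0x%" PRIx32 "]",
                        fid.f_seq, fid.f_oid, fid.f_ver);
}
```

Under these assumptions, decode_fid(fh_data) yields the child FID [0x200000400:0x1:0x0] and decode_fid(fh_data + 16) yields an all-zero parent FID, which is what one would expect for a non-connectable handle encoded without a parent.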

Comment by Jian Yu [ 04/Dec/13 ]

The patch in http://review.whamcloud.com/8347 resolves the failure. Let's close this ticket as a duplicate of LU-4231.

Generated at Sat Feb 10 01:36:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.