[LU-7613] racer crash on lustre nfs mount, kernel BUG at fs/namei.c:1669 Created: 28/Dec/15  Updated: 29/Nov/18  Resolved: 11/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Lokesh Nagappa Jaliminche (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Roughly once every few runs, racer crashes with the following logs:

<6>Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
<3>LustreError: 6426:0:(llite_nfs.c:307:ll_get_parent()) lustre: failure inode [0x200000400:0xee:0x0] get parent: rc = -2
<4>reconnect_path: npd != pd
<3>LustreError: 6424:0:(dir.c:429:ll_get_dir_page()) read cache page: [0x200000400:0x5d2:0x0] at 0: rc -2
<3>LustreError: 6424:0:(dir.c:597:ll_dir_read()) error reading dir [0x200000400:0x5d2:0x0] at 0: rc -2
<3>LustreError: 6431:0:(llite_nfs.c:307:ll_get_parent()) lustre: failure inode [0x200000400:0x5d2:0x0] get parent: rc = -2
<3>LustreError: 6431:0:(llite_nfs.c:307:ll_get_parent()) Skipped 2 previous similar messages
<4>------------[ cut here ]------------
<2>kernel BUG at fs/namei.c:1669!
<4>invalid opcode: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/dev
<4>CPU 2 
<4>Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 nfs fscache 8021q garp stp llc rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) ptp pps_core mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) compat(U) nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ext3 jbd uinput ppdev iTCO_wdt iTCO_vendor_support parport_pc parport microcode sg serio_raw i2c_i801 lpc_ich mfd_core r8169 mii snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 jbd2 mbcache sd_mod crc_t10dif ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 6424, comm: nfsd Not tainted 2.6.32-431.17.1.x2.0.47.x86_64 #1                  /D525MWV
<4>RIP: 0010:[<ffffffff81197a24>]  [<ffffffff81197a24>] may_delete+0x134/0x190
<4>RSP: 0018:ffff8800be16bc30  EFLAGS: 00010283
<4>RAX: ffff8800375b5c00 RBX: ffff88009b347180 RCX: ffff88009b26f3c0
<4>RDX: 0000000000000000 RSI: ffff88009b347180 RDI: ffff880104beeb38
<4>RBP: ffff8800be16bc50 R08: ffff88003753a980 R09: ffff88003753a980
<4>R10: ffff880104beeb38 R11: ffff880104beeb38 R12: ffff880104beeb38
<4>R13: 0000000000000000 R14: 0000000000000000 R15: ffff880104beeb38
<4>FS:  0000000000000000(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000003a50f5a04c CR3: 00000000964e7000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process nfsd (pid: 6424, threadinfo ffff8800be16a000, task ffff880102fe2080)
<4>Stack:
<4> 0000000000000000 ffff88009b347180 ffff88009b26f3c0 0000000000000000
<4><d> ffff8800be16bcd0 ffffffff81197cbc ffff8800be16bc70 ffff88003753a980
<4><d> ffff8800a45d80ba ffff8800bfb4b040 ffff8800a45d80b8 00000000ffffffea
<4>Call Trace:
<4> [<ffffffff81197cbc>] vfs_rename+0x5c/0x480
<4> [<ffffffffa03d0aca>] nfsd_rename+0x47a/0x4d0 [nfsd]
<4> [<ffffffffa03dd585>] nfsd4_rename+0x75/0x220 [nfsd]
<4> [<ffffffffa03df435>] ? nfsd4_encode_operation+0x75/0x180 [nfsd]
<4> [<ffffffffa03dd458>] nfsd4_proc_compound+0x3d8/0x490 [nfsd]
<4> [<ffffffffa03ca425>] nfsd_dispatch+0xe5/0x230 [nfsd]
<4> [<ffffffffa035a844>] svc_process_common+0x344/0x640 [sunrpc]
<4> [<ffffffff81061dc0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa035ae80>] svc_process+0x110/0x160 [sunrpc]
<4> [<ffffffffa03cab52>] nfsd+0xc2/0x160 [nfsd]
<4> [<ffffffffa03caa90>] ? nfsd+0x0/0x160 [nfsd]
<4> [<ffffffff8109ac66>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109abd0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

The relevant kernel code is:

static int may_delete(struct inode *dir,struct dentry *victim,int isdir)
{
        int error;

        if (!victim->d_inode)
                return -ENOENT;

        BUG_ON(victim->d_parent->d_inode != dir);
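The invariant checked by the BUG_ON above can be modeled in userspace roughly as follows. This is only a sketch: struct md_inode, struct md_dentry, model_may_delete and MODEL_WOULD_BUG are hypothetical names invented for illustration, and the function returns a marker value where the real kernel would stop with the BUG seen in the log.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct inode and struct dentry. */
struct md_inode {
        int ino;
};

struct md_dentry {
        const struct md_dentry *d_parent;
        const struct md_inode *d_inode;
};

/* Marker returned where the kernel would hit BUG_ON(); not a kernel value. */
#define MODEL_WOULD_BUG 1

/*
 * Mirrors only the first two checks of may_delete() quoted above:
 * a negative dentry yields -ENOENT, and a victim whose parent dentry
 * does not point at 'dir' violates the invariant.
 */
static int model_may_delete(const struct md_inode *dir,
                            const struct md_dentry *victim)
{
        assert(dir != NULL && victim != NULL && victim->d_parent != NULL);

        if (!victim->d_inode)
                return -ENOENT;

        if (victim->d_parent->d_inode != dir)
                return MODEL_WOULD_BUG;

        return 0;
}
```

In this model, a victim dentry that has been attached under a different parent than the directory passed in by the caller yields MODEL_WOULD_BUG, which corresponds to the state nfsd_rename reached in the trace.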


 Comments   
Comment by Lokesh Nagappa Jaliminche (Inactive) [ 28/Dec/15 ]

Recreation steps:
===============
1. cat /etc/exports
/mnt/lustre *(crossmnt,rw,no_root_squash,async,no_subtree_check,insecure)
2. mkdir /mnt/nfs_client
3. /etc/init.d/nfs stop
4. bash llmount.sh
5. /etc/init.d/nfs start
6. mount -t nfs server:/mnt/lustre /mnt/nfs_client
7. cd racer
8. bash racer.sh /mnt/nfs_client/
9. umount /mnt/nfs_client
10. /etc/init.d/nfs stop
11. bash llmountcleanup.sh

Comment by Gerrit Updater [ 28/Dec/15 ]

lokesh.jaliminche (lokesh.jaliminche@seagate.com) uploaded a new patch: http://review.whamcloud.com/17732
Subject: LU-7613 dcache: changes made to ll_splice_inode to avoid dcache corruption.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3e7901e9b654989502c951fb859e505ce4a4a8cb

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 10/Feb/16 ]

ll_find_alias is responsible for finding an existing alias of an inode that can be reused. Directories are assumed to have a unique alias, whereas non-directories can have multiple aliases.
In Lustre there are two types of alias: discon_alias and invalid_alias. An invalid_alias is an alias that satisfies these conditions:

 else if (alias->d_parent == dentry->d_parent             &&
                         alias->d_name.hash == dentry->d_name.hash       &&
                         alias->d_name.len == dentry->d_name.len         &&
                         memcmp(alias->d_name.name, dentry->d_name.name,
                                dentry->d_name.len) == 0)

Using a discon_alias for a non-directory may corrupt the dcache and lead to a kernel crash. The patch avoids using a discon_alias for non-directories.
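The selection rule described in this comment can be sketched as a small userspace model. This is not the actual patch: struct model_dentry, same_parent_and_name and model_find_alias are hypothetical names, the "disconnected" flag is an assumed stand-in for how a discon_alias is recognized, and the d_name.hash comparison and locking are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct dentry. */
struct model_dentry {
        const struct model_dentry *d_parent;
        const char *name;
        size_t len;
        bool disconnected;      /* assumed marker for a discon_alias */
};

/* The invalid_alias test from the comment: same parent and same name. */
static bool same_parent_and_name(const struct model_dentry *alias,
                                 const struct model_dentry *dentry)
{
        return alias->d_parent == dentry->d_parent &&
               alias->len == dentry->len &&
               memcmp(alias->name, dentry->name, dentry->len) == 0;
}

/*
 * Pick a reusable alias for 'dentry' among aliases[0..n).
 * An invalid_alias is preferred and returned as soon as it is found.
 * A discon_alias is remembered only when the inode is a directory,
 * so a non-directory never reuses a disconnected alias.
 */
static const struct model_dentry *
model_find_alias(const struct model_dentry *const *aliases, size_t n,
                 const struct model_dentry *dentry, bool inode_is_dir)
{
        const struct model_dentry *discon_alias = NULL;

        assert(dentry != NULL);

        for (size_t i = 0; i < n; i++) {
                const struct model_dentry *alias = aliases[i];

                if (alias->disconnected && inode_is_dir)
                        discon_alias = alias;
                else if (same_parent_and_name(alias, dentry))
                        return alias;
        }

        return discon_alias;    /* NULL when nothing is reusable */
}
```

With only a disconnected alias available, the model returns it for a directory and NULL for a regular file, in which case the caller would keep the new dentry rather than splice in an alias attached elsewhere.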

Comment by Gerrit Updater [ 11/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17732/
Subject: LU-7613 llite: changes to avoid cache corruption
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf6efbdb726ceae10a9f3c770bc7af9d15571a80

Comment by Peter Jones [ 11/Aug/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:10:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.