[LU-10541] parallel-scale-nfsv3 test racer_on_nfs hangs with BUG: unable to handle kernel NULL pointer dereference Created: 21/Jan/18  Updated: 03/May/18  Resolved: 09/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Fix Version/s: Lustre 2.12.0, Lustre 2.10.4

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None
Environment:

SLES12 SP3 server/client
and SLES12 SP2 server/client


Issue Links:
Related
is related to LU-9679 Prepare lustre for adoption into the ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv3 test_racer_on_nfs hangs for SLES12 SP3 and SLES12 SP2 server/client only.

From the MDS console:

[76114.715893] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 06:10:32 \(1516371032\)
[76114.743657] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 06:10:32 (1516371032)
[76114.921643] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[76114.922573] IP: [<ffffffffa13ec095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
[76114.923403] PGD 0 
[76114.923653] Oops: 0000 [#1] SMP 
[76114.924045] Modules linked in: nfsd(E) nfs_acl(E) lustre(OEN) lmv(OEN) mdc(OEN) osc(OEN) lov(OEN) osp(OEN) mdd(OEN) lod(OEN) mdt(OEN) lfsck(OEN) mgs(OEN) mgc(OEN) osd_ldiskfs(OEN) ldiskfs(OEN) lquota(OEN) fid(OEN) fld(OEN) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) loop(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rpcrdma(E) sunrpc(E) ib_isert(E) iscsi_target_mod(E) ib_iser(E) libiscsi(E) scsi_transport_iscsi(E) ib_srpt(E) target_core_mod(E) ib_srp(E) scsi_transport_srp(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) configfs(E) ib_cm(E) iw_cm(E) ib_core(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) jitterentropy_rng(E) drbg(E) ansi_cprng(E) ppdev(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8139too(E) joydev(E) 8139cp(E) pcspkr(E) virtio_balloon(E) i2c_piix4(E) mii(E) parport_pc(E) parport(E) pvpanic(E) button(E) processor(E) ata_generic(E) ext4(E) crc16(E) jbd2(E) mbcache(E) ata_piix(E) ahci(E) libahci(E) virtio_blk(E) floppy(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) usb_common(E) libata(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E) [last unloaded: lnet_selftest]
[76114.938788] Supported: No, Unsupported modules are loaded
[76114.939342] CPU: 0 PID: 28809 Comm: nfsd Tainted: G           OE   N  4.4.103-6.33_lustre-default #1
[76114.940280] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[76114.940875] task: ffff88007c3254c0 ti: ffff88002c48c000 task.ti: ffff88002c48c000
[76114.941635] RIP: 0010:[<ffffffffa13ec095>]  [<ffffffffa13ec095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
[76114.942668] RSP: 0018:ffff88002c48fd90  EFLAGS: 00010246
[76114.943220] RAX: ffffffffa14094c0 RBX: 0000000000008000 RCX: 0000000000000000
[76114.943945] RDX: ffffffffa140b05b RSI: 0000000000000000 RDI: ffffffffa14094c0
[76114.944670] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[76114.945399] R10: 0000000000000000 R11: ffff88006a9320fb R12: ffff880065cfce50
[76114.946129] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffa140b05b
[76114.946859] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[76114.947677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[76114.948266] CR2: 0000000000000030 CR3: 000000007a2e6000 CR4: 00000000000406f0
[76114.948994] Stack:
[76114.949215]  ffffffffa13b499e 0000000000000000 ffff880065d1f000 ffff88002c48fe00
[76114.950059]  ffff88003b8c0000 ffff88003b8c0008 ffff880065cfce50 ffff880078199900
[76114.950904]  ffffffffa1479d8f ffff8800442dc000 ffff8800442dc000 ffff880044086018
[76114.951816] Call Trace:
[76114.952142]  [<ffffffffa13b499e>] ll_set_acl+0xee/0x370 [lustre]
[76114.952790]  [<ffffffffa1479d8f>] nfsd3_proc_setacl+0x19f/0x260 [nfsd]
[76114.953482]  [<ffffffffa1469e23>] nfsd_dispatch+0xc3/0x260 [nfsd]
[76114.954139]  [<ffffffffa05811f8>] svc_process_common+0x418/0x6a0 [sunrpc]
[76114.954843]  [<ffffffffa058157d>] svc_process+0xfd/0x1b0 [sunrpc]
[76114.955483]  [<ffffffffa146989a>] nfsd+0xea/0x160 [nfsd]
[76114.956044]  [<ffffffff8109bb57>] kthread+0xc7/0xe0
[76114.956552]  [<ffffffff8160c67f>] ret_from_fork+0x3f/0x70
[76114.958488] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[76114.959116] 
[76114.959290] Leftover inexact backtrace:
[76114.959290] 
[76114.959848]  [<ffffffff8109ba90>] ? kthread_park+0x50/0x50
[76114.960412] Code: c0 c7 05 63 3e 04 00 00 00 04 00 e8 96 d0 3b ff 48 c7 c7 c0 fe 42 a1 e8 7a 26 3b ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> 8b 76 30 e9 32 ed ff ff 66 90 66 66 66 66 90 41 55 41 54 55 
[76114.963792] RIP  [<ffffffffa13ec095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
[76114.964576]  RSP <ffff88002c48fd90>
[76114.964949] CR2: 0000000000000030

In a different test session, https://testing.hpdd.intel.com/test_sets/bba5459c-fc9f-11e7-a10a-52540065bddc , we see the same hang, but there is a little more output in the MDS console log:

[32570.101935] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 10:52:35 (1516301555)
[32570.288540] LustreError: 18012:0:(namei.c:87:ll_set_inode()) Can not initialize inode [0x240000404:0x1:0x0] without object type: valid = 0x100000001
[32570.288547] LustreError: 18012:0:(llite_lib.c:2354:ll_prep_inode()) new_inode -fatal: rc -12
[32570.402366] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[32570.402366] IP: [<ffffffffa1461095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]

I can confirm that this test started hanging with this NULL pointer dereference on at least 2018-01-16. This test hangs so often that I can’t review all test hangs to see exactly when this started happening.

Logs for some of the failures are at
https://testing.hpdd.intel.com/test_sets/ebd266f6-fac8-11e7-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/9dfe1932-fd36-11e7-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/ce6de31a-fe15-11e7-a10a-52540065bddc



 Comments   
Comment by Saurabh Tandan (Inactive) [ 30/Jan/18 ]

Hit the same error mentioned above for "RHEL 7.4 Server/DNE/ldiskfs RHEL 7.4 Client" on 2.10.57 for racer test_1
https://testing.hpdd.intel.com/test_sets/463269c4-fd9a-11e7-bd00-52540065bddc

[55555.700895] LustreError: 7868:0:(namei.c:87:ll_set_inode()) Can not initialize inode [0x200000407:0x12:0x0] without object type: valid = 0x100000001
[55555.702444] LustreError: 7868:0:(llite_lib.c:2354:ll_prep_inode()) new_inode -fatal: rc -12
[55563.234885] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x280000405:0x50:0x0], use llapi_layout_get_by_path()
Comment by Minh Diep [ 08/Feb/18 ]

+1 on b2_10
https://testing.hpdd.intel.com/test_sets/472f2886-0b90-11e8-bd00-52540065bddc

Comment by Peter Jones [ 08/Feb/18 ]

Hongchao

Could you please investigate?

Thanks

Peter

Comment by John Hammond [ 08/Feb/18 ]

It seems likely that this is caused by the following. In ll_set_acl() we do

        /* dentry is only used for *.lov attributes so it's safe to be NULL */
        rc = __vfs_setxattr(NULL, inode, name, value, size, XATTR_CREATE);

And we have

static int ll_xattr_set_common_4_3(const struct xattr_handler *handler,
                                   struct dentry *dentry, const char *name,
                                   const void *value, size_t size, int flags)
{
        return ll_xattr_set_common(handler, dentry, dentry->d_inode, name,
                                   value, size, flags);
}
Comment by Gerrit Updater [ 02/Mar/18 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/31486
Subject: LU-10541 llite: use faked dentry in ll_set_acl
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a20273abded0a7b6add3b1c50ca61c0938abd71f

Comment by Gerrit Updater [ 08/Mar/18 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31588
Subject: LU-10541 llite: setxattr directly in ll_set_acl
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 45939caa6af84d80abebbeb33525c4f179774c53

Comment by Gerrit Updater [ 09/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31588/
Subject: LU-10541 llite: setxattr directly in ll_set_acl
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 682939ffb60e9afff9a7d61cc1112611b2197b78

Comment by Peter Jones [ 09/Apr/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 11/Apr/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31966
Subject: LU-10541 llite: setxattr directly in ll_set_acl
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: b95bcc9340522a04bacd62332423567040d2f553

Comment by Gerrit Updater [ 03/May/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31966/
Subject: LU-10541 llite: setxattr directly in ll_set_acl
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 51774da37e27b696e506f76c82a3ae2507df1dc2

Generated at Sat Feb 10 02:36:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.