Lustre / LU-10541

parallel-scale-nfsv3 test racer_on_nfs hangs with BUG: unable to handle kernel NULL pointer dereference

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Labels: None
    • Environment: SLES12 SP3 server/client
      and SLES12 SP2 server/client
    • Severity: 3

    Description

      parallel-scale-nfsv3 test_racer_on_nfs hangs only on SLES12 SP3 and SLES12 SP2 server/client configurations.

      From the MDS console:

      [76114.715893] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 06:10:32 \(1516371032\)
      [76114.743657] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 06:10:32 (1516371032)
      [76114.921643] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [76114.922573] IP: [<ffffffffa13ec095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
      [76114.923403] PGD 0 
      [76114.923653] Oops: 0000 [#1] SMP 
      [76114.924045] Modules linked in: nfsd(E) nfs_acl(E) lustre(OEN) lmv(OEN) mdc(OEN) osc(OEN) lov(OEN) osp(OEN) mdd(OEN) lod(OEN) mdt(OEN) lfsck(OEN) mgs(OEN) mgc(OEN) osd_ldiskfs(OEN) ldiskfs(OEN) lquota(OEN) fid(OEN) fld(OEN) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) loop(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rpcrdma(E) sunrpc(E) ib_isert(E) iscsi_target_mod(E) ib_iser(E) libiscsi(E) scsi_transport_iscsi(E) ib_srpt(E) target_core_mod(E) ib_srp(E) scsi_transport_srp(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) configfs(E) ib_cm(E) iw_cm(E) ib_core(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) jitterentropy_rng(E) drbg(E) ansi_cprng(E) ppdev(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8139too(E) joydev(E) 8139cp(E) pcspkr(E) virtio_balloon(E) i2c_piix4(E) mii(E) parport_pc(E) parport(E) pvpanic(E) button(E) processor(E) ata_generic(E) ext4(E) crc16(E) jbd2(E) mbcache(E) ata_piix(E) ahci(E) libahci(E) virtio_blk(E) floppy(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) usb_common(E) libata(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E) [last unloaded: lnet_selftest]
      [76114.938788] Supported: No, Unsupported modules are loaded
      [76114.939342] CPU: 0 PID: 28809 Comm: nfsd Tainted: G           OE   N  4.4.103-6.33_lustre-default #1
      [76114.940280] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [76114.940875] task: ffff88007c3254c0 ti: ffff88002c48c000 task.ti: ffff88002c48c000
      [76114.941635] RIP: 0010:[<ffffffffa13ec095>]  [<ffffffffa13ec095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
      [76114.942668] RSP: 0018:ffff88002c48fd90  EFLAGS: 00010246
      [76114.943220] RAX: ffffffffa14094c0 RBX: 0000000000008000 RCX: 0000000000000000
      [76114.943945] RDX: ffffffffa140b05b RSI: 0000000000000000 RDI: ffffffffa14094c0
      [76114.944670] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [76114.945399] R10: 0000000000000000 R11: ffff88006a9320fb R12: ffff880065cfce50
      [76114.946129] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffa140b05b
      [76114.946859] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [76114.947677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [76114.948266] CR2: 0000000000000030 CR3: 000000007a2e6000 CR4: 00000000000406f0
      [76114.948994] Stack:
      [76114.949215]  ffffffffa13b499e 0000000000000000 ffff880065d1f000 ffff88002c48fe00
      [76114.950059]  ffff88003b8c0000 ffff88003b8c0008 ffff880065cfce50 ffff880078199900
      [76114.950904]  ffffffffa1479d8f ffff8800442dc000 ffff8800442dc000 ffff880044086018
      [76114.951816] Call Trace:
      [76114.952142]  [<ffffffffa13b499e>] ll_set_acl+0xee/0x370 [lustre]
      [76114.952790]  [<ffffffffa1479d8f>] nfsd3_proc_setacl+0x19f/0x260 [nfsd]
      [76114.953482]  [<ffffffffa1469e23>] nfsd_dispatch+0xc3/0x260 [nfsd]
      [76114.954139]  [<ffffffffa05811f8>] svc_process_common+0x418/0x6a0 [sunrpc]
      [76114.954843]  [<ffffffffa058157d>] svc_process+0xfd/0x1b0 [sunrpc]
      [76114.955483]  [<ffffffffa146989a>] nfsd+0xea/0x160 [nfsd]
      [76114.956044]  [<ffffffff8109bb57>] kthread+0xc7/0xe0
      [76114.956552]  [<ffffffff8160c67f>] ret_from_fork+0x3f/0x70
      [76114.958488] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
      [76114.959116] 
      [76114.959290] Leftover inexact backtrace:
      [76114.959290] 
      [76114.959848]  [<ffffffff8109ba90>] ? kthread_park+0x50/0x50
      [76114.960412] Code: c0 c7 05 63 3e 04 00 00 00 04 00 e8 96 d0 3b ff 48 c7 c7 c0 fe 42 a1 e8 7a 26 3b ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> 8b 76 30 e9 32 ed ff ff 66 90 66 66 66 66 90 41 55 41 54 55 
      [76114.963792] RIP  [<ffffffffa13ec095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
      [76114.964576]  RSP <ffff88002c48fd90>
      [76114.964949] CR2: 0000000000000030
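
The faulting address can be cross-checked against the register dump by hand-decoding the instruction at the <48> marker in the Code: line. The sketch below is only an illustration: it decodes just the one x86-64 form seen here (REX.W 8B /r with an 8-bit displacement), and the helper names parse_registers, faulting_bytes and decode_mov_load are made up for this example rather than taken from any tool.

```python
import re

# Register lines and the tail of the "Code:" line, copied from the oops above.
# The bytes before "66 66 66 66 90" are left out; only the <..> marker matters.
REGISTERS = """
RAX: ffffffffa14094c0 RBX: 0000000000008000 RCX: 0000000000000000
RDX: ffffffffa140b05b RSI: 0000000000000000 RDI: ffffffffa14094c0
CR2: 0000000000000030 CR3: 000000007a2e6000 CR4: 00000000000406f0
"""
CODE = "66 66 66 66 90 <48> 8b 76 30 e9 32 ed ff ff 66 90"

# ModRM register numbering for the eight legacy 64-bit registers.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def parse_registers(text):
    """Map lower-cased register names to their integer values."""
    pairs = re.findall(r"\b([A-Z0-9]{3}): ([0-9a-f]{16})\b", text)
    return {name.lower(): int(value, 16) for name, value in pairs}


def faulting_bytes(code_line):
    """Return the bytes starting at the <xx> marker, i.e. at the faulting RIP."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_mov_load(code):
    """Decode only 'REX.W 8B /r' with mod=01 (base register plus disp8)."""
    rex, opcode, modrm, disp8 = code[:4]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not the 64-bit register load handled by this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:  # rm == 4 would require a SIB byte
        raise ValueError("addressing mode not handled by this sketch")
    disp = disp8 - 256 if disp8 >= 128 else disp8  # sign-extend disp8
    return REGS[reg], REGS[rm], disp


regs = parse_registers(REGISTERS)
dst, base, disp = decode_mov_load(faulting_bytes(CODE))
address = (regs[base] + disp) & 0xFFFFFFFFFFFFFFFF
print(f"mov {disp:#x}(%{base}),%{dst} reads {address:#x}; CR2 = {regs['cr2']:#x}")
```

The decoded load is mov 0x30(%rsi),%rsi with RSI = 0, which matches CR2 = 0x30. Under the x86-64 System V calling convention RSI carries the second integer argument, so the oops is consistent with ll_set_acl() calling ll_xattr_set_common_4_3() with a NULL pointer as its second argument. Which parameter that is in the Lustre source is not shown in this ticket.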
      

      In a different test session (https://testing.hpdd.intel.com/test_sets/bba5459c-fc9f-11e7-a10a-52540065bddc) we see the same hang, but the MDS console log shows two LustreError messages just before the oops:

      [32570.101935] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 10:52:35 (1516301555)
      [32570.288540] LustreError: 18012:0:(namei.c:87:ll_set_inode()) Can not initialize inode [0x240000404:0x1:0x0] without object type: valid = 0x100000001
      [32570.288547] LustreError: 18012:0:(llite_lib.c:2354:ll_prep_inode()) new_inode -fatal: rc -12
      [32570.402366] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [32570.402366] IP: [<ffffffffa1461095>] ll_xattr_set_common_4_3+0x5/0x10 [lustre]
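
The two LustreError lines carry a FID in [sequence:object id:version] form and a negative errno-style return code. As a rough aid for comparing such lines across test sessions, the sketch below pulls both out and names the errno. The helpers parse_fid and parse_rc are hypothetical, written only for this example.

```python
import errno
import re

# The two LustreError lines from the console log above, without timestamps.
LINES = [
    "LustreError: 18012:0:(namei.c:87:ll_set_inode()) Can not initialize inode "
    "[0x240000404:0x1:0x0] without object type: valid = 0x100000001",
    "LustreError: 18012:0:(llite_lib.c:2354:ll_prep_inode()) new_inode -fatal: rc -12",
]

FID_RE = re.compile(r"\[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\]")
RC_RE = re.compile(r"\brc (-?\d+)")


def parse_fid(line):
    """Return (sequence, object id, version) as integers, or None if absent."""
    match = FID_RE.search(line)
    return tuple(int(part, 16) for part in match.groups()) if match else None


def parse_rc(line):
    """Return (rc, errno name) for a trailing 'rc N' field, or None if absent."""
    match = RC_RE.search(line)
    if not match:
        return None
    rc = int(match.group(1))
    return rc, errno.errorcode.get(-rc, "unknown")


fid = parse_fid(LINES[0])
rc = parse_rc(LINES[1])
print("FID seq=%#x oid=%#x ver=%#x" % fid)
print("rc=%d (%s)" % rc)
```

Here rc -12 maps to ENOMEM. The name only identifies the code returned by ll_prep_inode(); on its own it does not show that the node was actually out of memory.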
      

      I can confirm that this test has been hanging with this NULL pointer dereference since at least 2018-01-16. The test hangs so often that I can’t review every hang to determine exactly when this started.
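
When lining up hangs from different test sessions, the epoch value in parentheses at the end of each DEBUG MARKER line is easier to compare than the local wall-clock time. A minimal sketch, assuming the marker keeps the "HH:MM:SS (epoch)" tail seen in the two logs above; marker_time is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# DEBUG MARKER lines copied from the two MDS console logs above.
MARKERS = [
    "[76114.743657] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: "
    "racer on NFS client ======================================= 06:10:32 (1516371032)",
    "[32570.101935] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: "
    "racer on NFS client ======================================= 10:52:35 (1516301555)",
]

TAIL_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) \((\d+)\)\s*$")


def marker_time(line):
    """Return (local time string, timezone-aware UTC datetime) for a marker line."""
    match = TAIL_RE.search(line)
    if not match:
        raise ValueError("no 'HH:MM:SS (epoch)' tail found")
    local, epoch = match.group(1), int(match.group(2))
    return local, datetime.fromtimestamp(epoch, tz=timezone.utc)


for line in MARKERS:
    local, utc = marker_time(line)
    print(f"local {local} -> {utc.isoformat()}")
```

Both markers convert to dates after 2018-01-16 (2018-01-19 and 2018-01-18 UTC), which agrees with the start date given above.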

      Logs for some of the failures are at
      https://testing.hpdd.intel.com/test_sets/ebd266f6-fac8-11e7-bd00-52540065bddc
      https://testing.hpdd.intel.com/test_sets/9dfe1932-fd36-11e7-a6ad-52540065bddc
      https://testing.hpdd.intel.com/test_sets/ce6de31a-fe15-11e7-a10a-52540065bddc

            People

              Assignee: jhammond John Hammond
              Reporter: jamesanunez James Nunez (Inactive)