Lustre / LU-11718

parallel-scale-nfsv3 test racer_on_nfs crashes with ‘BUG: unable to handle kernel paging request’

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
    • Environment: RHEL 7.6 servers and RHEL 6.10 clients
    • 3
    • 9223372036854775807

    Description

      parallel-scale-nfsv3 test_racer_on_nfs crashes. Looking at the logs at https://testing.whamcloud.com/test_sets/0093481e-ef54-11e8-815b-52540065bddc, we see the following in the kernel-crash log:

      [112389.058995] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client == 17:20:35 \(1542993635\)
      [112389.244554] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client == 17:20:35 (1542993635)
      [112392.193506] BUG: unable to handle kernel paging request at ffffffc0acacf0ff
      [112392.195074] IP: [<ffffffc0acacf0ff>] 0xffffffc0acacf0ff
      [112392.196184] PGD 52a14067 PUD 0 
      [112392.196979] Oops: 0010 [#1] SMP 
      [112392.197736] Modules linked in: nfsd nfs_acl osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon parport_pc parport i2c_piix4 ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul
      [112392.214016]  crct10dif_common crc32c_intel serio_raw floppy 8139too ata_piix libata virtio_pci virtio_ring 8139cp mii virtio [last unloaded: lnet_selftest]
      [112392.216970] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
      [112392.219149] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [112392.220252] task: ffffffffb0618480 ti: ffffffffb0600000 task.ti: ffffffffb0600000
      [112392.221596] RIP: 0010:[<ffffffc0acacf0ff>]  [<ffffffc0acacf0ff>] 0xffffffc0acacf0ff
      [112392.222878] RSP: 0018:ffff8c913fc03eb8  EFLAGS: 00010286
      [112392.223743] RAX: ffffffc0acacf0ff RBX: ffffffffb06784c0 RCX: 0000000009fdbc69
      [112392.224907] RDX: ffff8c911d664827 RSI: fffff79281759900 RDI: ffff8c911d664827
      [112392.226060] RBP: ffff8c913fc03f10 R08: 000000000001f0a0 R09: ffffffffafb5498c
      [112392.227239] R10: ffff8c913fc1f0a0 R11: fffff79281e4a1c0 R12: 000000000000000a
      [112392.228383] R13: 0000000000000013 R14: ff8c9113342e28ff R15: ffff8c913fc162c0
      [112392.229533] FS:  0000000000000000(0000) GS:ffff8c913fc00000(0000) knlGS:0000000000000000
      [112392.230832] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [112392.231763] CR2: ffffffc0acacf0ff CR3: 0000000077d72000 CR4: 00000000000606f0
      [112392.232922] Call Trace:
      [112392.233358]  <IRQ> 
      [112392.233740]  [<ffffffffafb54940>] ? rcu_process_callbacks+0x1e0/0x580
      [112392.234872]  [<ffffffffafaa0f05>] __do_softirq+0xf5/0x280
      [112392.235779]  [<ffffffffb017832c>] call_softirq+0x1c/0x30
      [112392.236658]  [<ffffffffafa2e675>] do_softirq+0x65/0xa0
      [112392.237526]  [<ffffffffafaa1285>] irq_exit+0x105/0x110
      [112392.238384]  [<ffffffffb01796c8>] smp_apic_timer_interrupt+0x48/0x60
      [112392.239423]  [<ffffffffb0175df2>] apic_timer_interrupt+0x162/0x170
      [112392.240426]  <EOI> 
      [112392.240783]  [<ffffffffafadafb0>] ? switched_to_idle+0x10/0x10
      [112392.241802]  [<ffffffffb0169a20>] ? __cpuidle_text_start+0x8/0x8
      [112392.242778]  [<ffffffffb0169c26>] ? native_safe_halt+0x6/0x10
      [112392.243706]  [<ffffffffb0169a3e>] default_idle+0x1e/0xc0
      [112392.244588]  [<ffffffffafa366f0>] arch_cpu_idle+0x20/0xc0
      [112392.245487]  [<ffffffffafafc3ba>] cpu_startup_entry+0x14a/0x1e0
      [112392.246463]  [<ffffffffb014feb7>] rest_init+0x77/0x80
      [112392.247318]  [<ffffffffb07861c6>] start_kernel+0x44b/0x46c
      [112392.248218]  [<ffffffffb0785b7b>] ? repair_env_string+0x5c/0x5c
      [112392.249183]  [<ffffffffb0785120>] ? early_idt_handler_array+0x120/0x120
      [112392.250250]  [<ffffffffb078572f>] x86_64_start_reservations+0x24/0x26
      [112392.251295]  [<ffffffffb0785885>] x86_64_start_kernel+0x154/0x177
      [112392.252324]  [<ffffffffafa000d5>] start_cpu+0x5/0x14
      [112392.253143] Code:  Bad RIP value.
      [112392.253787] RIP  [<ffffffc0acacf0ff>] 0xffffffc0acacf0ff
      [112392.254730]  RSP <ffff8c913fc03eb8>
      [112392.255323] CR2: ffffffc0acacf0ff
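
      As a reading aid for the trace above, here is a minimal sketch that decodes the "Oops:" error code using the architectural x86 page-fault bits and checks whether RIP equals the faulting address in CR2. The function name decode_oops and the SAMPLE string are illustrative assumptions, not part of any Lustre or kernel tool; the sample lines are copied from the trace with the timestamps dropped.

```python
import re

# x86 page-fault error-code bits: (meaning when set, meaning when clear).
PF_BITS = {
    0: ("protection violation", "page not present"),
    1: ("write", "not a write"),
    2: ("user mode", "kernel mode"),
    3: ("reserved bit set in page table", None),
    4: ("instruction fetch", None),
}

def decode_oops(text):
    """Extract the error code, RIP and CR2 from an oops excerpt (hypothetical helper)."""
    code = int(re.search(r"Oops: ([0-9a-fA-F]{4})", text).group(1), 16)
    rip = int(re.search(r"RIP: [0-9a-f]{4}:\[<([0-9a-f]{16})>\]", text).group(1), 16)
    cr2 = int(re.search(r"CR2: ([0-9a-f]{16})", text).group(1), 16)
    flags = []
    for bit, (if_set, if_clear) in PF_BITS.items():
        label = if_set if code & (1 << bit) else if_clear
        if label:
            flags.append(label)
    return {
        "error_code": code,
        "rip": rip,
        "cr2": cr2,
        "flags": flags,
        # True when the CPU faulted while fetching the instruction at RIP itself.
        "rip_is_fault_addr": rip == cr2,
    }

# Lines taken from the trace above, timestamps removed.
SAMPLE = """\
Oops: 0010 [#1] SMP
RIP: 0010:[<ffffffc0acacf0ff>]  [<ffffffc0acacf0ff>] 0xffffffc0acacf0ff
CR2: ffffffc0acacf0ff CR3: 0000000077d72000 CR4: 00000000000606f0
"""

if __name__ == "__main__":
    for key, value in decode_oops(SAMPLE).items():
        print(key, value)
```

      For this trace the decoded flags include "instruction fetch" and RIP matches CR2, which agrees with the "Code:  Bad RIP value." line: the CPU tried to execute at an unmapped address rather than read or write data there.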
      

      The last thing seen in the client test_log is

      == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client == 17:20:35 (1542993635)
      Running /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds. CTRL-C to exit
      

      Unfortunately, there’s not much else in the console logs.

      There are similar crashes with similar call traces, but they show ll_dir_get_parent_fid() errors before the crash (https://testing.whamcloud.com/test_sets/5a91cd96-e21f-11e8-b67f-52540065bddc). So, it’s not clear whether this is the same issue or not.
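
      One way to sort crash logs into the two patterns described above is to check whether an ll_dir_get_parent_fid message appears before the first BUG line. The sketch below assumes plain-text console logs read line by line; the helper name marker_precedes_bug and the file name in the usage note are hypothetical.

```python
def marker_precedes_bug(lines, marker="ll_dir_get_parent_fid",
                        bug="BUG: unable to handle kernel paging request"):
    """Return True if `marker` occurs on a line before the first `bug` line.

    Returns False if the first `bug` line comes with no earlier `marker`,
    and None if the log contains no `bug` line at all.
    """
    seen_marker = False
    for line in lines:
        if bug in line:
            return seen_marker
        if marker in line:
            seen_marker = True
    return None
```

      A possible use, with console.log standing in for a downloaded console log: open the file and pass the file object to marker_precedes_bug(). A result of False would match the crash reported here, and True would match the other test set.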

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez, Inactive)