Lustre / LU-11038

replay-dual test_26: MDS crash with BUG: unable to handle kernel NULL pointer dereference

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/b045d08a-5b46-11e8-93e6-52540065bddc

      test_26 failed with the following error:

      Test crashed during replay-dual test_26
      

      Environment: Lustre 2.10.4-RC2, SLES12 SP3 on both server and client.

      The same test passed on 2.10.4-RC1.

      [40924.262982] Lustre: DEBUG MARKER: dmesg
      [40924.661415] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-dual test 26: dbench and tar with mds failover ============================================= 02:22:47 \(1526721767\)
      [40924.703962] Lustre: DEBUG MARKER: == replay-dual test 26: dbench and tar with mds failover ============================================= 02:22:47 (1526721767)
      [40924.817912] Lustre: MGS: Received new LWP connection from 10.9.6.10@tcp, removing former export from same NID
      [40925.069705] LustreError: 166-1: MGC10.9.6.12@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [40925.070624] LustreError: 1454:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.9.6.12@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff8800400104c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [40925.070626] LustreError: 1454:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
      [40925.070628] LustreError: 1454:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x65727473756c:0x2:0x0].0x0 (ffff8800400104c0) refcount = 2
      [40925.070631] LustreError: 1454:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [40925.070636] LustreError: 1454:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.9.6.12@tcp lock: ffff88003f3f7000/0xbe3e0ba9bf8042d4 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xbe3e0ba9bf8042db expref: -99 pid: 32450 timeout: 0 lvb_type: 0
      [40925.129264] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-6vm10: executing set_default_debug -1 all 4
      [40925.165673] Lustre: DEBUG MARKER: trevis-6vm10: executing set_default_debug -1 all 4
      [40927.211147] Lustre: DEBUG MARKER: sync; sync; sync
      [40927.709545] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      [40927.758867] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
      [40928.000551] Turning device dm-0 (0xfe00000) read-only
      [40928.020838] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      [40928.039076] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      [40930.184043] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_26 fail mds1 1 times
      [40930.255200] Lustre: DEBUG MARKER: test_26 fail mds1 1 times
      [40930.323562] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [40930.404857] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
      [40930.435375] Lustre: lustre-MDT0000: Not available for connect from 10.9.6.9@tcp (stopping)
      [40930.713797] LustreError: 685:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880068bd96c0 x1600869444311072/t0(0) o13->lustre-OST0002-osc-MDT0000@10.9.6.11@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [40930.713805] LustreError: 685:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
      [40931.602998] Lustre: MGS: Received new LWP connection from 10.9.6.9@tcp, removing former export from same NID
      [40931.603006] Lustre: Skipped 1 previous similar message
      [40932.089774] LustreError: 687:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88006e88e040 x1600869444311152/t0(0) o13->lustre-OST0000-osc-MDT0000@10.9.6.11@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [40932.089784] LustreError: 687:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
      [40932.763746] Lustre: lustre-MDT0000: Not available for connect from 10.9.6.11@tcp (stopping)
      [40932.763750] Lustre: Skipped 1 previous similar message
      [40934.819065] Lustre: lustre-MDT0000: Not available for connect from 10.9.6.10@tcp (stopping)
      [40934.819068] Lustre: Skipped 6 previous similar messages
      [40937.761756] Lustre: lustre-MDT0000: Not available for connect from 10.9.6.11@tcp (stopping)
      [40937.761761] Lustre: Skipped 3 previous similar messages
      [40937.963592] BUG: unable to handle kernel NULL pointer dereference at (null)
      [40937.964686] IP: [<ffffffffa07e33bf>] _ldlm_lock_debug+0x11f/0x710 [ptlrpc]
      [40937.966338] PGD 0 
      [40937.966648] Oops: 0000 [#1] SMP 
      [40937.967150] Modules linked in: osp(OEN) mdd(OEN) lod(OEN) mdt(OEN) lfsck(OEN) mgs(OEN) mgc(OEN) osd_ldiskfs(OEN) ldiskfs(OEN) lquota(OEN) fid(OEN) fld(OEN) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) loop(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rpcrdma(E) sunrpc(E) ib_isert(E) iscsi_target_mod(E) ib_iser(E) libiscsi(E) scsi_transport_iscsi(E) ib_srpt(E) target_core_mod(E) ib_srp(E) scsi_transport_srp(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) configfs(E) ib_cm(E) iw_cm(E) ib_core(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) jitterentropy_rng(E) drbg(E) ansi_cprng(E) aesni_intel(E) 8139too(E) aes_x86_64(E) lrw(E) gf128mul(E)
      [40937.977623] glue_helper(E) 8139cp(E) ablk_helper(E) cryptd(E) joydev(E) mii(E) pcspkr(E) virtio_balloon(E) i2c_piix4(E) button(E) processor(E) ata_generic(E) ext4(E) crc16(E) jbd2(E) mbcache(E) ata_piix(E) virtio_blk(E) ahci(E) libahci(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) libata(E) usb_common(E) floppy(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E) [last unloaded: libcfs]
      [40937.983840] Supported: No, Unsupported modules are loaded
      [40937.984556] CPU: 0 PID: 1607 Comm: ldlm_bl_03 Tainted: G OE N 4.4.126-94.22_lustre-default #1
      [40937.985795] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [40937.986538] task: ffff88003f5c5380 ti: ffff880066d80000 task.ti: ffff880066d80000
      [40937.987515] RIP: 0010:[<ffffffffa07e33bf>] [<ffffffffa07e33bf>] _ldlm_lock_debug+0x11f/0x710 [ptlrpc]
      [40937.988776] RSP: 0018:ffff880066d83ac0 EFLAGS: 00010286
      [40937.989461] RAX: ffffffffa0744620 RBX: ffff88007ac21200 RCX: 0000000000000000
      [40937.990376] RDX: 000000000000000a RSI: 000000000000000a RDI: 0000000000000000
      [40937.991287] RBP: ffff880066d83bf0 R08: 0000000000000004 R09: 0000000000007ebf
      [40937.992207] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003d10ad00
      [40937.993124] R13: ffff8800666d8400 R14: ffffffffa08de134 R15: 0009000000000000
      [40937.994038] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [40937.995082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [40937.995830] CR2: 0000000000000000 CR3: 000000006d4dc000 CR4: 0000000000060670
      [40937.996760] Stack:
      [40937.997046] 0000000000000000 ffffffffa08dda07 ffff88006ee4b622 0000000000000000
      [40937.998116] ffffffffa08e78d6 ffff88006ee4b63c 0000000000000000 ffffffffa0788ab8
      [40937.999172] ffff88006ee4b5af ffff880000000010 ffff880066d83c18 ffff880066d83bd8
      [40938.000227] Call Trace:
      [40938.000678] [<ffffffffa07f0bfe>] ldlm_resource_add_lock+0xee/0x1b0 [ptlrpc]
      [40938.001670] [<ffffffffa07ea8e0>] ldlm_handle_conflict_lock+0x230/0x2e0 [ptlrpc]
      [40938.002694] [<ffffffffa07fdd56>] ldlm_process_plain_lock+0x406/0xaf0 [ptlrpc]
      [40938.003668] [<ffffffffa07e9fba>] ldlm_reprocess_queue+0x11a/0x260 [ptlrpc]
      [40938.004600] [<ffffffffa07eaaa3>] __ldlm_reprocess_all+0x113/0x350 [ptlrpc]
      [40938.005531] [<ffffffffa07eb002>] ldlm_reprocess_res+0x22/0x30 [ptlrpc]
      [40938.006502] [<ffffffffa07799a4>] cfs_hash_for_each_relax+0x244/0x430 [libcfs]
      [40938.007484] [<ffffffffa077c952>] cfs_hash_for_each_nolock+0x72/0x1b0 [libcfs]
      [40938.008462] [<ffffffffa07eb048>] ldlm_reprocess_recovery_done+0x38/0x100 [ptlrpc]
      [40938.009478] [<ffffffffa07ebbe4>] ldlm_export_cancel_locks+0xe4/0xf0 [ptlrpc]
      [40938.010436] [<ffffffffa0812dff>] ldlm_bl_thread_main+0x48f/0x690 [ptlrpc]
      [40938.011411] [<ffffffff8109ea99>] kthread+0xc9/0xe0
      [40938.012176] [<ffffffff81618505>] ret_from_fork+0x55/0x80
      [40938.017587] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
      [40938.018338] 
      [40938.018556] Leftover inexact backtrace:
       
      [40938.019262] [<ffffffff8109e9d0>] ? kthread_park+0x50/0x50
      [40938.019974] Code: 8b 9b 50 01 00 00 4c 8b 93 80 01 00 00 44 8b 8b 88 01 00 00 be 9d ff ff ff 74 04 41 8b 75 40 48 8b 4b 48 45 8b 44 24 18 8b 7b 40 <48> 8b 09 48 8b 09 48 8b 49 10 44 89 8c 24 80 00 00 00 49 89 d9 
      [40938.023999] RIP [<ffffffffa07e33bf>] _ldlm_lock_debug+0x11f/0x710 [ptlrpc]
      [40938.024962] RSP <ffff880066d83ac0>
      [40938.025423] CR2: 0000000000000000
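
      For triage, the oops above can be reduced to its call chain and the bytes at the faulting instruction. The following is a minimal sketch, not part of the original report: the helper names `parse_frames` and `bytes_at_rip` are hypothetical, and `OOPS` is a shortened excerpt of the log above (the Code line is cut to a contiguous run of bytes around the `<48>` marker).

```python
import re

# Shortened excerpt of the oops above; lines are copied verbatim, but the
# Code line is trimmed to a contiguous run of bytes around the <48> marker.
OOPS = """\
[40937.963592] BUG: unable to handle kernel NULL pointer dereference at (null)
[40937.964686] IP: [<ffffffffa07e33bf>] _ldlm_lock_debug+0x11f/0x710 [ptlrpc]
[40938.000227] Call Trace:
[40938.000678] [<ffffffffa07f0bfe>] ldlm_resource_add_lock+0xee/0x1b0 [ptlrpc]
[40938.001670] [<ffffffffa07ea8e0>] ldlm_handle_conflict_lock+0x230/0x2e0 [ptlrpc]
[40938.002694] [<ffffffffa07fdd56>] ldlm_process_plain_lock+0x406/0xaf0 [ptlrpc]
[40938.011411] [<ffffffff8109ea99>] kthread+0xc9/0xe0
[40938.019974] Code: 48 8b 4b 48 45 8b 44 24 18 8b 7b 40 <48> 8b 09 48 8b 09 48 8b 49 10
"""

# One stack frame: [<address>] symbol+0xoffset/0xsize [module]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?:\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_frames(text):
    """Return the frames listed after 'Call Trace:' as a list of dicts."""
    frames = []
    in_trace = False
    for line in text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if m is None:
            break  # the first non-frame line ends the trace
        frames.append({
            "addr": int(m.group("addr"), 16),
            "sym": m.group("sym"),
            "off": int(m.group("off"), 16),
            "mod": m.group("mod"),  # None for core-kernel symbols
        })
    return frames


def bytes_at_rip(text):
    """Return the Code: bytes starting at the one the kernel marks with <..>."""
    m = re.search(r"Code:\s+(.*)", text)
    if m is None:
        return []
    tokens = m.group(1).split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            return [tok.strip("<>")] + tokens[i + 1:]
    return []


if __name__ == "__main__":
    for frame in parse_frames(OOPS):
        print(f"{frame['sym']}+{frame['off']:#x} [{frame['mod']}]")
    print("bytes at RIP:", " ".join(bytes_at_rip(OOPS)[:3]))
```

      Decoded by hand, the bytes `48 8b 09` at RIP appear to be a 64-bit load through RCX (`mov (%rcx),%rcx`). The register dump shows RCX as zero, which is consistent with CR2 = 0 and the NULL dereference reported in `_ldlm_lock_debug`.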
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      replay-dual test_26 - Test crashed during replay-dual test_26
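
      Two values in the log above can be decoded mechanically. The sketch below is an illustration rather than part of the original report, and the helper name `decode_res_name` is hypothetical. It assumes the first field of the MGC resource ID `[0x65727473756c:0x2:0x0]` is a little-endian ASCII string, and that the number in parentheses on the test marker line is a Unix epoch timestamp.

```python
from datetime import datetime, timezone


def decode_res_name(value):
    """Interpret an integer as little-endian ASCII, dropping NUL padding."""
    raw = value.to_bytes(8, "little")
    return raw.rstrip(b"\x00").decode("ascii")


# First field of the resource ID from the ldlm_resource_complain() messages.
print(decode_res_name(0x65727473756C))  # lustre

# Epoch value from the "replay-dual test 26" DEBUG MARKER line.
started = datetime.fromtimestamp(1526721767, tz=timezone.utc)
print(started.isoformat())  # 2018-05-19T09:22:47+00:00
```

      Under these assumptions the resource name matches the filesystem name seen in `lustre-MDT0000`. The marker's wall-clock time of 02:22:47 would then be seven hours behind UTC, which suggests the test node was logging in local time.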

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 2