LU-7935 (project: Lustre)

MDS crash with NULL pointer dereference at 0000000000000010

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.8.0
    • lola
      build: 2.8 GA + patches
    • 3

    Description

      The error occurred during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324). DNE is enabled. The MDTs are formatted with ldiskfs and the OSTs with ZFS. The MDSes are configured for active-active HA failover with one MDT per MDS.
      Nodes lola-8 and lola-9 form an HA cluster.

      Sequence of events:

      • 2016-03-28 11:38:34 triggering fault mds_failover (lola-8)
        rebooting node lola-8
      • 2016-03-28 11:44:55 lola-8 up again
      • 2016-03-28 11:46:04 MDT0000 mounted on lola-9
      • 2016-03-28 11:47:06,056:fsmgmt.fsmgmt:INFO Node lola-9: 'soaked-MDT0000' recovery completed
      • 2016-03-28 11:47:06,056:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0000 ... (i.e. unmounting MDT0000 on lola-9)
      • lola-9 crashed with message:
        <4>NULL pointer dereference at 0000000000000010
        <1>IP: [<ffffffffa084bbb7>] lu_context_key_get+0x17/0x60 [obdclass]
        <4>PGD 0 
        <4>Oops: 0000 [#1] SMP 
        <4>last sysfs file: /sys/devices/system/cpu/online
        <4>CPU 10 
        <4>Modules linked in: mgs(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        <4>
        <4>Pid: 11450, comm: osp_up2-0 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.g2aa02ca.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        <4>RIP: 0010:[<ffffffffa084bbb7>]  [<ffffffffa084bbb7>] lu_context_key_get+0x17/0x60 [obdclass]
        <4>RSP: 0018:ffff8807ae5997f0  EFLAGS: 00010246
        <4>RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8807ae5998b0
        <4>RDX: 0000000280023695 RSI: ffffffffa0cdf9a0 RDI: 0000000000000000
        <4>RBP: ffff8807ae5997f0 R08: ffff8807ae599918 R09: ffff8807e5962290
        <4>R10: 0000000000000007 R11: 2000000000000000 R12: ffff8803f4ed6cc0
        <4>R13: 0000000280023695 R14: ffff8807ae5998b0 R15: ffff8807b15a1b50
        <4>FS:  0000000000000000(0000) GS:ffff88044e440000(0000) knlGS:0000000000000000
        <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        <4>CR2: 0000000000000010 CR3: 0000000001a85000 CR4: 00000000000407e0
        <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        <4>Process osp_up2-0 (pid: 11450, threadinfo ffff8807ae598000, task ffff8807af36cab0)
        <4>Stack:
        <4> ffff8807ae599840 ffffffffa0cd883f ffff880000033c28 0006125000000001
        <4><d> 0000000000000246 ffff8803d9081498 0000000280023695 ffff8803f4ed6cc0
        <4><d> ffff8807ae5998b0 0000000000000000 ffff8807ae599890 ffffffffa0cd8c83
        <4>Call Trace:
        <4> [<ffffffffa0cd883f>] fld_local_lookup+0x4f/0x290 [fld]
        <4> [<ffffffffa0cd8c83>] fld_server_lookup+0x53/0x330 [fld]
        <4> [<ffffffffa123738f>] lod_fld_lookup+0x34f/0x520 [lod]
        <4> [<ffffffff811753da>] ? kmem_cache_alloc+0x18a/0x190
        <4> [<ffffffffa124d243>] lod_object_init+0x103/0x3c0 [lod]
        <4> [<ffffffffa084f1f8>] lu_object_alloc+0xd8/0x320 [obdclass]
        <4> [<ffffffffa08505e1>] lu_object_find_try+0x151/0x260 [obdclass]
        <4> [<ffffffffa08507a1>] lu_object_find_at+0xb1/0xe0 [obdclass]
        <4> [<ffffffffa084f093>] ? lu_object_free+0x113/0x1a0 [obdclass]
        <4> [<ffffffffa085080f>] lu_object_find_slice+0x1f/0x80 [obdclass]
        <4> [<ffffffffa1342a4e>] osp_trans_stop_cb+0x1be/0x2d0 [osp]
        <4> [<ffffffffa13442be>] osp_update_interpret+0x21e/0x4a0 [osp]
        <4> [<ffffffff8108742c>] ? lock_timer_base+0x3c/0x70
        <4> [<ffffffffa0a600e5>] ptlrpc_check_set+0x615/0x1da0 [ptlrpc]
        <4> [<ffffffff8152b22a>] ? schedule_timeout+0x19a/0x2e0
        <4> [<ffffffffa0a61bca>] ptlrpc_set_wait+0x35a/0x960 [ptlrpc]
        <4> [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
        <4> [<ffffffffa0a6de85>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
        <3>LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 0@lo failed: rc = -107
        <3>LustreError: Skipped 752 previous similar messages
        <4>Lustre: soaked-MDT0000-osp-MDT0001: Connection to soaked-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
        <4> [<ffffffffa0a62251>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
        <4> [<ffffffffa13449c6>] osp_send_update_req+0x256/0x850 [osp]
        <4> [<ffffffffa134130c>] ? osp_get_next_request+0xfc/0x1a0 [osp]
        <4> [<ffffffffa134563f>] osp_send_update_thread+0x20f/0x7ac [osp]
        <4> [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
        <4> [<ffffffffa1345430>] ? osp_send_update_thread+0x0/0x7ac [osp]
        <4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
        <4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        <4>Code: c4 38 5b 41 5c 41 5d c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 48 63 46 20 48 3b 34 c5 e0 26 8d a0 75 0a <48> 8b 57 10 48 8b 04 c2 c9 c3 48 c7 c7 00 b3 8a a0 48 c7 c2 c8 
        <1>RIP  [<ffffffffa084bbb7>] lu_context_key_get+0x17/0x60 [obdclass]
        <4> RSP <ffff8807ae5997f0>
        <4>CR2: 0000000000000010
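
The faulting address can be cross-checked against the register dump. The following is a minimal sketch in Python that uses only lines excerpted from the oops above: it decodes the instruction bytes that the Code: line marks with <...> and checks that base register plus displacement equals CR2. The helper names (parse_registers, faulting_bytes, decode_mov_load) are hypothetical, and the decoder deliberately handles only the one encoding seen here (REX.W 8B /r with an 8-bit displacement), not x86-64 in general.

```python
import re

# Lines excerpted from the oops above, with the <N> log-level prefixes stripped.
OOPS = """\
RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8807ae5998b0
RDX: 0000000280023695 RSI: ffffffffa0cdf9a0 RDI: 0000000000000000
CR2: 0000000000000010 CR3: 0000000001a85000 CR4: 00000000000407e0
Code: 48 3b 34 c5 e0 26 8d a0 75 0a <48> 8b 57 10 48 8b 04 c2 c9 c3
"""

# x86-64 register numbering for the low eight general-purpose registers.
REG64 = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]


def parse_registers(text):
    """Return {name: value} for every 'XXX: <16 hex digits>' pair in the text."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the bytes of the Code: line starting at the <..> marked byte."""
    tokens = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_mov_load(insn):
    """Decode 'mov disp8(%base), %dest' (REX.W 8B /r, mod=01, no SIB).

    Returns (dest, base, displacement), or None for any other encoding.
    """
    if len(insn) < 4:
        return None
    rex, opcode, modrm, disp = insn[:4]
    # Require REX.W set and REX.R/REX.B clear (r8-r15 are not handled here).
    if (rex & 0xF0) != 0x40 or not (rex & 0x08) or (rex & 0x05):
        return None
    if opcode != 0x8B:
        return None
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:  # rm == 4 would mean a SIB byte follows
        return None
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REG64[reg], REG64[rm], disp


regs = parse_registers(OOPS)
decoded = decode_mov_load(faulting_bytes(OOPS))
if decoded is not None:
    dest, base, disp = decoded
    addr = regs[base] + disp
    print(f"mov {disp:#x}(%{base.lower()}), %{dest.lower()}")
    print(f"{base} = {regs[base]:#x}, effective address = {addr:#x}, "
          f"CR2 = {regs['CR2']:#x}, match = {addr == regs['CR2']}")
```

For this oops the marked bytes 48 8b 57 10 decode to a load from 0x10(%rdi) into %rdx, and RDI is 0, which matches CR2 = 0x10. Under the x86-64 System V calling convention RDI carries the first argument, so this suggests that lu_context_key_get() was called with a NULL first argument. Confirming which caller passed it would need the attached vmcore data.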
        

      Attached files:
      console, messages and vmcore-dmesg.txt of node lola-9

            People

              laisiyao Lai Siyao
              heckes Frank Heckes (Inactive)