Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.10.0
-
Soak performance cluster.
-
3
-
9223372036854775807
Description
Soak cluster had completed MDS failover of MDT0002, lfsck was aborted on the MDS after timeout.
soak-2 OSS node crashed while under load.
crash dump is available on soak-2, dmesg attached.
Unfortunately console logged had failed, so some data is not available.
Syslog:
Apr 22 15:02:44 soak-2 kernel: LustreError: Skipped 3 previous similar messages Apr 22 15:02:44 soak-2 kernel: Lustre: soaked-MDT0002-lwp-OST0006: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10) Apr 22 15:02:44 soak-2 kernel: Lustre: Skipped 4 previous similar messages Apr 22 15:03:09 soak-2 kernel: LustreError: 167-0: soaked-MDT0002-lwp-OST0012: This client was evicted by soaked-MDT0002; in progress operations using this service will fail. Apr 22 15:03:09 soak-2 kernel: LustreError: Skipped 1 previous similar message Apr 22 15:03:09 soak-2 kernel: Lustre: soaked-MDT0002-lwp-OST0012: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10) Apr 22 15:03:09 soak-2 kernel: Lustre: Skipped 1 previous similar message Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST0006: deleting orphan objects from 0x480000401:15286395 to 0x480000401:15292401 Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x300000400:15323501 to 0x300000400:15333313 Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x600000400:15327168 to 0x600000400:15338321 Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST0012: deleting orphan objects from 0x740000402:15313619 to 0x740000402:15327665 Apr 22 15:21:56 soak-2 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="3806" x-info="http://www.rsyslog.com"] start
Dmesg from vmcore:
32845.239323] Lustre: soaked-OST000c: deleting orphan objects from 0x600000400:15327168 to 0x600000400:15338321 [32845.239325] Lustre: soaked-OST0012: deleting orphan objects from 0x740000402:15313619 to 0x740000402:15327665 [33465.017357] LustreError: 15131:0:(osd_object.c:427:osd_object_init()) soaked-OST000c: lookup [0x1000c0000:0x21c57a8:0x0]/0x387ef2 failed: rc = 17 [33465.035118] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011 [33465.045485] IP: [<ffffffffa0a590d8>] lu_object_find_try+0x178/0x2b0 [obdclass] [33465.055153] PGD 0 [33465.058735] Oops: 0000 [#1] SMP [33465.063750] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin iTCO_wdt iTCO_vendor_support mei_me pcspkr ses ipmi_devintf enclosure mei ntb sg ipmi_si i2c_i801 lpc_ich sb_edac ipmi_msghandler ioatdma edac_core shpchp wmi nfsd dm_multipath nfs_acl dm_mod lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache [33465.151994] jbd2 zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb isci ttm ahci crct10dif_pclmul crct10dif_common libsas ptp crc32c_intel libahci pps_core mlx4_core drm mpt2sas dca libata raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes [33465.195938] CPU: 15 PID: 15131 Comm: ldlm_cn01_018 Tainted: P OE ------------ 3.10.0-514.10.2.el7_lustre.x86_64 #1 [33465.211995] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [33465.225976] task: ffff88042a2dbec0 ti: ffff88073b4f8000 task.ti: ffff88073b4f8000 [33465.236319] RIP: 0010:[<ffffffffa0a590d8>] [<ffffffffa0a590d8>] lu_object_find_try+0x178/0x2b0 [obdclass] [33465.249161] RSP: 0018:ffff88073b4fbaa8 EFLAGS: 00010207 [33465.256998] RAX: 0000000000000011 RBX: ffff880825cd6cc0 RCX: ffff8800ad22a918 [33465.266794] RDX: 00000001002a0021 RSI: ffffea0009502a80 RDI: ffff8800ad22a930 [33465.276797] RBP: ffff88073b4fbb08 R08: ffff8802540aaa80 R09: 00000001002a0020 [33465.286649] R10: 00000000540aad01 R11: ffffea0009502a80 R12: ffff8803fd4d5020 [33465.296218] R13: ffff8804b928e1f0 R14: 0000000000000000 R15: ffff88080b3d9000 [33465.305840] FS: 0000000000000000(0000) GS:ffff88082d9c0000(0000) knlGS:0000000000000000 [33465.316722] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [33465.324590] CR2: 0000000000000011 CR3: 00000000019ba000 CR4: 00000000000407e0 [33465.333919] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [33465.343445] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [33465.353428] Stack: [33465.357274] ffff88073b4fbc00 fffffffffffffffe ffff88073b4fbb30 000000000002246f [33465.367105] ffff8800ad22a800 ffff88000000000d 00000000a848b037 ffff88073b4fbc00 [33465.376714] ffff88080b3d9000 ffff8803fd4d5020 0000000000000000 00000001000c0000 [33465.386284] Call Trace: [33465.390755] [<ffffffffa0a592bc>] lu_object_find_at+0xac/0xe0 [obdclass] [33465.400101] [<ffffffffa0fb27db>] ? ofd_key_init+0x3b/0xd0 [ofd] [33465.408740] [<ffffffffa0a59306>] lu_object_find+0x16/0x20 [obdclass] [33465.417192] [<ffffffffa0fca4e5>] ofd_object_find+0x35/0x100 [ofd] [33465.425447] [<ffffffffa0fd8d12>] ofd_lvbo_update+0x4b2/0xfc8 [ofd] [33465.434020] [<ffffffffa0fd7ca9>] ? ofd_lvbo_free+0x59/0xe0 [ofd] [33465.442680] [<ffffffffa0c566f2>] ldlm_request_cancel+0x132/0x720 [ptlrpc] [33465.451967] [<ffffffffa0c5dc6a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc] [33465.461254] [<ffffffffa0c5df41>] ldlm_cancel_handler+0x141/0x490 [ptlrpc] [33465.470567] [<ffffffffa0c8df1b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [33465.480874] [<ffffffffa09359a8>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [33465.489808] [<ffffffffa0c8bac8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [33465.498533] [<ffffffff810c5092>] ? default_wake_function+0x12/0x20 [33465.507323] [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90 [33465.515728] [<ffffffffa0c91f40>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [33465.524220] [<ffffffffa0c914a0>] ? ptlrpc_register_service+0xe60/0xe60 [ptlrpc] [33465.534131] [<ffffffff810b06ff>] kthread+0xcf/0xe0 [33465.540714] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140 [33465.549134] [<ffffffff81696c98>] ret_from_fork+0x58/0x90 [33465.556096] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140 [33465.564386] Code: c6 62 e0 48 83 f8 fe 0f 85 54 ff ff ff 48 8b 7d a0 4c 89 f1 4c 89 e2 4c 89 fe e8 24 fb ff ff 48 3d 00 f0 ff ff 0f 87 36 ff ff ff <48> 8b 38 ba 10 00 00 00 4c 89 e6 48 89 45 a8 e8 04 80 8c e0 85 [33465.588258] RIP [<ffffffffa0a590d8>] lu_object_find_try+0x178/0x2b0 [obdclass] [33465.597871] RSP <ffff88073b4fbaa8>
Attachments
Issue Links
- is related to
-
LU-9465 Kernel NULL pointer: osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17
-
- Resolved
-