[LU-9465] Kernel NULL pointer: osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17 Created: 08/May/17 Updated: 08/Jul/17 Resolved: 10/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak stress cluster |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Bugzilla ID: | 9,394 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
soak-7 survived several failovers, last failover at 2017-05-07 07:41:31 May 7 18:22:39 soak-7 kernel: LustreError: 11-0: soaked-MDT0003-lwp-OST0011: operation obd_ping to node 192.168.1.110@o2ib10 failed: rc = -107 May 7 18:22:39 soak-7 kernel: Lustre: soaked-MDT0003-lwp-OST0005: Connection to soaked-MDT0003 (at 192.168.1.110@o2ib10) was lost; in progress operations using this service will wait for recovery to complete May 7 18:22:39 soak-7 kernel: Lustre: Skipped 2 previous similar messages May 7 18:22:39 soak-7 kernel: LustreError: Skipped 3 previous similar messages May 7 18:23:21 soak-7 kernel: LNet: 228:0:(o2iblnd_cb.c:2421:kiblnd_passive_connect()) Conn stale 192.168.1.111@o2ib10 version 12/12 incarnation 1494181401470091/1494181401470091 May 7 18:23:21 soak-7 kernel: Lustre: soaked-OST0005: Connection restored to (at 192.168.1.111@o2ib10) May 7 18:23:21 soak-7 kernel: Lustre: Skipped 2 previous similar messages May 7 18:23:22 soak-7 kernel: LNet: 7422:0:(o2iblnd_cb.c:1377:kiblnd_reconnect_peer()) Abort reconnection of 192.168.1.111@o2ib10: connected May 7 18:23:29 soak-7 kernel: LustreError: 167-0: soaked-MDT0003-lwp-OST0011: This client was evicted by soaked-MDT0003; in progress operations using this service will fail. May 7 18:23:29 soak-7 kernel: LustreError: Skipped 1 previous similar message May 7 18:23:43 soak-7 kernel: Lustre: soaked-OST0005: deleting orphan objects from 0x440000401:26279429 to 0x440000401:26291121 May 7 18:23:43 soak-7 kernel: Lustre: soaked-OST0011: deleting orphan objects from 0x780000401:26209136 to 0x780000401:26218273 May 7 18:23:43 soak-7 kernel: Lustre: soaked-OST000b: deleting orphan objects from 0x5c0000400:26329949 to 0x5c0000400:26339745 May 7 18:23:43 soak-7 kernel: Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000401:26229632 to 0x8c0000401:26238017 May 7 18:23:54 soak-7 kernel: LustreError: 167-0: soaked-MDT0003-lwp-OST000b: This client was evicted by soaked-MDT0003; in progress operations using this service will fail. May 7 18:23:54 soak-7 kernel: Lustre: soaked-MDT0003-lwp-OST0017: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10) Then, a hard crash [38854.133273] Lustre: soaked-MDT0003-lwp-OST0017: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
[38854.147850] Lustre: Skipped 3 previous similar messages
[55622.538966] perf: interrupt took too long (5010 > 5007), lowering kernel.perf_event_max_sample_rate to 39000
[60371.183844] LustreError: 16407:0:(osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17
[60371.201275] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
[60371.211442] IP: [<ffffffffa0a0d328>] lu_object_find_try+0x178/0x2b0 [obdclass]
[60371.221570] PGD 0
[60371.225825] Oops: 0000 [#1] SMP
There is a crash dump available on the node, vmcore-dmesg attached. |
| Comments |
| Comment by Peter Jones [ 08/May/17 ] |
|
Lai Could you please advise on this one? Thanks Peter |
| Comment by James Nunez (Inactive) [ 09/May/17 ] |
|
Just a couple of notes for this ticket: |
| Comment by Cliff White (Inactive) [ 09/May/17 ] |
|
Restarted testing with latest master, 3577
MDS reported a single error: May 9 20:08:39 soak-8 kernel: LustreError: 5550:0:(lfsck_lib.c:2680:lfsck_load_one_trace_file()) soaked-MDT0000-osd: unlink lfsck sub trace file lfsck_namespace_00: rc = 0 Before soak hits timeout, MDS has wedged: May 9 20:12:41 soak-8 kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:5551]
May 9 20:12:41 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin ipmi_ssif sb_edac ipmi_devintf ntb sg iTCO_wdt ioatdma shpchp edac_core mei_me iTCO_vendor_support mei lpc_ich ipmi_si pcspkr i2c_i801
May 9 20:12:41 soak-8 kernel: ipmi_msghandler wmi nfsd dm_multipath dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm ahci crct10dif_pclmul crct10dif_common ptp libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
May 9 20:12:41 soak-8 kernel: CPU: 6 PID: 5551 Comm: OI_scrub Tainted: P OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1
May 9 20:12:41 soak-8 kernel: Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
May 9 20:12:41 soak-8 kernel: task: ffff88083fde6dd0 ti: ffff880703600000 task.ti: ffff880703600000
May 9 20:12:41 soak-8 kernel: RIP: 0010:[<ffffffffa121d1d9>] [<ffffffffa121d1d9>] osd_inode_iteration+0x489/0xcc0 [osd_ldiskfs]
May 9 20:12:41 soak-8 kernel: RSP: 0018:ffff880703603d18 EFLAGS: 00000293
May 9 20:12:41 soak-8 kernel: RAX: 0000000000000004 RBX: 0000000023f30a01 RCX: 0000000000000000
May 9 20:12:41 soak-8 kernel: RDX: ffff880703603d78 RSI: ffff8800b2a36000 RDI: ffff8803162f6000
May 9 20:12:41 soak-8 kernel: RBP: ffff880703603df0 R08: ffff880703603d57 R09: 0000000000000004
May 9 20:12:41 soak-8 kernel: R10: 0000000023f30a01 R11: ffffea000c8fcc00 R12: 0000000023f30a01
May 9 20:12:41 soak-8 kernel: R13: ffff880703603d08 R14: 0000000023f30a01 R15: ffff880703603d08
May 9 20:12:41 soak-8 kernel: FS: 0000000000000000(0000) GS:ffff88042e180000(0000) knlGS:0000000000000000
May 9 20:12:41 soak-8 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 9 20:12:41 soak-8 kernel: CR2: 00007f64d55202e0 CR3: 00000000019be000 CR4: 00000000000407e0
May 9 20:12:41 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 9 20:12:41 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 9 20:12:41 soak-8 kernel: Stack:
May 9 20:12:41 soak-8 kernel: ffffffffa121b990 ffffffffa1217a20 ffff8800b2a36000 00000000810d354f
May 9 20:12:41 soak-8 kernel: ffff8803162f6000 ffff8800b2a37468 0000000020000000 010000000000000c
May 9 20:12:41 soak-8 kernel: 0000000000000000 0000000000000000 ffff8800b2a36000 0000000000000000
May 9 20:12:41 soak-8 kernel: Call Trace:
May 9 20:12:41 soak-8 kernel: [<ffffffffa121b990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
May 9 20:12:41 soak-8 kernel: [<ffffffffa1217a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
May 9 20:12:41 soak-8 kernel: [<ffffffffa121e370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
May 9 20:12:41 soak-8 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
May 9 20:12:41 soak-8 kernel: [<ffffffffa121da10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
May 9 20:12:41 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
May 9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
May 9 20:12:41 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
May 9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
May 9 20:12:41 soak-8 kernel: Code: 00 e8 7c eb 97 ff e9 0f fc ff ff 0f 1f 80 00 00 00 00 45 89 e9 4c 8d 85 67 ff ff ff 48 8b 4d a8 48 8d 55 88 48 8b b5 38 ff ff ff <48> 8b bd 48 ff ff ff 48 8b 85 28 ff ff ff ff d0 85 c0 41 89 c5
lfsck hit 600 second timeout, abort attempted: 2017-05-09 20:18:46,982:fsmgmt.fsmgmt:ERROR lfsck still running after 600s, aborting 2017-05-09 20:18:46,983:fsmgmt.fsmgmt:INFO executing cmd: lctl lfsck_stop -M soaked-MDT0000 One the MDS hits the lockup, it hits it over and over and within minutes is doing nothing other that hitting the lockup. Crash dump is available on soak-8 in /var/crash. |
| Comment by Lai Siyao [ 10/May/17 ] |
|
It looks to be a code error, I will make a patch soon. |
| Comment by Lai Siyao [ 10/May/17 ] |
|
ahh, just found it's fixed in |
| Comment by Lai Siyao [ 10/May/17 ] |
|
sorry, I tried to mark it duplicate of |
| Comment by Cliff White (Inactive) [ 10/May/17 ] |
|
I tested master build 3577 and hit a very similar problem as reported above. I don't think it's fixed |
| Comment by Cliff White (Inactive) [ 10/May/17 ] |
|
Tests of latest master build failed. That's not a fix |
| Comment by Andreas Dilger [ 10/May/17 ] |
|
The |
| Comment by Cliff White (Inactive) [ 10/May/17 ] |
|
I have created |
| Comment by James Nunez (Inactive) [ 10/May/17 ] |
|
This issue should be fixed with the patch for |