[LU-9465] Kernel NULL pointer: osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17 Created: 08/May/17  Updated: 08/Jul/17  Resolved: 10/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak stress cluster


Attachments: Text File soak-8.console.log     Text File soak-8.vmcore-dmesg.txt     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-9394 lu_object_find_try - kernel NULL poin... Resolved
Severity: 3
Bugzilla ID: 9,394
Rank (Obsolete): 9223372036854775807

 Description   

soak-7 survived several failovers, last failover at 2017-05-07 07:41:31
The soak cluster failed over soak-10 at 2017-05-07 18:23:22
Immediately after finishing recovery, soak-7 crashed.
The OSS is reconnected to the recently failed-over MDT on soak-10/11

May  7 18:22:39 soak-7 kernel: LustreError: 11-0: soaked-MDT0003-lwp-OST0011: operation obd_ping to node 192.168.1.110@o2ib10 failed: rc = -107
May  7 18:22:39 soak-7 kernel: Lustre: soaked-MDT0003-lwp-OST0005: Connection to soaked-MDT0003 (at 192.168.1.110@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
May  7 18:22:39 soak-7 kernel: Lustre: Skipped 2 previous similar messages
May  7 18:22:39 soak-7 kernel: LustreError: Skipped 3 previous similar messages
May  7 18:23:21 soak-7 kernel: LNet: 228:0:(o2iblnd_cb.c:2421:kiblnd_passive_connect()) Conn stale 192.168.1.111@o2ib10 version 12/12 incarnation 1494181401470091/1494181401470091
May  7 18:23:21 soak-7 kernel: Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.111@o2ib10)
May  7 18:23:21 soak-7 kernel: Lustre: Skipped 2 previous similar messages
May  7 18:23:22 soak-7 kernel: LNet: 7422:0:(o2iblnd_cb.c:1377:kiblnd_reconnect_peer()) Abort reconnection of 192.168.1.111@o2ib10: connected
May  7 18:23:29 soak-7 kernel: LustreError: 167-0: soaked-MDT0003-lwp-OST0011: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
May  7 18:23:29 soak-7 kernel: LustreError: Skipped 1 previous similar message
May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST0005: deleting orphan objects from 0x440000401:26279429 to 0x440000401:26291121
May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST0011: deleting orphan objects from 0x780000401:26209136 to 0x780000401:26218273
May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST000b: deleting orphan objects from 0x5c0000400:26329949 to 0x5c0000400:26339745
May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000401:26229632 to 0x8c0000401:26238017
May  7 18:23:54 soak-7 kernel: LustreError: 167-0: soaked-MDT0003-lwp-OST000b: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
May  7 18:23:54 soak-7 kernel: Lustre: soaked-MDT0003-lwp-OST0017: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)

Then, a hard crash

[38854.133273] Lustre: soaked-MDT0003-lwp-OST0017: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
[38854.147850] Lustre: Skipped 3 previous similar messages
[55622.538966] perf: interrupt took too long (5010 > 5007), lowering kernel.perf_event_max_sample_rate to 39000
[60371.183844] LustreError: 16407:0:(osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17
[60371.201275] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
[60371.211442] IP: [<ffffffffa0a0d328>] lu_object_find_try+0x178/0x2b0 [obdclass]
[60371.221570] PGD 0
[60371.225825] Oops: 0000 [#1] SMP

There is a crash dump available on the node, vmcore-dmesg attached.



 Comments   
Comment by Peter Jones [ 08/May/17 ]

Lai

Could you please advise on this one?

Thanks

Peter

Comment by James Nunez (Inactive) [ 09/May/17 ]

Just a couple of notes for this ticket:
LFSCK was turned off
Soak testing ran for approximately 48 hours before the crash
Running master - build #3573 with no other patches applied

Comment by Cliff White (Inactive) [ 09/May/17 ]

Restarted testing with latest master, 3577

  • Competed two failovers with lfsck turned off.
  • Restarted with lfsck turned on
  • soak-5 (OSS) completed failover:
    2017-05-09 20:08:27
    ,327:fsmgmt.fsmgmt:INFO     oss_failover completed, running lfsck

MDS reported a single error:

May  9 20:08:39 soak-8 kernel: LustreError: 5550:0:(lfsck_lib.c:2680:lfsck_load_one_trace_file()) soaked-MDT0000-osd: unlink lfsck sub trace file lfsck_namespace_00: rc = 0

Before soak hits timeout, MDS has wedged:

May  9 20:12:41 soak-8 kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:5551]
May  9 20:12:41 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin ipmi_ssif sb_edac ipmi_devintf ntb sg iTCO_wdt ioatdma shpchp edac_core mei_me iTCO_vendor_support mei lpc_ich ipmi_si pcspkr i2c_i801
May  9 20:12:41 soak-8 kernel: ipmi_msghandler wmi nfsd dm_multipath dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm ahci crct10dif_pclmul crct10dif_common ptp libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
May  9 20:12:41 soak-8 kernel: CPU: 6 PID: 5551 Comm: OI_scrub Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
May  9 20:12:41 soak-8 kernel: Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
May  9 20:12:41 soak-8 kernel: task: ffff88083fde6dd0 ti: ffff880703600000 task.ti: ffff880703600000
May  9 20:12:41 soak-8 kernel: RIP: 0010:[<ffffffffa121d1d9>]  [<ffffffffa121d1d9>] osd_inode_iteration+0x489/0xcc0 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: RSP: 0018:ffff880703603d18  EFLAGS: 00000293
May  9 20:12:41 soak-8 kernel: RAX: 0000000000000004 RBX: 0000000023f30a01 RCX: 0000000000000000
May  9 20:12:41 soak-8 kernel: RDX: ffff880703603d78 RSI: ffff8800b2a36000 RDI: ffff8803162f6000
May  9 20:12:41 soak-8 kernel: RBP: ffff880703603df0 R08: ffff880703603d57 R09: 0000000000000004
May  9 20:12:41 soak-8 kernel: R10: 0000000023f30a01 R11: ffffea000c8fcc00 R12: 0000000023f30a01
May  9 20:12:41 soak-8 kernel: R13: ffff880703603d08 R14: 0000000023f30a01 R15: ffff880703603d08
May  9 20:12:41 soak-8 kernel: FS:  0000000000000000(0000) GS:ffff88042e180000(0000) knlGS:0000000000000000
May  9 20:12:41 soak-8 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  9 20:12:41 soak-8 kernel: CR2: 00007f64d55202e0 CR3: 00000000019be000 CR4: 00000000000407e0
May  9 20:12:41 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May  9 20:12:41 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May  9 20:12:41 soak-8 kernel: Stack:
May  9 20:12:41 soak-8 kernel: ffffffffa121b990 ffffffffa1217a20 ffff8800b2a36000 00000000810d354f
May  9 20:12:41 soak-8 kernel: ffff8803162f6000 ffff8800b2a37468 0000000020000000 010000000000000c
May  9 20:12:41 soak-8 kernel: 0000000000000000 0000000000000000 ffff8800b2a36000 0000000000000000
May  9 20:12:41 soak-8 kernel: Call Trace:
May  9 20:12:41 soak-8 kernel: [<ffffffffa121b990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffffa1217a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffffa121e370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
May  9 20:12:41 soak-8 kernel: [<ffffffffa121da10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
May  9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
May  9 20:12:41 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
May  9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
May  9 20:12:41 soak-8 kernel: Code: 00 e8 7c eb 97 ff e9 0f fc ff ff 0f 1f 80 00 00 00 00 45 89 e9 4c 8d 85 67 ff ff ff 48 8b 4d a8 48 8d 55 88 48 8b b5 38 ff ff ff <48> 8b bd 48 ff ff ff 48 8b 85 28 ff ff ff ff d0 85 c0 41 89 c5

lfsck hit 600 second timeout, abort attempted:

2017-05-09 20:18:46,982:fsmgmt.fsmgmt:ERROR    lfsck still running after 600s, aborting
2017-05-09 20:18:46,983:fsmgmt.fsmgmt:INFO     executing cmd: lctl lfsck_stop -M soaked-MDT0000

One the MDS hits the lockup, it hits it over and over and within minutes is doing nothing other that hitting the lockup.
At this point, decided not to wait for the crash, dumped stacks, them forced a crash dump.

Crash dump is available on soak-8 in /var/crash.
console log, vmcore-dmesg are attached.

Comment by Lai Siyao [ 10/May/17 ]

It looks to be a code error, I will make a patch soon.

Comment by Lai Siyao [ 10/May/17 ]

ahh, just found it's fixed in LU-9394 already, and the fix is in latest master.

Comment by Lai Siyao [ 10/May/17 ]

sorry, I tried to mark it duplicate of LU-9394, but I can't fill 'LU-9394' in Bugzilla-Id field. Can someone help me fix it?

Comment by Cliff White (Inactive) [ 10/May/17 ]

I tested master build 3577 and hit a very similar problem as reported above. I don't think it's fixed

Comment by Cliff White (Inactive) [ 10/May/17 ]

Tests of latest master build failed. That's not a fix

Comment by Andreas Dilger [ 10/May/17 ]

The LU-9394 fix is for the original problem reported in this bug, "NULL pointer dereference at lu_object_find_try()". The stack trace added yesterday is for something completely different, "soft lockup in osd_inode_iteration()", which looks to be related to LFSCK/Scrub. If the testing with the latest master build hit the soft lockup in osd_inode_iteration() then it should go into a separate ticket.

Comment by Cliff White (Inactive) [ 10/May/17 ]

I have created LU-9488 and transferred the information there.

Comment by James Nunez (Inactive) [ 10/May/17 ]

This issue should be fixed with the patch for LU-9394

Generated at Sat Feb 10 02:26:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.