Details
- Bug
- Resolution: Duplicate
- Critical
- None
- Lustre 2.7.0
- None
- lustre 2.7.1-fe
- 2
- 9223372036854775807
Description
OSS console errors
LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
LNet: 46045:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
LNet: 56154:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
------------[ cut here ]------------
WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
Hardware name: SUMMIT
list_del corruption. prev->next should be ffff881d63ead4d0, but was (null)
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 acpi_cpufreq freq_table mperf lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc scsi_transport_fc scsi_tgt sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_mirror dm_region_hash dm_log dm_multipath dm_mod iTCO_wdt iTCO_vendor_support microcode sg wmi igb hwmon dca i2c_algo_bit ptp pps_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_bic ext3 jbd sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) ahci gru [last unloaded: scsi_wait_scan]
Pid: 8603, comm: kiblnd_sd_02_01 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
Call Trace:
[<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81074216>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff812bda6e>] ? list_del+0x6e/0xa0
[<ffffffffa052c5c9>] ? lnet_me_unlink+0x39/0x140 [lnet]
[<ffffffffa05303f8>] ? lnet_md_unlink+0x2f8/0x3e0 [lnet]
[<ffffffffa0531b9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
[<ffffffffa0a1f727>] ? kiblnd_recv+0x107/0x780 [ko2iblnd]
[<ffffffffa0531d1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
[<ffffffffa0532621>] ? lnet_ptl_match_md+0x281/0x870 [lnet]
[<ffffffffa05396e7>] ? lnet_parse_local+0x307/0xc60 [lnet]
[<ffffffffa053a6da>] ? lnet_parse+0x69a/0xcf0 [lnet]
[<ffffffffa0a1ff3b>] ? kiblnd_handle_rx+0x19b/0x620 [ko2iblnd]
[<ffffffffa0a212be>] ? kiblnd_scheduler+0xefe/0x10d0 [ko2iblnd]
[<ffffffff81064f90>] ? default_wake_function+0x0/0x20
[<ffffffffa0a203c0>] ? kiblnd_scheduler+0x0/0x10d0 [ko2iblnd]
[<ffffffff8109dc8e>] ? kthread+0x9e/0xc0
[<ffffffff8100c28a>] ? child_rip+0xa/0x20
[<ffffffff8109dbf0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
---[ end trace 1063d2ffc2578a2f ]---
------------[ cut here ]------------
The backtrace (bt) from the crash dump looks like this:
PID: 8603 TASK: ffff8810271fa040 CPU: 11 COMMAND: "kiblnd_sd_02_01"
#0 [ffff880ff8b734f0] machine_kexec at ffffffff8103b5db
#1 [ffff880ff8b73550] crash_kexec at ffffffff810c9412
#2 [ffff880ff8b73620] kdb_kdump_check at ffffffff812973d7
#3 [ffff880ff8b73630] kdb_main_loop at ffffffff8129a5c7
#4 [ffff880ff8b73740] kdb_save_running at ffffffff8129472e
#5 [ffff880ff8b73750] kdba_main_loop at ffffffff8147cd68
#6 [ffff880ff8b73790] kdb at ffffffff812978c6
#7 [ffff880ff8b73800] kdba_entry at ffffffff8147c687
#8 [ffff880ff8b73810] notifier_call_chain at ffffffff81568515
#9 [ffff880ff8b73850] atomic_notifier_call_chain at ffffffff8156857a
#10 [ffff880ff8b73860] notify_die at ffffffff810a44fe
#11 [ffff880ff8b73890] __die at ffffffff815663e2
#12 [ffff880ff8b738c0] no_context at ffffffff8104c822
#13 [ffff880ff8b73910] __bad_area_nosemaphore at ffffffff8104cad5
#14 [ffff880ff8b73960] bad_area_nosemaphore at ffffffff8104cba3
#15 [ffff880ff8b73970] __do_page_fault at ffffffff8104d29c
#16 [ffff880ff8b73a90] do_page_fault at ffffffff8156845e
#17 [ffff880ff8b73ac0] page_fault at ffffffff81565765
[exception RIP: lnet_mt_match_md+135]
RIP: ffffffffa0531d07 RSP: ffff880ff8b73b70 RFLAGS: 00010286
RAX: ffff881d88420000 RBX: ffff880ff8b73c70 RCX: 0000000000000007
RDX: 0000000000000004 RSI: ffff880ff8b73c70 RDI: ffffffffffffffff
RBP: ffff880ff8b73bb0 R8: 0000000000000001 R9: d400000000000000
R10: 0000000000000001 R11: 0000000000000012 R12: 0000000000000000
R13: ffff881730ca6200 R14: 00d100120be91b91 R15: 0000000000000008
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#18 [ffff880ff8b73bb8] lnet_ptl_match_md at ffffffffa0532621 [lnet]
#19 [ffff880ff8b73c38] lnet_parse_local at ffffffffa05396e7 [lnet]
#20 [ffff880ff8b73cd8] lnet_parse at ffffffffa053a6da [lnet]
#21 [ffff880ff8b73d68] kiblnd_handle_rx at ffffffffa0a1ff3b [ko2iblnd]
#22 [ffff880ff8b73db8] kiblnd_scheduler at ffffffffa0a212be [ko2iblnd]
#23 [ffff880ff8b73ee8] kthread at ffffffff8109dc8e
#24 [ffff880ff8b73f48] kernel_thread at ffffffff8100c28a
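For readers unfamiliar with the "list_del corruption" warning in the console log above, the following is a simplified Python model of a checked list deletion. It is only a sketch and not the kernel source: the `Node` class, `list_add_tail`, and `checked_list_del` are hypothetical stand-ins, and it assumes the debug check warns and then unlinks the entry anyway, which is what the log continuing past the warning suggests.

```python
class Node:
    """Hypothetical stand-in for a kernel struct list_head."""
    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def name_of(node):
    # A NULL pointer is modelled as None and printed the way the kernel does.
    return getattr(node, "name", "(null)")


def list_add_tail(new, head):
    """Insert `new` just before `head` in a circular doubly linked list."""
    prev = head.prev
    new.prev, new.next = prev, head
    prev.next = new
    head.prev = new


def checked_list_del(entry):
    """Check both neighbours, collect warnings, then unlink regardless."""
    warnings = []
    if entry.prev.next is not entry:
        warnings.append(
            "list_del corruption. prev->next should be %s, but was %s"
            % (entry.name, name_of(entry.prev.next)))
    if entry.next.prev is not entry:
        warnings.append(
            "list_del corruption. next->prev should be %s, but was %s"
            % (entry.name, name_of(entry.next.prev)))
    # Assumption: the unlink proceeds even after a warning, which
    # overwrites the neighbours' pointers and hides the original damage.
    entry.next.prev = entry.prev
    entry.prev.next = entry.next
    entry.next = entry.prev = None
    return warnings


head, a, b, c = Node("head"), Node("a"), Node("b"), Node("c")
for n in (a, b, c):
    list_add_tail(n, head)

a.next = None                      # simulate an overwritten pointer
for line in checked_list_del(b):
    print(line)
```

In this model the unlink rewrites `a.next`, so the list looks consistent again afterwards even though the original overwrite was never explained. That is one way a later walk of the list can give a misleading picture.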
Issue Links
- is related to
  - LU-4330 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed (Reopened)
  - LU-7980 Overrun in generic <size-128> kmem_cache Slabs causing OSS to crash (Resolved)
Many thanks again Jay.
With this new data I still cannot conclude definitively about the exact content or cause of the corruption. This is mainly because several list corruptions were detected before the crash (and were more or less "corrected" by the unlink mechanism, which very likely created the loop in the "prev" linked list of MEs shown by "list -o 8 0xffff881d8d8bff40").
What I can say now is that it finally looks very similar to several occurrences I examined while tracking LU-7980. To find definitive proof of what I presume, can you provide the output of the "kmem ffff882029ab2578", "rd ffff881659b2c2c0 32", "rd ffff881d88425c40 32", and "rd ffff88161caf1040 32" crash sub-commands?
Also, while tracking LU-7980 I used the patch for LU-4330 to move LNet MEs/small-MDs out of the <size-128> slabs. Is that something you could also try at your site?
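As an illustration of what a loop in the "prev" chain means, here is a minimal Python sketch that follows "prev" pointers through a toy memory map and reports how the walk ends. The addresses, the `memory` layout, and the `walk_prev` helper are all hypothetical. It is not the crash utility's implementation, only a model of the idea behind walking a list by one pointer field.

```python
def walk_prev(memory, start):
    """Follow 'prev' pointers from `start`.

    Returns (kind, visited_order, stop_address) where kind is:
      "closed" - the walk returned to `start` (a healthy circular list)
      "loop"   - the walk re-entered a node other than `start`
      "broken" - the walk reached an address with no known node (e.g. NULL)
    """
    order = []
    visited = set()
    addr = start
    while True:
        if addr in visited:
            kind = "closed" if addr == start else "loop"
            return kind, order, addr
        if addr not in memory:
            return "broken", order, addr
        visited.add(addr)
        order.append(addr)
        addr = memory[addr]["prev"]


# Hypothetical four-node circular list: each node's prev points one step back.
healthy = {
    0x10: {"prev": 0x40},
    0x40: {"prev": 0x30},
    0x30: {"prev": 0x20},
    0x20: {"prev": 0x10},
}

# Same list after one prev pointer has been redirected into the middle.
corrupted = dict(healthy)
corrupted[0x20] = {"prev": 0x40}

for label, mem in (("healthy", healthy), ("corrupted", corrupted)):
    kind, order, stop = walk_prev(mem, 0x10)
    print(label, kind, [hex(a) for a in order], hex(stop))
```

A walk that never returns to its starting node but keeps revisiting an interior node is the signature described above. On its own it does not say which pointer was overwritten first, which is why the raw memory dumps are still needed.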