[LU-10103] LBUG: lib-move.c:2121:lnet_send()) ASSERTION( msg->msg_txpeer == ((void *)0) ) failed
Created: 06/Oct/17  Updated: 23/Mar/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: soak
Environment: Soak test cluster
Severity: 3

 Description   

Testing https://review.whamcloud.com/29341 (a revert of the LU-9810 patch, to determine whether preferring
Fast Reg breaks mounting targets).
The system mounts fine (LU-10068), but after a few hours the routers hit an LBUG:

Oct  5 16:25:31 soak-14 kernel: LNet: 2153:0:(o2iblnd_modparams.c:253:kiblnd_tunables_setup()) Invalid map_on_demand (0), expects 1 - 256. Using default of 256
Oct  5 16:25:31 soak-14 kernel: LNet: Using FMR for registration
Oct  5 16:25:31 soak-14 kernel: LNetError: 4:0:(o2iblnd_cb.c:2304:kiblnd_passive_connect()) Can't accept conn from 192.168.1.121@o2ib on NA (ib1:0:192.168.1.114): bad dst nid 192.168.1.114@o2ib
Oct  5 16:25:31 soak-14 kernel: LNet: Added LNI 192.168.1.114@o2ib [8/256/0/180]
Oct  5 16:25:31 soak-14 kernel: LNet: Added LNI 172.16.1.14@o2ib1 [128/2048/0/180]
Oct  5 16:25:31 soak-14 sshd[2130]: Received disconnect from 10.10.1.116 port 38944:11: disconnected by user
Oct  5 16:25:31 soak-14 sshd[2130]: Disconnected from 10.10.1.116 port 38944
Oct  5 16:25:31 soak-14 sshd[2130]: pam_unix(sshd:session): session closed for user root
Oct  5 16:25:31 soak-14 systemd-logind: Removed session 4.
Oct  5 16:25:31 soak-14 systemd: Removed slice User Slice of root.
Oct  5 16:25:31 soak-14 systemd: Stopping User Slice of root.
Oct  5 16:37:04 soak-14 kernel: LNetError: 1979:0:(lib-move.c:2121:lnet_send()) ASSERTION( msg->msg_txpeer == ((void *)0) ) failed:
Oct  5 16:37:04 soak-14 kernel: LNetError: 1979:0:(lib-move.c:2121:lnet_send()) LBUG
Oct  5 16:37:04 soak-14 kernel: Pid: 1979, comm: lnet_discovery
Oct  5 16:37:05 soak-14 kernel: Call Trace:
Oct  5 16:37:05 soak-14 kernel: [<ffffffffc09ec7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Oct  5 16:37:05 soak-14 kernel: [<ffffffffc09ec83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a7179e>] lnet_send+0x17e/0x180 [lnet]
Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a80ef8>] lnet_peer_discovery_complete+0x178/0x320 [lnet]
Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a868a8>] lnet_peer_discovery+0x588/0x1030 [lnet]
Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a86320>] ? lnet_peer_discovery+0x0/0x1030 [lnet]
Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct  5 16:37:05 soak-14 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct  5 16:37:05 soak-14 kernel:
Oct  5 16:37:05 soak-14 kernel: Kernel panic - not syncing: LBUG


 Comments   
Comment by Gerrit Updater [ 02/Nov/18 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33561
Subject: LU-10103 lnet: ensure txpeer = NULL when sending
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 746d4c0c11a831acf7f32f7b445ac44b44237597
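
For readers unfamiliar with the assertion, the invariant named in the patch subject can be illustrated with a small stand-alone sketch. This is not the content of the patch above and not LNet code: `demo_msg`, `demo_peer`, `demo_send` and `demo_prepare_resend` are hypothetical names, and the sketch assumes only what the log shows, namely that a message entering the send path is expected to have no tx peer attached. Where the real `lnet_send()` asserts and LBUGs, the sketch returns an error so the failure mode can be observed safely.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real LNet structures. */
struct demo_peer {
    int id;
};

struct demo_msg {
    struct demo_peer *txpeer; /* non-NULL while a send owns the message */
    int sends;                /* number of successful send attempts */
};

/*
 * Mirrors the shape of the failed check: a message handed to the send
 * path must not already carry a tx peer. Returns 0 on success and -1
 * where the real code would hit the assertion.
 */
static int demo_send(struct demo_msg *msg, struct demo_peer *peer)
{
    if (msg->txpeer != NULL)
        return -1;
    msg->txpeer = peer;
    msg->sends++;
    return 0;
}

/*
 * Assumed remedy for the sketch: drop stale tx state before the same
 * message is handed to the send path a second time.
 */
static void demo_prepare_resend(struct demo_msg *msg)
{
    msg->txpeer = NULL;
}
```

In this model, calling `demo_send()` twice on the same message fails the second time unless `demo_prepare_resend()` runs in between, which is the kind of stale state the assertion in the traces appears to be catching.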

Comment by Stephane Thiell [ 23/Mar/19 ]

Same issue with 2.12.0 + patch "LU-12065 lnd: increase CQ entries"

This happened on only one of our 12 LNet routers, which we upgraded today in a rolling fashion to include the patch from LU-12065. No big deal, I guess, and it looks like a patch is ready but hasn't landed yet.

Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: ------------[ cut here ]------------     
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: WARNING: CPU: 4 PID: 87771 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: list_del corruption. next->prev should be ffff8fa9f6c59c10, but was ffff8fa9f5a1a1a0
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dell_rbu sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper mxm_wmi iTCO_wdt iTCO_vendor_support cryptd dcdbas cdc_ether usbnet mii mgag200 i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks pcspkr sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter mei_me mei lpc_ich sunrpc ip_tables xfs libcrc32c mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: ahci mlx4_core(OE) libahci tg3 mlx_compat(OE) megaraid_sas ptp libata devlink pps_core
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: CPU: 4 PID: 87771 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7.x86_64 #1
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.8.0 005/17/2018
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Call Trace:                              
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84b62e41>] dump_stack+0x19/0x1b
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84497688>] __warn+0xd8/0x100   
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff8449770f>] warn_slowpath_fmt+0x5f/0x80
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08bc0e4>] ? lnet_ni_send+0x44/0xd0 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84795112>] __list_del_entry+0x82/0xd0
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08d5352>] lnet_peer_discovery_complete+0x1a2/0x340 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08da0a0>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08d99e0>] ? lnet_peer_merge_data+0xde0/0xde0 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1c71>] kthread+0xd1/0xe0   
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84b75c37>] ret_from_fork_nospec_begin+0x21/0x21
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: ---[ end trace d6bf07925ff146d5 ]---     
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: LNetError: 87771:0:(lib-move.c:2645:lnet_send()) ASSERTION( msg->msg_txpeer == ((void *)0) ) failed: 
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: LNetError: 87771:0:(lib-move.c:2645:lnet_send()) LBUG
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Pid: 87771, comm: lnet_discovery 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Call Trace:                              
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08217cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc082187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08c3ec8>] lnet_send+0x1b8/0x1c0 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08d5328>] lnet_peer_discovery_complete+0x178/0x340 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08da0a0>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1c71>] kthread+0xd1/0xe0   
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84b75c37>] ret_from_fork_nospec_end+0x0/0x39
Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff  