LBUG: lib-move.c:2121:lnet_send()) ASSERTION( msg->msg_txpeer == ((void *)0) ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.10.2
    • Soak test cluster
    • 3

    Description

      Testing https://review.whamcloud.com/29341 (revert of the LU-9810 patch, to determine whether preferring
      Fast Reg breaks mounting targets).
      The system mounts fine (LU-10068), but after a few hours the routers hit an LBUG:

      Oct  5 16:25:31 soak-14 kernel: LNet: 2153:0:(o2iblnd_modparams.c:253:kiblnd_tunables_setup()) Invalid map_on_demand (0), expects 1 - 256. Using default of 256
      Oct  5 16:25:31 soak-14 kernel: LNet: Using FMR for registration
      Oct  5 16:25:31 soak-14 kernel: LNetError: 4:0:(o2iblnd_cb.c:2304:kiblnd_passive_connect()) Can't accept conn from 192.168.1.121@o2ib on NA (ib1:0:192.168.1.114): bad dst nid 192.168.1.114@o2ib
      Oct  5 16:25:31 soak-14 kernel: LNet: Added LNI 192.168.1.114@o2ib [8/256/0/180]
      Oct  5 16:25:31 soak-14 kernel: LNet: Added LNI 172.16.1.14@o2ib1 [128/2048/0/180]
      Oct  5 16:25:31 soak-14 sshd[2130]: Received disconnect from 10.10.1.116 port 38944:11: disconnected by user
      Oct  5 16:25:31 soak-14 sshd[2130]: Disconnected from 10.10.1.116 port 38944
      Oct  5 16:25:31 soak-14 sshd[2130]: pam_unix(sshd:session): session closed for user root
      Oct  5 16:25:31 soak-14 systemd-logind: Removed session 4.
      Oct  5 16:25:31 soak-14 systemd: Removed slice User Slice of root.
      Oct  5 16:25:31 soak-14 systemd: Stopping User Slice of root.
      Oct  5 16:37:04 soak-14 kernel: LNetError: 1979:0:(lib-move.c:2121:lnet_send()) ASSERTION( msg->msg_txpeer == ((void *)0) ) failed:
      Oct  5 16:37:04 soak-14 kernel: LNetError: 1979:0:(lib-move.c:2121:lnet_send()) LBUG
      Oct  5 16:37:04 soak-14 kernel: Pid: 1979, comm: lnet_discovery
      Oct  5 16:37:05 soak-14 kernel: #012Call Trace:
      Oct  5 16:37:05 soak-14 kernel: [<ffffffffc09ec7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      Oct  5 16:37:05 soak-14 kernel: [<ffffffffc09ec83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a7179e>] lnet_send+0x17e/0x180 [lnet]
      Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a80ef8>] lnet_peer_discovery_complete+0x178/0x320 [lnet]
      Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a868a8>] lnet_peer_discovery+0x588/0x1030 [lnet]
      Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
      Oct  5 16:37:05 soak-14 kernel: [<ffffffffc0a86320>] ? lnet_peer_discovery+0x0/0x1030 [lnet]
      Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct  5 16:37:05 soak-14 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Oct  5 16:37:05 soak-14 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct  5 16:37:05 soak-14 kernel:
      Oct  5 16:37:05 soak-14 kernel: Kernel panic - not syncing: LBUG
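
      When comparing crash reports like the one above, it can help to pull the assertion site out of the console log mechanically. The following is a minimal sketch and not part of Lustre or its tooling: the regular expression and the `find_assertions` helper are hypothetical, and the pattern is inferred only from the line format visible in this log.

```python
import re
from typing import List, NamedTuple

# Pattern inferred from the log lines in this report (an assumption, not a
# documented Lustre log format):
#   LNetError: <pid>:<n>:(<file>:<line>:<func>()) ASSERTION( <expr> ) failed
ASSERT_RE = re.compile(
    r"LNetError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"ASSERTION\( (?P<expr>.*?) \) failed"
)


class AssertionSite(NamedTuple):
    pid: int
    file: str
    line: int
    func: str
    expr: str


def find_assertions(log_text: str) -> List[AssertionSite]:
    """Return every failed-assertion site found in a syslog excerpt."""
    sites = []
    for m in ASSERT_RE.finditer(log_text):
        sites.append(
            AssertionSite(
                pid=int(m.group("pid")),
                file=m.group("file"),
                line=int(m.group("line")),
                func=m.group("func"),
                expr=m.group("expr"),
            )
        )
    return sites


# One line copied from the log in this report.
sample = (
    "Oct  5 16:37:04 soak-14 kernel: LNetError: 1979:0:"
    "(lib-move.c:2121:lnet_send()) "
    "ASSERTION( msg->msg_txpeer == ((void *)0) ) failed:"
)

for site in find_assertions(sample):
    print(site)
```

      Grouping the results by `func` and `expr` rather than by `line` makes it easier to recognize the same assertion across releases, since the line number in `lib-move.c` can differ between versions.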
      

        Activity

          Same issue with 2.12.0 + patch "LU-12065 lnd: increase CQ entries"

          This happened on only one of our 12 LNet routers, which we upgraded today in a rolling fashion to include the patch from LU-12065. No big deal, I guess, and it looks like a patch is ready but hasn't landed yet.

          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: ------------[ cut here ]------------     
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: WARNING: CPU: 4 PID: 87771 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: list_del corruption. next->prev should be ffff8fa9f6c59c10, but was ffff8fa9f5a1a1a0
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dell_rbu sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper mxm_wmi iTCO_wdt iTCO_vendor_support cryptd dcdbas cdc_ether usbnet mii mgag200 i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks pcspkr sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter mei_me mei lpc_ich sunrpc ip_tables xfs libcrc32c mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: ahci mlx4_core(OE) libahci tg3 mlx_compat(OE) megaraid_sas ptp libata devlink pps_core
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: CPU: 4 PID: 87771 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7.x86_64 #1
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.8.0 005/17/2018
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Call Trace:                              
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84b62e41>] dump_stack+0x19/0x1b
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84497688>] __warn+0xd8/0x100   
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff8449770f>] warn_slowpath_fmt+0x5f/0x80
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08bc0e4>] ? lnet_ni_send+0x44/0xd0 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84795112>] __list_del_entry+0x82/0xd0
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08d5352>] lnet_peer_discovery_complete+0x1a2/0x340 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08da0a0>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08d99e0>] ? lnet_peer_merge_data+0xde0/0xde0 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1c71>] kthread+0xd1/0xe0   
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84b75c37>] ret_from_fork_nospec_begin+0x21/0x21
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: ---[ end trace d6bf07925ff146d5 ]---     
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: LNetError: 87771:0:(lib-move.c:2645:lnet_send()) ASSERTION( msg->msg_txpeer == ((void *)0) ) failed: 
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: LNetError: 87771:0:(lib-move.c:2645:lnet_send()) LBUG
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Pid: 87771, comm: lnet_discovery 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: Call Trace:                              
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08217cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc082187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08c3ec8>] lnet_send+0x1b8/0x1c0 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08d5328>] lnet_peer_discovery_complete+0x178/0x340 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffc08da0a0>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff844c1c71>] kthread+0xd1/0xe0   
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffff84b75c37>] ret_from_fork_nospec_end+0x0/0x39
          Mar 22 17:29:15 sh-rtr-oak-1-1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff  
          
          sthiell Stephane Thiell added the comment above.

          Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33561
          Subject: LU-10103 lnet: ensure txpeer = NULL when sending
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 746d4c0c11a831acf7f32f7b445ac44b44237597

          gerrit Gerrit Updater added the comment above.

          People

            ashehata Amir Shehata (Inactive)
            cliffw Cliff White (Inactive)