[LU-7574] ASSERTION( crpc != ((void *)0) && rpc == crpc->crp_rpc ) failed Created: 17/Dec/15  Updated: 26/Jun/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jeremy Filizetti Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: selftest

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LBUG while using lst:

Lustre: Lustre: Build Version: 2.7.61-gb399b36-CHANGED-2.6.32-504.16.2.el6.x86_64
LNet: Added LNI 10.10.6.1@o2ib [54/1728/0/180]
LustreError: 4003:0:(brw_test.c:363:brw_server_rpc_done()) Bulk transfer to 12345-10.10.7.1@o2ib has failed: -108
LustreError: 4008:0:(brw_test.c:318:brw_client_done_rpc()) BRW RPC to 12345-10.10.7.5@o2ib failed with -4
LNet: 3969:0:(o2iblnd_cb.c:395:kiblnd_handle_rx()) PUT_NACK from 10.10.3.12@o2ib
LustreError: 4003:0:(brw_test.c:363:brw_server_rpc_done()) Skipped 1521 previous similar messages
LustreError: 4013:0:(conrpc.c:58:lstcon_rpc_done()) ASSERTION( crpc != ((void *)0) && rpc == crpc->crp_rpc ) failed: 
LustreError: 4013:0:(conrpc.c:58:lstcon_rpc_done()) LBUG
Pid: 4013, comm: lst_t_03_02

Call Trace:
 [<ffffffffa0695875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0695e77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0d18664>] lstcon_rpc_done+0x124/0x190 [lnet_selftest]
 [<ffffffffa0d28d85>] srpc_client_rpc_done+0xd5/0x300 [lnet_selftest]
 [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0d29708>] srpc_send_rpc+0x408/0x850 [lnet_selftest]
 [<ffffffffa0d1dfb3>] swi_wi_action+0x13/0x20 [lnet_selftest]
 [<ffffffffa06abcf2>] cfs_wi_scheduler+0x132/0x460 [libcfs]
 [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa06abbc0>] ? cfs_wi_scheduler+0x0/0x460 [libcfs]
 [<ffffffff8109e71e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e680>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1450358264.4013
------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P           ---------------   )
Hardware name: S5500WB
list_add corruption. prev->next should be next (ffff88063dfb4d90), but was ffff88064d7fee68. (prev=ffff88064d7fee68).
Modules linked in: lnet_selftest(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 4014, comm: st_timer Tainted: P           ---------------    2.6.32-504.16.2.el6.x86_64 #1
Call Trace:
 [<ffffffff81074e27>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81074f16>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffffa0d1e9ef>] ? sfw_create_rpc+0xef/0x320 [lnet_selftest]
 [<ffffffff8129ed7f>] ? __list_add+0x8f/0xa0
 [<ffffffffa0d16ecc>] ? lstcon_rpc_trans_addreq+0x2c/0x40 [lnet_selftest]
 [<ffffffffa0d183b6>] ? lstcon_rpc_pinger+0x3b6/0x540 [lnet_selftest]
 [<ffffffffa0d18540>] ? lstcon_rpc_done+0x0/0x190 [lnet_selftest]
 [<ffffffffa0d243ec>] ? stt_timer_main+0xcc/0x1d0 [lnet_selftest]
 [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0d24320>] ? stt_timer_main+0x0/0x1d0 [lnet_selftest]
 [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109e680>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
---[ end trace 3489aa37ec955e70 ]---
------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P        W  ---------------   )
Hardware name: S5500WB
list_add corruption. prev->next should be next (ffff88063dfb4d90), but was ffff88064d7fee68. (prev=ffff88064d7fee68).
Modules linked in: lnet_selftest(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 4014, comm: st_timer Tainted: P        W  ---------------    2.6.32-504.16.2.el6.x86_64 #1
Call Trace:
 [<ffffffff81074e27>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81074f16>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffffa0d1e9ef>] ? sfw_create_rpc+0xef/0x320 [lnet_selftest]
 [<ffffffff8129ed7f>] ? __list_add+0x8f/0xa0
 [<ffffffffa0d16ecc>] ? lstcon_rpc_trans_addreq+0x2c/0x40 [lnet_selftest]
 [<ffffffffa0d183b6>] ? lstcon_rpc_pinger+0x3b6/0x540 [lnet_selftest]
 [<ffffffffa0d18540>] ? lstcon_rpc_done+0x0/0x190 [lnet_selftest]
 [<ffffffffa0d243ec>] ? stt_timer_main+0xcc/0x1d0 [lnet_selftest]
 [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0d24320>] ? stt_timer_main+0x0/0x1d0 [lnet_selftest]
 [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109e680>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
---[ end trace 3489aa37ec955e71 ]---
LustreError: 4130:0:(conrpc.c:584:lstcon_rpc_trans_destroy()) ASSERTION( atomic_read(&trans->tas_remaining) == 0 ) failed: 
LustreError: 4130:0:(conrpc.c:584:lstcon_rpc_trans_destroy()) LBUG
Pid: 4130, comm: lst

Call Trace:
 [<ffffffffa0695875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0695e77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0d18fbd>] lstcon_rpc_trans_destroy+0x2fd/0x300 [lnet_selftest]
 [<ffffffffa0d19b9e>] lstcon_rpc_pinger_stop+0x4e/0xb0 [lnet_selftest]
 [<ffffffffa0d122e1>] lstcon_session_end+0x51/0x230 [lnet_selftest]
 [<ffffffffa0d1d6e5>] lstcon_ioctl_entry+0xbb5/0x1470 [lnet_selftest]
 [<ffffffffa069ee77>] libcfs_ioctl+0x127/0x830 [libcfs]
 [<ffffffff811a1458>] ? do_filp_open+0x798/0xd20
 [<ffffffffa069c262>] libcfs_ioctl+0xa2/0x1a0 [libcfs]
 [<ffffffff811a3a42>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a3be4>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a4161>] sys_ioctl+0x81/0xa0
 [<ffffffff810e5b6e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

LustreError: dumping log to /tmp/lustre-log.1450358326.4130



 Comments   
Comment by Andreas Dilger [ 17/Dec/15 ]

What steps did you take before this problem was hit? Is this just the basic lnet-selftest.sh running, or is something else? It looks like there was a communication failure with the peer during the test, was there a network problem, or the peer stopped during the test?

It shouldn't matter if Shared Key crypto was enabled, since LNet Selftest runs below the PtlRPC layer, but just in case was SK enabled during this test?

Comment by Jeremy Filizetti [ 18/Dec/15 ]

This is a just a script that runs LNet selftest with different parameters:

run_test()
{
        local test=$1
        local concurrency=$2
        local distribute=$3
        local delay=$4

        lst new_session run_test
        lst add_group server $SERVER
        lst add_group client $CLIENT
        lst add_batch main_test
        lst add_test  --batch main_test --from client --to server --distribute $distribute --concurrency $concurrency $test
        lst run main_test
        echo "Running test $test with concurrency:$concurrency distribute:$distribute delay:$delay"
        lst stat --delay $delay server &
        sleep $((delay + 2))
        kill $!
        lst end_session
}

Each test is stopped after 60 seconds, parameters changed and started back up. I've seen the LBUG several times now. This is unrelated to the shared key crypto.

Generated at Sat Feb 10 02:10:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.