Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.8.0
-
3
-
9223372036854775807
Description
LBUG while using lst:
Lustre: Lustre: Build Version: 2.7.61-gb399b36-CHANGED-2.6.32-504.16.2.el6.x86_64 LNet: Added LNI 10.10.6.1@o2ib [54/1728/0/180] LustreError: 4003:0:(brw_test.c:363:brw_server_rpc_done()) Bulk transfer to 12345-10.10.7.1@o2ib has failed: -108 LustreError: 4008:0:(brw_test.c:318:brw_client_done_rpc()) BRW RPC to 12345-10.10.7.5@o2ib failed with -4 LNet: 3969:0:(o2iblnd_cb.c:395:kiblnd_handle_rx()) PUT_NACK from 10.10.3.12@o2ib LustreError: 4003:0:(brw_test.c:363:brw_server_rpc_done()) Skipped 1521 previous similar messages LustreError: 4013:0:(conrpc.c:58:lstcon_rpc_done()) ASSERTION( crpc != ((void *)0) && rpc == crpc->crp_rpc ) failed: LustreError: 4013:0:(conrpc.c:58:lstcon_rpc_done()) LBUG Pid: 4013, comm: lst_t_03_02 Call Trace: [<ffffffffa0695875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa0695e77>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0d18664>] lstcon_rpc_done+0x124/0x190 [lnet_selftest] [<ffffffffa0d28d85>] srpc_client_rpc_done+0xd5/0x300 [lnet_selftest] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0 [<ffffffffa0d29708>] srpc_send_rpc+0x408/0x850 [lnet_selftest] [<ffffffffa0d1dfb3>] swi_wi_action+0x13/0x20 [lnet_selftest] [<ffffffffa06abcf2>] cfs_wi_scheduler+0x132/0x460 [libcfs] [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa06abbc0>] ? cfs_wi_scheduler+0x0/0x460 [libcfs] [<ffffffff8109e71e>] kthread+0x9e/0xc0 [<ffffffff8100c20a>] child_rip+0xa/0x20 [<ffffffff8109e680>] ? kthread+0x0/0xc0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 LustreError: dumping log to /tmp/lustre-log.1450358264.4013 ------------[ cut here ]------------ WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P --------------- ) Hardware name: S5500WB list_add corruption. prev->next should be next (ffff88063dfb4d90), but was ffff88064d7fee68. (prev=ffff88064d7fee68). Modules linked in: lnet_selftest(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 4014, comm: st_timer Tainted: P --------------- 2.6.32-504.16.2.el6.x86_64 #1 Call Trace: [<ffffffff81074e27>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff81074f16>] ? warn_slowpath_fmt+0x46/0x50 [<ffffffffa0d1e9ef>] ? sfw_create_rpc+0xef/0x320 [lnet_selftest] [<ffffffff8129ed7f>] ? __list_add+0x8f/0xa0 [<ffffffffa0d16ecc>] ? lstcon_rpc_trans_addreq+0x2c/0x40 [lnet_selftest] [<ffffffffa0d183b6>] ? lstcon_rpc_pinger+0x3b6/0x540 [lnet_selftest] [<ffffffffa0d18540>] ? lstcon_rpc_done+0x0/0x190 [lnet_selftest] [<ffffffffa0d243ec>] ? stt_timer_main+0xcc/0x1d0 [lnet_selftest] [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa0d24320>] ? stt_timer_main+0x0/0x1d0 [lnet_selftest] [<ffffffff8109e71e>] ? kthread+0x9e/0xc0 [<ffffffff8100c20a>] ? child_rip+0xa/0x20 [<ffffffff8109e680>] ? kthread+0x0/0xc0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 ---[ end trace 3489aa37ec955e70 ]--- ------------[ cut here ]------------ WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P W --------------- ) Hardware name: S5500WB list_add corruption. prev->next should be next (ffff88063dfb4d90), but was ffff88064d7fee68. (prev=ffff88064d7fee68). Modules linked in: lnet_selftest(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 4014, comm: st_timer Tainted: P W --------------- 2.6.32-504.16.2.el6.x86_64 #1 Call Trace: [<ffffffff81074e27>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff81074f16>] ? warn_slowpath_fmt+0x46/0x50 [<ffffffffa0d1e9ef>] ? sfw_create_rpc+0xef/0x320 [lnet_selftest] [<ffffffff8129ed7f>] ? __list_add+0x8f/0xa0 [<ffffffffa0d16ecc>] ? lstcon_rpc_trans_addreq+0x2c/0x40 [lnet_selftest] [<ffffffffa0d183b6>] ? lstcon_rpc_pinger+0x3b6/0x540 [lnet_selftest] [<ffffffffa0d18540>] ? lstcon_rpc_done+0x0/0x190 [lnet_selftest] [<ffffffffa0d243ec>] ? stt_timer_main+0xcc/0x1d0 [lnet_selftest] [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa0d24320>] ? stt_timer_main+0x0/0x1d0 [lnet_selftest] [<ffffffff8109e71e>] ? kthread+0x9e/0xc0 [<ffffffff8100c20a>] ? child_rip+0xa/0x20 [<ffffffff8109e680>] ? kthread+0x0/0xc0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 ---[ end trace 3489aa37ec955e71 ]--- LustreError: 4130:0:(conrpc.c:584:lstcon_rpc_trans_destroy()) ASSERTION( atomic_read(&trans->tas_remaining) == 0 ) failed: LustreError: 4130:0:(conrpc.c:584:lstcon_rpc_trans_destroy()) LBUG Pid: 4130, comm: lst Call Trace: [<ffffffffa0695875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa0695e77>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0d18fbd>] lstcon_rpc_trans_destroy+0x2fd/0x300 [lnet_selftest] [<ffffffffa0d19b9e>] lstcon_rpc_pinger_stop+0x4e/0xb0 [lnet_selftest] [<ffffffffa0d122e1>] lstcon_session_end+0x51/0x230 [lnet_selftest] [<ffffffffa0d1d6e5>] lstcon_ioctl_entry+0xbb5/0x1470 [lnet_selftest] [<ffffffffa069ee77>] libcfs_ioctl+0x127/0x830 [libcfs] [<ffffffff811a1458>] ? do_filp_open+0x798/0xd20 [<ffffffffa069c262>] libcfs_ioctl+0xa2/0x1a0 [libcfs] [<ffffffff811a3a42>] vfs_ioctl+0x22/0xa0 [<ffffffff811a3be4>] do_vfs_ioctl+0x84/0x580 [<ffffffff811a4161>] sys_ioctl+0x81/0xa0 [<ffffffff810e5b6e>] ? __audit_syscall_exit+0x25e/0x290 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b LustreError: dumping log to /tmp/lustre-log.1450358326.4130
This is a just a script that runs LNet selftest with different parameters:
run_test() { local test=$1 local concurrency=$2 local distribute=$3 local delay=$4 lst new_session run_test lst add_group server $SERVER lst add_group client $CLIENT lst add_batch main_test lst add_test --batch main_test --from client --to server --distribute $distribute --concurrency $concurrency $test lst run main_test echo "Running test $test with concurrency:$concurrency distribute:$distribute delay:$delay" lst stat --delay $delay server & sleep $((delay + 2)) kill $! lst end_session }
Each test is stopped after 60 seconds, parameters changed and started back up. I've seen the LBUG several times now. This is unrelated to the shared key crypto.