Details
-
Bug
-
Resolution: Unresolved
-
Medium
-
Lustre 2.17.0
-
3
Description
The following scenario results in an LBUG:
- A net, for example tcp999, is about to be added.
- Before tcp999 is configured locally, a send is attempted to a peer on @tcp999.
- lnet_peerni_by_nid_locked() allocates a peer_ni with lpni->lpni_net == NULL and puts it on ln_remote_peer_ni_list. In this case lpni_txcredits starts at 0.
- The send attempt consumes a peer-tx credit: lpni_txcredits-- goes negative and the message is put on lpni_txq.
- Later the net/LNI is added, triggering lnet_peer_net_added(). lpni_txcredits is set to the tunable value (e.g. 8), but lpni_txq is not drained.
- Now lpni_txcredits = 8 and lpni_txq is non-empty.
- The next attempt to take a peer-tx credit hits the assertion in lnet_post_send_locked() and LBUGs:
[16931.136942] LNetError: 898962:0:(lib-move.c:884:lnet_post_send_locked()) ASSERTION( (lp->lpni_txcredits < 0) == !list_empty(&lp->lpni_txq) ) failed:
[16931.139496] LNetError: 898962:0:(lib-move.c:884:lnet_post_send_locked()) LBUG
[16931.140806] CPU: 1 PID: 898962 Comm: mdt00_001 Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.89.1.el8_lustre.x86_64 #1
[16931.143021] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[16931.144061] Call Trace:
[16931.144602] dump_stack+0x41/0x60
[16931.145299] lbug_with_loc.cold.6+0x5/0x43 [libcfs]
[16931.146262] lnet_post_send_locked+0x6b9/0x780 [lnet]
[16931.147389] lnet_handle_send+0x2a2/0x650 [lnet]
[16931.148287] lnet_select_pathway+0x629/0x1a90 [lnet]
[16931.149244] ? cfs_trace_unlock_tcd+0x20/0x70 [libcfs]
[16931.150201] ? libcfs_debug_msg+0x907/0xc00 [libcfs]
[16931.151126] lnet_send+0x6d/0x1e0 [lnet]
[16931.151898] LNetPut+0x2f8/0x950 [lnet]
[16931.152662] ptl_send_buf+0x132/0x540 [ptlrpc]
[16931.154072] ? ptlrpc_ni_fini+0x60/0x60 [ptlrpc]
[16931.155019] ptlrpc_send_reply+0x2f5/0x8d0 [ptlrpc]
[16931.155997] target_send_reply+0x328/0x7a0 [ptlrpc]
[16931.157017] tgt_request_handle+0x454/0x1d20 [ptlrpc]
[16931.158076] ptlrpc_server_handle_request+0x2ca/0xd70 [ptlrpc]
[16931.159259] ? lprocfs_counter_add+0x117/0x180 [obdclass]
[16931.160557] ptlrpc_main+0xb98/0x1460 [ptlrpc]
[16931.161483] ? __schedule+0x2d9/0x870
[16931.162196] ? ptlrpc_wait_event+0x5b0/0x5b0 [ptlrpc]
[16931.163218] kthread+0x134/0x150
[16931.163879] ? set_kthread_struct+0x50/0x50
[16931.164670] ret_from_fork+0x35/0x40
[16931.165397] Kernel panic - not syncing: LBUG
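
The credit accounting in the scenario above can be replayed with a small user-space model. This is only a sketch, not LNet code: struct toy_peer_ni and every toy_* function are hypothetical stand-ins, the queue is reduced to a counter, and the tunable value 8 is taken from the example in the description. It illustrates how resetting the credits without draining the queue leaves the asserted condition false.

```c
#include <stdbool.h>

/* Hypothetical toy model; NOT the real struct lnet_peer_ni. */
struct toy_peer_ni {
	int  txcredits;  /* stands in for lpni_txcredits */
	int  txq_len;    /* stands in for the number of messages on lpni_txq */
	bool has_net;    /* stands in for lpni_net != NULL */
};

/* Same shape as the asserted condition:
 * (txcredits < 0) == !list_empty(&txq) */
static bool toy_invariant_holds(const struct toy_peer_ni *lp)
{
	return (lp->txcredits < 0) == (lp->txq_len > 0);
}

/* Take one peer-tx credit; queue the message if credits go negative. */
static void toy_take_credit(struct toy_peer_ni *lp)
{
	lp->txcredits--;
	if (lp->txcredits < 0)
		lp->txq_len++;
}

/* Models the behaviour described in the ticket for the net-added path:
 * credits are reset to the tunable value, the queue is left untouched. */
static void toy_net_added(struct toy_peer_ni *lp, int tunable_credits)
{
	lp->has_net = true;
	lp->txcredits = tunable_credits;
}

/* Replay the scenario and return the resulting state. */
static struct toy_peer_ni toy_replay_scenario(void)
{
	/* peer_ni created before the net exists: no net, 0 credits */
	struct toy_peer_ni lp = { .txcredits = 0, .txq_len = 0, .has_net = false };

	toy_take_credit(&lp);   /* credits -> -1, one message queued */
	toy_net_added(&lp, 8);  /* credits -> 8, queue still non-empty */
	return lp;
}
```

Calling toy_invariant_holds() on the state returned by toy_replay_scenario() yields false, which corresponds to the state in which the next credit grab would trip the assertion. In this model, draining the queue (or otherwise reconciling it with the credits) when the net is added would keep the condition true; whether that is the right fix in LNet itself is outside what the sketch can show.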