[LU-14627] Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: Created: 20/Apr/21 Updated: 30/May/22 Resolved: 14/Jun/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.12.7, Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I am fairly certain this bug can also result in this assert:

LNetError: 9873:0:(peer.c:305:lnet_destroy_peer_locked()) ASSERTION( lp->lp_rtr_refcount == 0 ) failed:

We saw this bug with HPE's Lustre 2.12, which differs quite a bit from the community 2.12 (mainly because we backported the multi-rail routing feature). I'm pretty sure the bug is present in community 2.14+, and it may be present in older versions.

TL;DR: a flaw in the discovery logic results in a lost reference on an lnet_peer object, which causes it to be improperly destroyed.

The MDS sends an OST_CONNECT RPC to 10.249.249.206@o2ib. The message is queued for discovery, which adds a reference to the lnet_peer object:

00000100:00000200:9.0:1618858502.715082:0:51842:0:(niobuf.c:85:ptl_send_buf()) Sending 520 bytes to portal 28, xid 1697249298880320, offset 0
00000400:00000200:9.0:1618858502.715083:0:51842:0:(lib-move.c:4905:LNetPut()) LNetPut -> 12345-10.249.249.206@o2ib
00000400:00000200:9.0:1618858502.715086:0:51842:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: 0 <<<< Reference added
00000400:00000200:9.0:1618858502.715086:0:51842:0:(peer.c:2250:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:9.0:1618858502.715087:0:51842:0:(peer.c:2291:lnet_discover_peer_locked()) non-blocking discovery
00000400:00000200:9.0:1618858502.715088:0:51842:0:(peer.c:2298:lnet_discover_peer_locked()) peer 10.249.249.206@o2ib NID 10.249.249.206@o2ib: 0. pending discovery
00000400:00000200:9.0:1618858502.715089:0:51842:0:(lib-move.c:2050:lnet_initiate_peer_discovery()) msg ffff90cd4773d5f0 delayed. 10.249.249.206@o2ib pending discovery

The discovery thread wakes and begins processing the peer by sending a discovery ping (NB: the ping is sent to 10.249.249.207@o2ib).
00000400:00000200:22.0:1618858502.715088:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:22.0:1618858502.715091:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6171
00000400:00000010:22.0:1618858502.715093:0:43692:0:(api-ni.c:1671:lnet_ping_buffer_alloc()) alloc '(pbuf)': 281 at ffff90ddf2942000 (tot 485268226).
00000400:00000010:22.0:1618858502.715093:0:43692:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff90cd54ed27f8 <<< PING MD.
00000400:00000010:22.0:1618858502.715095:0:43692:0:(lib-lnet.h:535:lnet_rspt_alloc()) rspt alloc ffff90ddcde80360
...
00000400:00000200:22.0:1618858502.715114:0:43692:0:(lib-move.c:1897:lnet_handle_send()) TRACE: 10.249.248.3@o2ib(10.249.248.3@o2ib:<?>) -> 10.249.249.207@o2ib(10.249.249.206@o2ib:10.249.249.207@o2ib) <?> : GET try# 0
...
00000400:00000200:22.0:1618858502.715149:0:43692:0:(peer.c:3107:lnet_peer_send_ping()) peer 10.249.249.206@o2ib
00000400:00000200:22.0:1618858502.715150:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371 rc 0

*hornc@cflosbld08 hornc $ lpst2str.sh 0x4371
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE LNET_PEER_PING_SENT LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

With the ping sent, discovery is waiting for either a response or a timeout. Meanwhile, an incoming discovery push from 10.249.249.206@o2ib puts the peer back on the discovery queue:

00000400:00000200:27.0:1618858538.614280:0:43825:0:(lib-move.c:4396:lnet_parse()) TRACE: 10.249.248.2@o2ib(10.249.248.2@o2ib) <- 10.249.249.206@o2ib : PUT - for me
...
00000400:00000200:27.0:1618858538.614298:0:43825:0:(peer.c:2094:lnet_peer_push_event()) peer 10.249.249.206@o2ib(ffff90de2632b800) is MR
00000400:00000200:27.0:1618858538.614301:0:43825:0:(peer.c:2171:lnet_peer_push_event()) Received Push 10.249.249.206@o2ib 3
00000400:00000200:27.0:1618858538.614302:0:43825:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: -114 <<<< Peer already queued; no reference added

The discovery thread wakes and begins processing the peer. It first merges the data in the push buffer:

*hornc@cflosbld08 hornc $ lpst2str.sh 0x42f1
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_DATA_PRESENT LNET_PEER_PING_SENT LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

00000400:00000200:22.0:1618858538.614308:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:22.0:1618858538.614316:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x42f1
00000400:00000200:22.0:1618858538.614326:0:43692:0:(peer.c:2776:lnet_peer_merge_data()) peer 10.249.249.206@o2ib (ffff90de2632b800): 0
00000400:00000200:22.0:1618858538.614327:0:43692:0:(peer.c:2985:lnet_peer_data_present()) peer 10.249.249.206@o2ib(ffff90de2632b800): 0. state = 0x4371
00000400:00000200:22.0:1618858538.614328:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371 rc 1

Next, it sends a discovery PUSH to the peer (NB: the push is sent to 10.249.249.207@o2ib):

*hornc@cflosbld08 hornc $ lpst2str.sh 0x4371
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE LNET_PEER_PING_SENT LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

00000400:00000200:22.0:1618858538.614329:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371
00000400:00000010:22.0:1618858538.614330:0:43692:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff90cd54ed3430 <<< PUSH MD.
00000400:00000010:22.0:1618858538.614333:0:43692:0:(lib-lnet.h:535:lnet_rspt_alloc()) rspt alloc ffff90ddcde803f0
...
00000400:00000200:22.0:1618858538.614351:0:43692:0:(lib-move.c:1897:lnet_handle_send()) TRACE: 10.249.248.2@o2ib(10.249.248.2@o2ib:<?>) -> 10.249.249.207@o2ib(10.249.249.206@o2ib:10.249.249.207@o2ib) <?> : PUT try# 0
...
00000400:00000200:22.0:1618858538.614463:0:43692:0:(peer.c:3241:lnet_peer_send_push()) peer 10.249.249.206@o2ib
00000400:00000200:22.0:1618858538.614463:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x771 rc 0

*hornc@cflosbld08 hornc $ lpst2str.sh 0x771
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE LNET_PEER_PING_SENT LNET_PEER_PUSH_SENT
*hornc@cflosbld08 hornc $

This is the first potential problem. I am not sure whether it is a valid state for the peer to have both PING_SENT and PUSH_SENT set at the same time. I suspect it is not.

The discovery ping eventually fails because o2iblnd cannot establish a connection with .207.
lnet_discovery_event_handler() sets the PING_FAILED state (via lnet_discovery_event_send()) and puts the peer back on the discovery queue:

00000800:00000100:29.0:1618858552.405824:0:147651:0:(o2iblnd_cb.c:3183:kiblnd_cm_callback()) 10.249.249.207@o2ib: ADDR ERROR -110
00000800:00000200:29.0:1618858552.405825:0:147651:0:(o2iblnd.c:419:kiblnd_unlink_peer_locked()) peer_ni[ffff90cdedce3e80] -> 10.249.249.207@o2ib (2)--
00000400:00000200:29.0:1618858552.405830:0:147651:0:(router.c:1720:lnet_notify()) 10.249.248.3@o2ib notifying 10.249.249.207@o2ib: down
00000800:00000100:29.0:1618858552.405833:0:147651:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.249.249.207@o2ib: connection failed
00000400:00000200:29.0:1618858552.405834:0:147651:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -113, hstatus = 2
00000400:00000200:29.0:1618858552.405835:0:147651:0:(lib-msg.c:860:lnet_health_check()) health check: 10.249.248.3@o2ib->10.249.249.207@o2ib: GET: LOCAL_DROPPED
00000400:00000200:29.0:1618858552.405836:0:147651:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.249.248.3@o2ib added to recovery queue. Health = 900
00000400:00000100:29.0:1618858552.405837:0:147651:0:(lib-msg.c:710:lnet_attempt_msg_resend()) msg 0@<0:0>->10.249.249.207@o2ib exceeded retry count 2
00000400:00000200:29.0:1618858552.405838:0:147651:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:29.0:1618858552.405839:0:147651:0:(peer.c:2508:lnet_discovery_event_send()) Ping Send to 10.249.249.206@o2ib: 1 <<<< Clears PING_SENT, sets PING_FAILED

Here we see a second problem. Note the "GET: LOCAL_DROPPED" from lnet_health_check() and the subsequent call to lnet_handle_local_failure(). The LOCAL_DROPPED health status only lowers the LNet health value of the local NI, even though the problem could be with the remote interface.
This is why LNet repeatedly tries to send to .207 even though that interface is not functioning properly. The good news is that this is a known issue with a fix; the ticket for it is https://jira.whamcloud.com/browse/LU-13571.

The discovery thread wakes and processes the ping failure:

*hornc@cflosbld08 hornc $ lpst2str.sh 0x6d71
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE LNET_PEER_PUSH_SENT LNET_PEER_PING_FAILED LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

00000400:00000200:21.0:1618858552.405845:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:21.0:1618858552.405848:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6d71

To process the ping failure, lnet_peer_ping_failed() is called and the lp_ping_mdh is unlinked. The unlink handler sees LNET_PEER_PUSH_SENT, so it thinks the PUSH is what was unlinked:

00000400:00000200:21.0:1618858552.405850:0:43692:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:21.0:1618858552.405851:0:43692:0:(peer.c:2536:lnet_discovery_event_unlink()) Push Unlink for message to peer 10.249.249.206@o2ib <<<< Clears PUSH_SENT, sets PUSH_FAILED
00000400:00000200:21.0:1618858552.405853:0:43692:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff90cd54ed27f8 <<< PING MD
00000400:00000200:21.0:1618858552.405855:0:43692:0:(peer.c:3018:lnet_peer_ping_failed()) peer 10.249.249.206@o2ib:-113
00000400:00000200:21.0:1618858552.405856:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7171 rc -113
00000400:00000200:21.0:1618858552.405858:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -113 <<<< Clears LNET_PEER_DISCOVERING, state is 0x7131
00000400:00000200:21.0:1618858552.405859:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib <<<< Reference dropped

*hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_NIDS_UPTODATE LNET_PEER_PUSH_FAILED LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

This is another potential problem. I do not know whether it is valid to have the PUSH_FAILED state set when discovery completes. Similarly, it seems wrong to have FORCE_PING/FORCE_PUSH set when discovery completes.

As part of lnet_peer_discovery_complete(), any messages that were queued for discovery are sent/finalized (this includes the OST_CONNECT that started all of this off):

00000100:00000200:21.0:1618858552.405863:0:43692:0:(events.c:63:request_out_callback()) @@@ type 5, status -113 req@ffff90bde43e5a00 x1697249298880320/t0(0) o8->scratch-OST00bc-osc-MDT0000@10.249.249.206@o2ib:28/4 lens 520/544 e 0 to 0 dl 1618858552 ref 2 fl Rpc:N/0/ffffffff rc 0/-1 job:''

An MGS_CONNECT reply results in the peer being queued for discovery again, in the same manner as the earlier OST_CONNECT:

00010000:00000200:19.0:1618858552.405884:0:136340:0:(ldlm_lib.c:2986:target_send_reply_msg()) @@@ sending reply req@ffff90cdf3d8a850 x1697492294959168/t0(0) o250->69523420-f6a9-9da1-ae47-897c569694e9@10.249.249.206@o2ib:0/0 lens 520/416 e 0 to 0 dl 1618858838 ref 1 fl Interpret:/0/0 rc 0/0 job:''
00000400:00000010:19.0:1618858552.405888:0:136340:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff90cdf451caa0.
00000100:00000200:19.0:1618858552.405888:0:136340:0:(niobuf.c:85:ptl_send_buf()) Sending 416 bytes to portal 25, xid 1697492294959168, offset 0
00000400:00000200:19.0:1618858552.405890:0:136340:0:(lib-move.c:4905:LNetPut()) LNetPut -> 12345-10.249.249.206@o2ib
00000400:00000200:19.0:1618858552.405900:0:136340:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: 0 <<<< Reference added
00000400:00000200:19.0:1618858552.405900:0:136340:0:(peer.c:2250:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:19.0:1618858552.405901:0:136340:0:(peer.c:2291:lnet_discover_peer_locked()) non-blocking discovery
00000400:00000200:19.0:1618858552.405902:0:136340:0:(peer.c:2298:lnet_discover_peer_locked()) peer 10.249.249.206@o2ib NID 10.249.249.206@o2ib: 0. pending discovery
00000400:00000200:19.0:1618858552.405903:0:136340:0:(lib-move.c:2050:lnet_initiate_peer_discovery()) msg ffff90de1d222940 delayed. 10.249.249.206@o2ib pending discovery

*hornc@cflosbld08 hornc $ lpst2str.sh 0x7171
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE LNET_PEER_PUSH_FAILED LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

00000400:00000200:21.0:1618858552.405902:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:21.0:1618858552.405906:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7171
00000400:00000200:21.0:1618858552.405906:0:43692:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff90cd54ed3430
00000400:00000200:21.0:1618858552.405907:0:43692:0:(peer.c:3146:lnet_peer_push_failed()) peer 10.249.249.206@o2ib
00000400:00000200:21.0:1618858552.405908:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6171 rc -110
00000400:00000200:21.0:1618858552.405913:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -110 <<<< Clears LNET_PEER_DISCOVERING, state is 0x6131
00000400:00000200:21.0:1618858552.405915:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib

*hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_NIDS_UPTODATE LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

The monitor thread attempts to resend the PUSH:

00000400:00000200:11.0:1618858555.397794:0:43693:0:(lib-move.c:3156:lnet_resend_pending_msgs_locked()) resending <?>->12345-10.249.249.207@o2ib: PUT recovery 0 try# 1
00000400:00000200:11.0:1618858555.397796:0:43693:0:(lib-move.c:2677:lnet_handle_send_case_locked()) Source ANY to MR: 10.249.249.207@o2ib local destination
00000400:00000200:11.0:1618858555.397798:0:43693:0:(lib-move.c:1659:lnet_get_best_ni()) compare ni 10.249.248.2@o2ib [c:509, d:10, s:12279791] with best_ni not seleced [c:-2147483648, d:-1, s:0]
00000400:00000200:11.0:1618858555.397800:0:43693:0:(lib-move.c:1659:lnet_get_best_ni()) compare ni 10.249.248.3@o2ib [c:509, d:12, s:7451] with best_ni 10.249.248.2@o2ib [c:509, d:10, s:12279791]
00000400:00000200:11.0:1618858555.397801:0:43693:0:(lib-move.c:1702:lnet_get_best_ni()) selected best_ni 10.249.248.2@o2ib
00000400:00000200:11.0:1618858555.397803:0:43693:0:(lib-move.c:1410:lnet_select_peer_ni()) n:[10.249.249.207@o2ib, 10.249.249.206@o2ib] h:[1000, 1000] r:[n, n] c:[8, 8] s:[67, 50963]
00000400:00000200:11.0:1618858555.397804:0:43693:0:(lib-move.c:1460:lnet_select_peer_ni()) sd_best_lpni = 10.249.249.207@o2ib
00000400:00000100:11.0:1618858555.397807:0:43693:0:(lib-move.c:937:lnet_post_send_locked()) Aborting message for 12345-10.249.249.207@o2ib: LNetM[DE]Unlink() already called on the MD/ME.
00000400:00020000:11.0:1618858555.414727:0:43693:0:(lib-move.c:3162:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.249.249.207@o2ib: -125
00000400:00000200:11.0:1618858555.429119:0:43693:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -125, hstatus = 5
00000400:00000200:11.0:1618858555.429120:0:43693:0:(lib-msg.c:860:lnet_health_check()) health check: 10.249.248.2@o2ib->10.249.249.207@o2ib: PUT: LOCAL_ERROR
00000400:00000200:11.0:1618858555.429121:0:43693:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:11.0:1618858555.429122:0:43693:0:(peer.c:2508:lnet_discovery_event_send()) Push Send to 10.249.249.206@o2ib: 1 <<< Sets PUSH_FAILED; peer placed back on discovery queue without a reference taken. BUG HERE
00000400:00000200:11.0:1618858555.429125:0:43693:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff90cd54ed3430 <<< PUSH MD

Discovery wakes and processes the failed push. The peer is not in the DISCOVERING state, so lnet_peer_discovery_complete() is called. We drop an extra reference and trip the LBUG.
*hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_NIDS_UPTODATE LNET_PEER_PUSH_FAILED LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

00000400:00000200:22.0:1618858555.429128:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:22.0:1618858555.429130:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7131
00000400:00000200:22.0:1618858555.429131:0:43692:0:(peer.c:3146:lnet_peer_push_failed()) peer 10.249.249.206@o2ib
00000400:00000200:22.0:1618858555.429132:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6131 rc -125
00000400:00000200:22.0:1618858555.429134:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -125
00000400:00000200:22.0:1618858555.429135:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib <<< Reference dropped -> LBUG
00000400:00000200:22.0:1618858555.429137:0:43692:0:(peer.c:293:lnet_destroy_peer_locked()) ffff90de2632b800 nid 10.249.249.206@o2ib

*hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_NIDS_UPTODATE LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $ |
| Comments |
| Comment by Chris Horn [ 20/Apr/21 ] |
|
This is an abridged description of the bug, which I will use to try to create a reproducer. It may require instrumenting the code to get the timing to line up:

peer queued for discovery

*hornc@cflosbld08 hornc $ lpst2str.sh 0x771
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_DISCOVERING LNET_PEER_NIDS_UPTODATE LNET_PEER_PING_SENT LNET_PEER_PUSH_SENT
*hornc@cflosbld08 hornc $

ping fails -> PING_SENT cleared

*hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_NIDS_UPTODATE LNET_PEER_PUSH_FAILED LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

peer queued for discovery

*hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
LNET_PEER_MULTI_RAIL LNET_PEER_DISCOVERED LNET_PEER_REDISCOVER LNET_PEER_NIDS_UPTODATE LNET_PEER_FORCE_PING LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $

monitor thread: discovery thread processes peer and we hit lbug |
| Comment by Gerrit Updater [ 22/Apr/21 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43416 |
| Comment by Gerrit Updater [ 22/Apr/21 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43417

This patch was squashed with https://review.whamcloud.com/43418 |
| Comment by Gerrit Updater [ 22/Apr/21 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43418 |
| Comment by Gerrit Updater [ 23/Apr/21 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43425 |
| Comment by Gerrit Updater [ 23/Apr/21 ] |
|
<deleted> |
| Comment by Gerrit Updater [ 05/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43416/ |
| Comment by Gerrit Updater [ 11/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43425/ |
| Comment by SC Admin (Inactive) [ 01/Jun/21 ] |
|
FYI, we encountered something that looks like this with 2.12.6 clients and servers.

2021-05-29 12:04:53 [5935871.518144] LNetError: 911:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed:
2021-05-29 12:04:53 [5935871.530347] LNetError: 911:0:(peer.c:282:lnet_destroy_peer_locked()) LBUG
2021-05-29 12:04:53 [5935871.538442] Pid: 911, comm: lnet_discovery 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16 18:28:22 UTC 2021
2021-05-29 12:04:53 [5935871.549410] Call Trace:
2021-05-29 12:04:53 [5935871.553213] [<ffffffffc055d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2021-05-29 12:04:53 [5935871.561049] [<ffffffffc055d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2021-05-29 12:04:53 [5935871.568503] [<ffffffffc078ffca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
2021-05-29 12:04:53 [5935871.576898] [<ffffffffc0790605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
2021-05-29 12:04:53 [5935871.585617] [<ffffffffc0795340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
2021-05-29 12:04:53 [5935871.593632] [<ffffffffac0c5da1>] kthread+0xd1/0xe0
2021-05-29 12:04:53 [5935871.599733] [<ffffffffac795df7>] ret_from_fork_nospec_end+0x0/0x39
2021-05-29 12:04:53 [5935871.607198] [<ffffffffffffffff>] 0xffffffffffffffff
2021-05-29 12:04:53 [5935871.613357] Kernel panic - not syncing: LBUG
2021-05-29 12:04:53 [5935871.618770] CPU: 2 PID: 911 Comm: lnet_discovery Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.21.1.el7.x86_64 #1
2021-05-29 12:04:53 [5935871.632856] Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 2.10.0 11/12/2020
2021-05-29 12:04:53 [5935871.641552] Call Trace:
2021-05-29 12:04:53 [5935871.645136] [<ffffffffac78305a>] dump_stack+0x19/0x1b
2021-05-29 12:04:53 [5935871.651390] [<ffffffffac77c5b2>] panic+0xe8/0x21f
2021-05-29 12:04:53 [5935871.657283] [<ffffffffc055d8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2021-05-29 12:04:53 [5935871.664553] [<ffffffffc078ffca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
2021-05-29 12:04:53 [5935871.672756] [<ffffffffc0790605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
2021-05-29 12:04:53 [5935871.681295] [<ffffffffc0795340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
2021-05-29 12:04:53 [5935871.689117] [<ffffffffac0c6e90>] ? wake_up_atomic_t+0x30/0x30
2021-05-29 12:04:53 [5935871.695985] [<ffffffffc0794c80>] ? lnet_peer_merge_data+0xe00/0xe00 [lnet]
2021-05-29 12:04:53 [5935871.703954] [<ffffffffac0c5da1>] kthread+0xd1/0xe0
2021-05-29 12:04:53 [5935871.709833] [<ffffffffac0c5cd0>] ? insert_kthread_work+0x40/0x40
2021-05-29 12:04:53 [5935871.716901] [<ffffffffac795df7>] ret_from_fork_nospec_begin+0x21/0x21
2021-05-29 12:04:53 [5935871.724393] [<ffffffffac0c5cd0>] ? insert_kthread_work+0x40/0x40

It happened when mounting a new filesystem on ~100 nodes at once. We have a crash dump of the one node that LBUG'd if that would help.

Cheers, |
| Comment by Chris Horn [ 08/Jun/21 ] |
|
jpeyrard The fix for this issue is " " |
| Comment by Gerrit Updater [ 09/Jun/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43959 |
| Comment by Gerrit Updater [ 09/Jun/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43960 |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43418/ |
| Comment by Peter Jones [ 14/Jun/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44001 |
| Comment by Gerrit Updater [ 27/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43959/ |
| Comment by Gerrit Updater [ 27/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43960/ |
| Comment by Gerrit Updater [ 27/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44001/ |
| Comment by Gerrit Updater [ 22/Apr/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47116 |
| Comment by Gerrit Updater [ 30/May/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47116/ |