Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14627

Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed:

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.12.7, Lustre 2.15.0
    • Lustre 2.14.0, Lustre 2.15.0
    • None
    • 3
    • 9223372036854775807

    Description

      I am fairly certain this bug can also result in this assert:

      LNetError: 9873:0:(peer.c:305:lnet_destroy_peer_locked()) ASSERTION( lp->lp_rtr_refcount == 0 ) failed:
      

      We saw this bug with HPE's Lustre 2.12 which differs quite a bit from the community 2.12 (namely because we backported the multi-rail routing feature). I'm pretty sure the bug is present in community 2.14+ and it may be present in older versions.

      TL;DR - Flaw in discovery logic results in a lost reference on an lnet_peer object which results in it being improperly destroyed.

      MDS sends an OST_CONNECT RPC to 10.249.249.206@o2ib. The message is queued for discovery. This adds a reference to the lnet_peer object:

      00000100:00000200:9.0:1618858502.715082:0:51842:0:(niobuf.c:85:ptl_send_buf()) Sending 520 bytes to portal 28, xid 1697249298880320, offset 0
      00000400:00000200:9.0:1618858502.715083:0:51842:0:(lib-move.c:4905:LNetPut()) LNetPut -> 12345-10.249.249.206@o2ib
      00000400:00000200:9.0:1618858502.715086:0:51842:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: 0 <<<< Reference added
      00000400:00000200:9.0:1618858502.715086:0:51842:0:(peer.c:2250:lnet_discover_peer_locked()) Discovery attempt # 1
      00000400:00000200:9.0:1618858502.715087:0:51842:0:(peer.c:2291:lnet_discover_peer_locked()) non-blocking discovery
      00000400:00000200:9.0:1618858502.715088:0:51842:0:(peer.c:2298:lnet_discover_peer_locked()) peer 10.249.249.206@o2ib NID 10.249.249.206@o2ib: 0. pending discovery
      00000400:00000200:9.0:1618858502.715089:0:51842:0:(lib-move.c:2050:lnet_initiate_peer_discovery()) msg ffff90cd4773d5f0 delayed. 10.249.249.206@o2ib pending discovery
      

      Discovery thread wakes and begins processing the peer by sending a discovery ping (NB ping sent to 10.249.249.207@o2ib).

      00000400:00000200:22.0:1618858502.715088:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:22.0:1618858502.715091:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6171
      00000400:00000010:22.0:1618858502.715093:0:43692:0:(api-ni.c:1671:lnet_ping_buffer_alloc()) alloc '(pbuf)': 281 at ffff90ddf2942000 (tot 485268226).
      00000400:00000010:22.0:1618858502.715093:0:43692:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff90cd54ed27f8 <<< PING MD.
      00000400:00000010:22.0:1618858502.715095:0:43692:0:(lib-lnet.h:535:lnet_rspt_alloc()) rspt alloc ffff90ddcde80360
      ...
      00000400:00000200:22.0:1618858502.715114:0:43692:0:(lib-move.c:1897:lnet_handle_send()) TRACE: 10.249.248.3@o2ib(10.249.248.3@o2ib:<?>) -> 10.249.249.207@o2ib(10.249.249.206@o2ib:10.249.249.207@o2ib) <?> : GET try# 0
      ...
      00000400:00000200:22.0:1618858502.715149:0:43692:0:(peer.c:3107:lnet_peer_send_ping()) peer 10.249.249.206@o2ib
      00000400:00000200:22.0:1618858502.715150:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371 rc 0
      *hornc@cflosbld08 hornc $ lpst2str.sh 0x4371
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_DISCOVERING
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PING_SENT
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      

      With the ping sent, discovery is waiting for either a response or timeout. Meanwhile, an incoming discovery push from 10.249.249.206@o2ib puts the peer back on the discovery queue:

      00000400:00000200:27.0:1618858538.614280:0:43825:0:(lib-move.c:4396:lnet_parse()) TRACE: 10.249.248.2@o2ib(10.249.248.2@o2ib) <- 10.249.249.206@o2ib : PUT - for me
      ...
      00000400:00000200:27.0:1618858538.614298:0:43825:0:(peer.c:2094:lnet_peer_push_event()) peer 10.249.249.206@o2ib(ffff90de2632b800) is MR
      00000400:00000200:27.0:1618858538.614301:0:43825:0:(peer.c:2171:lnet_peer_push_event()) Received Push 10.249.249.206@o2ib 3
      00000400:00000200:27.0:1618858538.614302:0:43825:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: -114 <<<< Peer already queued; no reference added
      

      Discovery thread wakes and begins processing the peer. It first merges the data in the push buffer:

      *hornc@cflosbld08 hornc $ lpst2str.sh 0x42f1
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_DISCOVERING
      LNET_PEER_DATA_PRESENT
      LNET_PEER_PING_SENT
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      00000400:00000200:22.0:1618858538.614308:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:22.0:1618858538.614316:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x42f1
      00000400:00000200:22.0:1618858538.614326:0:43692:0:(peer.c:2776:lnet_peer_merge_data()) peer 10.249.249.206@o2ib (ffff90de2632b800): 0
      00000400:00000200:22.0:1618858538.614327:0:43692:0:(peer.c:2985:lnet_peer_data_present()) peer 10.249.249.206@o2ib(ffff90de2632b800): 0. state = 0x4371
      00000400:00000200:22.0:1618858538.614328:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371 rc 1
      

      Next, it sends a discovery PUSH to the peer (NB push sent to 10.249.249.207@o2ib):

      *hornc@cflosbld08 hornc $ lpst2str.sh 0x4371
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_DISCOVERING
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PING_SENT
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      00000400:00000200:22.0:1618858538.614329:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371
      00000400:00000010:22.0:1618858538.614330:0:43692:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff90cd54ed3430 <<< PUSH MD.
      00000400:00000010:22.0:1618858538.614333:0:43692:0:(lib-lnet.h:535:lnet_rspt_alloc()) rspt alloc ffff90ddcde803f0
      ...
      00000400:00000200:22.0:1618858538.614351:0:43692:0:(lib-move.c:1897:lnet_handle_send()) TRACE: 10.249.248.2@o2ib(10.249.248.2@o2ib:<?>) -> 10.249.249.207@o2ib(10.249.249.206@o2ib:10.249.249.207@o2ib) <?> : PUT try# 0
      ...
      00000400:00000200:22.0:1618858538.614463:0:43692:0:(peer.c:3241:lnet_peer_send_push()) peer 10.249.249.206@o2ib
      00000400:00000200:22.0:1618858538.614463:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x771 rc 0
      *hornc@cflosbld08 hornc $ lpst2str.sh 0x771
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_DISCOVERING
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PING_SENT
      LNET_PEER_PUSH_SENT
      *hornc@cflosbld08 hornc $
      

      This is the first potential problem. I am not sure whether it is a valid state for the peer to have both PING_SENT and PUSH_SENT at the same time. I suspect it is not.

      The discovery ping eventually fails because o2iblnd cannot establish a connection with .207. The lnet_discovery_event_handler() sets the PING_FAILED state (via lnet_discovery_event_send()), and puts the peer back on the discovery queue.:

      00000800:00000100:29.0:1618858552.405824:0:147651:0:(o2iblnd_cb.c:3183:kiblnd_cm_callback()) 10.249.249.207@o2ib: ADDR ERROR -110
      00000800:00000200:29.0:1618858552.405825:0:147651:0:(o2iblnd.c:419:kiblnd_unlink_peer_locked()) peer_ni[ffff90cdedce3e80] -> 10.249.249.207@o2ib (2)--
      00000400:00000200:29.0:1618858552.405830:0:147651:0:(router.c:1720:lnet_notify()) 10.249.248.3@o2ib notifying 10.249.249.207@o2ib: down
      00000800:00000100:29.0:1618858552.405833:0:147651:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.249.249.207@o2ib: connection failed
      00000400:00000200:29.0:1618858552.405834:0:147651:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -113, hstatus = 2
      00000400:00000200:29.0:1618858552.405835:0:147651:0:(lib-msg.c:860:lnet_health_check()) health check: 10.249.248.3@o2ib->10.249.249.207@o2ib: GET: LOCAL_DROPPED
      00000400:00000200:29.0:1618858552.405836:0:147651:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.249.248.3@o2ib added to recovery queue. Health = 900
      00000400:00000100:29.0:1618858552.405837:0:147651:0:(lib-msg.c:710:lnet_attempt_msg_resend()) msg 0@<0:0>->10.249.249.207@o2ib exceeded retry count 2
      00000400:00000200:29.0:1618858552.405838:0:147651:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 5
      00000400:00000200:29.0:1618858552.405839:0:147651:0:(peer.c:2508:lnet_discovery_event_send()) Ping Send to 10.249.249.206@o2ib: 1  <<<< Clears PING_SENT, sets PING_FAILED
      

      Here we see a second problem. Note the "GET: LOCAL_DROPPED" from lnet_health_check() and subsequent call to lnet_handle_local_failure(). The LOCAL_DROPPED health status will only cause the LNet health value of the local NI to be lowered even though the problem could be with the remote interface. This is why LNet is repeatedly trying to send to .207 even though the interface is not functioning properly. The good news is this is a known issue with a fix. The ticket for this is https://jira.whamcloud.com/browse/LU-13571 .

      Discovery thread wakes and processes the ping failure:

      *hornc@cflosbld08 hornc $ lpst2str.sh 0x6d71
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_DISCOVERING
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PUSH_SENT
      LNET_PEER_PING_FAILED
      LNET_PEER_FORCE_PING
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      00000400:00000200:21.0:1618858552.405845:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:21.0:1618858552.405848:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6d71
      

      The discovery thread is going to process the ping failure. lnet_peer_ping_failed() is called. The lp_ping_mdh is unlinked. The unlink handler sees LNET_PEER_PUSH_SENT so it thinks that the PUSH is what was unlinked:

      00000400:00000200:21.0:1618858552.405850:0:43692:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 6
      00000400:00000200:21.0:1618858552.405851:0:43692:0:(peer.c:2536:lnet_discovery_event_unlink()) Push Unlink for message to peer 10.249.249.206@o2ib <<<< Clears PUSH_SENT, sets PUSH_FAILED
      00000400:00000200:21.0:1618858552.405853:0:43692:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff90cd54ed27f8 <<< PING MD
      00000400:00000200:21.0:1618858552.405855:0:43692:0:(peer.c:3018:lnet_peer_ping_failed()) peer 10.249.249.206@o2ib:-113
      00000400:00000200:21.0:1618858552.405856:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7171 rc -113
      00000400:00000200:21.0:1618858552.405858:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -113 <<<< Clears LNET_PEER_DISCOVERING, state is 0x7131
      00000400:00000200:21.0:1618858552.405859:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib <<<< Reference dropped
      *hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PUSH_FAILED
      LNET_PEER_FORCE_PING
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      

      This is another potential problem. I do not know if it is valid to have a PUSH_FAILED state when discovery completes. Similarly, it seems wrong to have FORCE_PING/FORCE_PUSH set when discovery completes.

      As part of lnet_peer_discovery_complete(), any messages that were queued for discovery are sent/finalized (includes our OST_CONNECT that started this all off):

      00000100:00000200:21.0:1618858552.405863:0:43692:0:(events.c:63:request_out_callback()) @@@ type 5, status -113  req@ffff90bde43e5a00 x1697249298880320/t0(0) o8->scratch-OST00bc-osc-MDT0000@10.249.249.206@o2ib:28/4 lens 520/544 e 0 to 0 dl 1618858552 ref 2 fl Rpc:N/0/ffffffff rc 0/-1 job:''
      

      An MGS_CONNECT reply results in peer being queued for discovery again in the same manner as the earlier OST_CONNECT:

      00010000:00000200:19.0:1618858552.405884:0:136340:0:(ldlm_lib.c:2986:target_send_reply_msg()) @@@ sending reply  req@ffff90cdf3d8a850 x1697492294959168/t0(0) o250->69523420-f6a9-9da1-ae47-897c569694e9@10.249.249.206@o2ib:0/0 lens 520/416 e 0 to 0 dl 1618858838 ref 1 fl Interpret:/0/0 rc 0/0 job:''
      00000400:00000010:19.0:1618858552.405888:0:136340:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff90cdf451caa0.
      00000100:00000200:19.0:1618858552.405888:0:136340:0:(niobuf.c:85:ptl_send_buf()) Sending 416 bytes to portal 25, xid 1697492294959168, offset 0
      00000400:00000200:19.0:1618858552.405890:0:136340:0:(lib-move.c:4905:LNetPut()) LNetPut -> 12345-10.249.249.206@o2ib
      00000400:00000200:19.0:1618858552.405900:0:136340:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: 0   <<<< Reference added
      00000400:00000200:19.0:1618858552.405900:0:136340:0:(peer.c:2250:lnet_discover_peer_locked()) Discovery attempt # 1
      00000400:00000200:19.0:1618858552.405901:0:136340:0:(peer.c:2291:lnet_discover_peer_locked()) non-blocking discovery
      00000400:00000200:19.0:1618858552.405902:0:136340:0:(peer.c:2298:lnet_discover_peer_locked()) peer 10.249.249.206@o2ib NID 10.249.249.206@o2ib: 0. pending discovery
      00000400:00000200:19.0:1618858552.405903:0:136340:0:(lib-move.c:2050:lnet_initiate_peer_discovery()) msg ffff90de1d222940 delayed. 10.249.249.206@o2ib pending discovery
      
      *hornc@cflosbld08 hornc $ lpst2str.sh 0x7171
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_DISCOVERING
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PUSH_FAILED
      LNET_PEER_FORCE_PING
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      00000400:00000200:21.0:1618858552.405902:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:21.0:1618858552.405906:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7171
      00000400:00000200:21.0:1618858552.405906:0:43692:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff90cd54ed3430
      00000400:00000200:21.0:1618858552.405907:0:43692:0:(peer.c:3146:lnet_peer_push_failed()) peer 10.249.249.206@o2ib
      00000400:00000200:21.0:1618858552.405908:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6171 rc -110
      00000400:00000200:21.0:1618858552.405913:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -110  <<<< Clears LNET_PEER_DISCOVERING, state is 0x6131
      00000400:00000200:21.0:1618858552.405915:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib 
      *hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_FORCE_PING
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      

      Monitor thread attempts to resend PUSH

      00000400:00000200:11.0:1618858555.397794:0:43693:0:(lib-move.c:3156:lnet_resend_pending_msgs_locked()) resending <?>->12345-10.249.249.207@o2ib: PUT recovery 0 try# 1
      00000400:00000200:11.0:1618858555.397796:0:43693:0:(lib-move.c:2677:lnet_handle_send_case_locked()) Source ANY to MR:  10.249.249.207@o2ib local destination
      00000400:00000200:11.0:1618858555.397798:0:43693:0:(lib-move.c:1659:lnet_get_best_ni()) compare ni 10.249.248.2@o2ib [c:509, d:10, s:12279791] with best_ni not seleced [c:-2147483648, d:-1, s:0]
      00000400:00000200:11.0:1618858555.397800:0:43693:0:(lib-move.c:1659:lnet_get_best_ni()) compare ni 10.249.248.3@o2ib [c:509, d:12, s:7451] with best_ni 10.249.248.2@o2ib [c:509, d:10, s:12279791]
      00000400:00000200:11.0:1618858555.397801:0:43693:0:(lib-move.c:1702:lnet_get_best_ni()) selected best_ni 10.249.248.2@o2ib
      00000400:00000200:11.0:1618858555.397803:0:43693:0:(lib-move.c:1410:lnet_select_peer_ni()) n:[10.249.249.207@o2ib, 10.249.249.206@o2ib] h:[1000, 1000] r:[n, n] c:[8, 8] s:[67, 50963]
      00000400:00000200:11.0:1618858555.397804:0:43693:0:(lib-move.c:1460:lnet_select_peer_ni()) sd_best_lpni = 10.249.249.207@o2ib
      00000400:00000100:11.0:1618858555.397807:0:43693:0:(lib-move.c:937:lnet_post_send_locked()) Aborting message for 12345-10.249.249.207@o2ib: LNetM[DE]Unlink() already called on the MD/ME.
      00000400:00020000:11.0:1618858555.414727:0:43693:0:(lib-move.c:3162:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.249.249.207@o2ib: -125
      00000400:00000200:11.0:1618858555.429119:0:43693:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -125, hstatus = 5
      00000400:00000200:11.0:1618858555.429120:0:43693:0:(lib-msg.c:860:lnet_health_check()) health check: 10.249.248.2@o2ib->10.249.249.207@o2ib: PUT: LOCAL_ERROR
      00000400:00000200:11.0:1618858555.429121:0:43693:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 5
      00000400:00000200:11.0:1618858555.429122:0:43693:0:(peer.c:2508:lnet_discovery_event_send()) Push Send to 10.249.249.206@o2ib: 1   <<< Sets PUSH_FAILED, Peer placed back on discovery queue without reference taken BUG HERE
      00000400:00000200:11.0:1618858555.429125:0:43693:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff90cd54ed3430 <<< PUSH MD
      

      Discovery wakes, processes the failed push. Peer is not in DISCOVERING state, so lnet_peer_discovery_complete() is called. We drop an extra reference and trip the LBUG.

      *hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_PUSH_FAILED
      LNET_PEER_FORCE_PING
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      00000400:00000200:22.0:1618858555.429128:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:22.0:1618858555.429130:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7131
      00000400:00000200:22.0:1618858555.429131:0:43692:0:(peer.c:3146:lnet_peer_push_failed()) peer 10.249.249.206@o2ib
      00000400:00000200:22.0:1618858555.429132:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6131 rc -125
      00000400:00000200:22.0:1618858555.429134:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -125
      00000400:00000200:22.0:1618858555.429135:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib <<< Reference dropped -> LBUG
      00000400:00000200:22.0:1618858555.429137:0:43692:0:(peer.c:293:lnet_destroy_peer_locked()) ffff90de2632b800 nid 10.249.249.206@o2ib
      *hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
      LNET_PEER_MULTI_RAIL
      LNET_PEER_DISCOVERED
      LNET_PEER_REDISCOVER
      LNET_PEER_NIDS_UPTODATE
      LNET_PEER_FORCE_PING
      LNET_PEER_FORCE_PUSH
      *hornc@cflosbld08 hornc $
      

      Attachments

        Issue Links

          Activity

            People

              hornc Chris Horn
              hornc Chris Horn
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: