Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
Suppose we have peer A send a message to peer D with a router in the middle. This is the credit flow for the router:
Peer NI A -> Router NI B - Router NI C -> Peer NI D
- Router takes Peer NI A router credit (aka the peer router credit or peer rx credit)
- Router takes router buffer credit based on payload size (aka the router buffer credit)
- Router completes receive from A
- Router takes Peer NI D tx credit (aka the peer tx credit)
- Router takes Router NI C tx credit (aka the network tx credit)
- Router completes send
- Credits are returned in the reverse order
We can see that messages from peer A that are destined for other peers could be queued behind messages for peer D because the RX credits are only returned when a message is fully processed by the router. At the same time, the peer TX queue for peer D can grow as deep as (# max peer router credits * # peers).
Here we see an lnet router processing death notification of a Lustre server:
dklog.router.486.20240303_010620.llog:00000400:00000200:30.0:1709449436.755868:0:8186:0:(router.c:1741:lnet_notify()) 10.12.2.2@o2ib4001 notifying 10.12.2.51@o2ib4001: down dklog.router.93.20240303_010620.llog:00000400:00000200:10.0:1709449437.350706:0:7862:0:(router.c:1741:lnet_notify()) 10.12.2.3@o2ib4001 notifying 10.12.2.51@o2ib4001: down dklog.router.94.20240303_010620.llog:00000400:00000200:6.0:1709449436.507254:0:7860:0:(router.c:1741:lnet_notify()) 10.12.2.4@o2ib4001 notifying 10.12.2.51@o2ib4001: down
At time of server death notification, the routers already had 100s of messages on 10.12.2.51@o2ib4001 peer TX queue. After server death notification, the routers attempt to send all pending messages to 10.12.2.51@o2ib4001 in groups of 16 (peer_credits value). Each group of 16 takes 10-13 seconds to timeout. Because of eager receive (note, even without eager receive credits can be returned to message originator before the forward completes), router may have unlimited # of messages in RX queue destined for 10.12.2.51@o2ib4001. If there are 3*16 messages back-to-back in RX queue for 10.12.2.51@o2ib4001, then any messages behind those will not be sent for 30-40 seconds.
[hornc@s-lmo-kalina client_dklogs.saturn-p2.20240303_010618]$ grep -e lnet_notify -e 'Timed out tx' -e 'kiblnd_send' -e '10.12.2.51@o2ib4001 LPNI seq info' dklog.router.486.20240303_010620.llog | grep 10.12.2.51@o2ib4001 ... 00000400:00000200:30.0:1709449436.755868:0:8186:0:(router.c:1741:lnet_notify()) 10.12.2.2@o2ib4001 notifying 10.12.2.51@o2ib4001: down 00000400:00000200:10.0:1709449436.845884:0:8176:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79508462:79508462:2017:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703536:11703536:-258:4294967295] 00000400:00000200:10.0:1709449436.849029:0:8176:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79508547:79508547:2016:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703537:11703537:-259:4294967295] 00000400:00000200:11.0:1709449436.849127:0:8179:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79508551:79508551:2016:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703538:11703538:-260:4294967295] ^^^^^ Shows peer TX queue growing ... 00000800:00000400:30.0:1709449448.718577:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746688:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746690:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746691:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746691:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746692:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746693:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746694:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746695:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746696:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746697:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746698:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746699:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746700:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746701:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds 00000800:00000400:30.0:1709449448.746701:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds ^^^^ First timeouts ~12s after death notification ... 00000800:00000200:30.0:1709449448.746714:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449448.746760:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746778:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579768:79579768:2019:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703809:11703809:-529:4294967295] 00000800:00000200:30.0:1709449448.746784:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449448.746814:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746829:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579772:79579772:2026:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703810:11703810:-528:4294967295] 00000800:00000200:30.0:1709449448.746835:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449448.746866:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746880:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579775:79579775:2030:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703811:11703811:-527:4294967295] 00000800:00000200:30.0:1709449448.746886:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449448.746915:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746929:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579777:79579777:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703812:11703812:-526:4294967295] 00000800:00000200:30.0:1709449448.746934:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 448 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746947:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579778:79579778:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703813:11703813:-526:4294967295] 00000800:00000200:30.0:1709449448.746952:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746966:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579779:79579779:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703814:11703814:-526:4294967295] 00000800:00000200:30.0:1709449448.746971:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.746988:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579782:79579782:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703815:11703815:-526:4294967295] 00000800:00000200:30.0:1709449448.746993:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449448.747007:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579783:79579783:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703816:11703816:-526:4294967295] 00000800:00000200:30.0:1709449448.747013:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:21.0:1709449448.747030:0:8191:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579786:79579786:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703817:11703817:-526:4294967295] 00000800:00000200:30.0:1709449448.747047:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449448.747077:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:21.0:1709449448.747095:0:8191:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579792:79579792:2027:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703818:11703818:-525:4294967295] 00000800:00000200:30.0:1709449448.747111:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 ^^^^^ 16 new sends while queue continues to grow ... 00000800:00000400:30.0:1709449461.742625:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 2 seconds 00000800:00000400:30.0:1709449461.769846:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769847:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769848:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769849:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769850:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769851:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769851:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769852:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769853:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769854:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769855:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769855:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769856:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769857:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds 00000800:00000400:30.0:1709449461.769858:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds ^^^^^ Next batch of timeouts ~13s later ... 00000800:00000200:30.0:1709449461.769870:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.769931:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 448 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449461.769963:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588061:79588061:2031:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703914:11703914:-618:4294967295] 00000800:00000200:30.0:1709449461.769969:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:30.0:1709449461.770001:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588062:79588062:2031:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703915:11703915:-618:4294967295] 00000800:00000200:30.0:1709449461.770006:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770035:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770068:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:20.0:1709449461.770101:0:8192:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588068:79588068:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703916:11703916:-616:4294967295] 00000800:00000200:30.0:1709449461.770103:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770146:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770200:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:20.0:1709449461.770231:0:8192:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588077:79588077:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703917:11703917:-614:4294967295] 00000800:00000200:30.0:1709449461.770249:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770292:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770348:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770398:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770451:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000400:00000200:20.0:1709449461.770458:0:8192:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588089:79588089:2024:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703918:11703918:-610:4294967295] 00000800:00000200:30.0:1709449461.770495:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 00000800:00000200:30.0:1709449461.770542:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001 ^^^^ 16 new sends, queue grows ... etc.
The problem is that lnet health values are not manipulated on single-rail nodes
(because there are no alternative paths in the network), and router's peer health feature relies on LNet health as well as peer NI status delivered by discovery to determine whether peers are up or down.
We can fix the issue by having routers decrement/increment health even for single-rail peers or by changing lnet_notify() to set the discovery NI status down/up (or both of these things).
Merged for 2.16