Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17855

Router peer health feature is not working correctly

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      Suppose we have peer A send a message to peer D with a router in the middle. This is the credit flow for the router:

      Peer NI A -> Router NI B - Router NI C -> Peer NI D
      1. Router takes Peer NI A router credit (aka the peer router credit or peer rx credit)
      2. Router takes router buffer credit based on payload size (aka the router buffer credit)
      3. Router completes receive from A
      4. Router takes Peer NI D tx credit (aka the peer tx credit)
      5. Router takes Router NI C tx credit (aka the network tx credit)
      6. Router completes send
      7. Credits are returned in the reverse order

      We can see that messages from peer A that are destined for other peers could be queued behind messages for peer D because the RX credits are only returned when a message is fully processed by the router. At the same time, the peer TX queue for peer D can grow as deep as (# max peer router credits * # peers).

      Here we see an lnet router processing death notification of a Lustre server:

      dklog.router.486.20240303_010620.llog:00000400:00000200:30.0:1709449436.755868:0:8186:0:(router.c:1741:lnet_notify()) 10.12.2.2@o2ib4001 notifying 10.12.2.51@o2ib4001: down
      dklog.router.93.20240303_010620.llog:00000400:00000200:10.0:1709449437.350706:0:7862:0:(router.c:1741:lnet_notify()) 10.12.2.3@o2ib4001 notifying 10.12.2.51@o2ib4001: down
      dklog.router.94.20240303_010620.llog:00000400:00000200:6.0:1709449436.507254:0:7860:0:(router.c:1741:lnet_notify()) 10.12.2.4@o2ib4001 notifying 10.12.2.51@o2ib4001: down
      

      At time of server death notification, the routers already had 100s of messages on 10.12.2.51@o2ib4001 peer TX queue. After server death notification, the routers attempt to send all pending messages to 10.12.2.51@o2ib4001 in groups of 16 (peer_credits value). Each group of 16 takes 10-13 seconds to timeout. Because of eager receive (note, even without eager receive credits can be returned to message originator before the forward completes), router may have unlimited # of messages in RX queue destined for 10.12.2.51@o2ib4001. If there are 3*16 messages back-to-back in RX queue for 10.12.2.51@o2ib4001, then any messages behind those will not be sent for 30-40 seconds.

      [hornc@s-lmo-kalina client_dklogs.saturn-p2.20240303_010618]$ grep -e lnet_notify -e 'Timed out tx' -e 'kiblnd_send' -e '10.12.2.51@o2ib4001 LPNI seq info' dklog.router.486.20240303_010620.llog | grep 10.12.2.51@o2ib4001
      ...
      00000400:00000200:30.0:1709449436.755868:0:8186:0:(router.c:1741:lnet_notify()) 10.12.2.2@o2ib4001 notifying 10.12.2.51@o2ib4001: down
      00000400:00000200:10.0:1709449436.845884:0:8176:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79508462:79508462:2017:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703536:11703536:-258:4294967295]
      00000400:00000200:10.0:1709449436.849029:0:8176:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79508547:79508547:2016:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703537:11703537:-259:4294967295]
      00000400:00000200:11.0:1709449436.849127:0:8179:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79508551:79508551:2016:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703538:11703538:-260:4294967295]
      ^^^^^ Shows peer TX queue growing
      ...
      00000800:00000400:30.0:1709449448.718577:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746688:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746690:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746691:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746691:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746692:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746693:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746694:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746695:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746696:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746697:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746698:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746699:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746700:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746701:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      00000800:00000400:30.0:1709449448.746701:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 1 seconds
      ^^^^ First timeouts ~12s after death notification
      ...
      00000800:00000200:30.0:1709449448.746714:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449448.746760:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746778:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579768:79579768:2019:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703809:11703809:-529:4294967295]
      00000800:00000200:30.0:1709449448.746784:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449448.746814:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746829:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579772:79579772:2026:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703810:11703810:-528:4294967295]
      00000800:00000200:30.0:1709449448.746835:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449448.746866:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746880:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579775:79579775:2030:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703811:11703811:-527:4294967295]
      00000800:00000200:30.0:1709449448.746886:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449448.746915:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746929:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579777:79579777:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703812:11703812:-526:4294967295]
      00000800:00000200:30.0:1709449448.746934:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 448 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746947:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579778:79579778:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703813:11703813:-526:4294967295]
      00000800:00000200:30.0:1709449448.746952:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746966:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579779:79579779:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703814:11703814:-526:4294967295]
      00000800:00000200:30.0:1709449448.746971:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.746988:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579782:79579782:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703815:11703815:-526:4294967295]
      00000800:00000200:30.0:1709449448.746993:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449448.747007:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579783:79579783:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703816:11703816:-526:4294967295]
      00000800:00000200:30.0:1709449448.747013:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:21.0:1709449448.747030:0:8191:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579786:79579786:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703817:11703817:-526:4294967295]
      00000800:00000200:30.0:1709449448.747047:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449448.747077:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:21.0:1709449448.747095:0:8191:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79579792:79579792:2027:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703818:11703818:-525:4294967295]
      00000800:00000200:30.0:1709449448.747111:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001   
      ^^^^^ 16 new sends while queue continues to grow
      ...
      00000800:00000400:30.0:1709449461.742625:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 2 seconds
      00000800:00000400:30.0:1709449461.769846:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769847:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769848:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769849:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769850:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769851:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769851:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769852:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769853:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769854:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769855:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769855:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769856:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769857:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      00000800:00000400:30.0:1709449461.769858:0:8186:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.12.2.51@o2ib4001: 3 seconds
      ^^^^^ Next batch of timeouts ~13s later
      ...
      00000800:00000200:30.0:1709449461.769870:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.769931:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 448 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449461.769963:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588061:79588061:2031:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703914:11703914:-618:4294967295]
      00000800:00000200:30.0:1709449461.769969:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:30.0:1709449461.770001:0:8186:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588062:79588062:2031:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703915:11703915:-618:4294967295]
      00000800:00000200:30.0:1709449461.770006:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770035:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770068:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:20.0:1709449461.770101:0:8192:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588068:79588068:2029:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703916:11703916:-616:4294967295]
      00000800:00000200:30.0:1709449461.770103:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770146:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770200:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:20.0:1709449461.770231:0:8192:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588077:79588077:2028:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703917:11703917:-614:4294967295]
      00000800:00000200:30.0:1709449461.770249:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770292:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770348:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770398:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770451:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000400:00000200:20.0:1709449461.770458:0:8192:0:(lib-move.c:1895:lnet_handle_send()) 10.12.2.2@o2ib4001 NI seq info: [79588089:79588089:2024:4294967295] 10.12.2.51@o2ib4001 LPNI seq info [11703918:11703918:-610:4294967295]
      00000800:00000200:30.0:1709449461.770495:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      00000800:00000200:30.0:1709449461.770542:0:8186:0:(o2iblnd_cb.c:1680:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.12.2.51@o2ib4001
      ^^^^ 16 new sends, queue grows
      ...
      etc.
      

      The problem is that lnet health values are not manipulated on single-rail nodes
      (because there are no alternative paths in the network), and router's peer health feature relies on LNet health as well as peer NI status delivered by discovery to determine whether peers are up or down.

      We can fix the issue by having routers decrement/increment health even for single-rail peers or by changing lnet_notify() to set the discovery NI status down/up (or both of these things).

      Attachments

        Activity

          [LU-17855] Router peer health feature is not working correctly
          pjones Peter Jones added a comment -

          Merged for 2.16

          pjones Peter Jones added a comment - Merged for 2.16

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55342/
          Subject: LU-17855 lnet: Set peer NI down on lnet_notify
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 2a0127d5fba5462bdd2eb7530baeded454104a5a

          gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55342/ Subject: LU-17855 lnet: Set peer NI down on lnet_notify Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2a0127d5fba5462bdd2eb7530baeded454104a5a

          "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55342
          Subject: LU-17855 lnet: Set peer NI down on lnet_notify
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 2d3eb72d00c5ea316c02be7ba8eda59a22d8f5f1

          gerrit Gerrit Updater added a comment - "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55342 Subject: LU-17855 lnet: Set peer NI down on lnet_notify Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 2d3eb72d00c5ea316c02be7ba8eda59a22d8f5f1

          People

            hornc Chris Horn
            hornc Chris Horn
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: