Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13764

lp_dc_error is not cleared when new discovery ping is sent as response to incoming push

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This bug can result in routes being marked down when they are in fact up:

      Ping fails: lp_dc_error is set, and lp_state is set with FORCE_PING

      00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99
      00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30275 rc 0
      00000400:00000200:25.0:1594238420.912144:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:25.0:1594238420.912146:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30875
      00000400:00000200:25.0:1594238420.912147:0:44689:0:(peer.c:2523:lnet_discovery_event_handler()) Received event: 6
      00000400:00000200:25.0:1594238420.912148:0:44689:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff905fbb363210
      00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:2954:lnet_peer_ping_failed()) peer 601@gni99:-113
      00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x32075 rc -113
      00000400:00000200:25.0:1594238420.912152:0:44689:0:(peer.c:3161:lnet_peer_discovery_error()) Discovery error 601@gni99: -113
      00000400:00000200:25.0:1594238420.912153:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99
      00000400:00000200:25.0:1594238420.912154:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113
      

      Push arrives

      00000400:00000200:0.0:1594238425.096430:0:43969:0:(api-ni.c:2085:lnet_push_target_event_handler()) type 2 status 0 unlinked 1
      00000400:00000200:0.0:1594238425.096432:0:43969:0:(peer.c:2065:lnet_peer_push_event()) peer 601@gni99(ffff90348c6c8000) is MR
      00000400:00000200:0.0:1594238425.096433:0:43969:0:(peer.c:2142:lnet_peer_push_event()) Received Push 601@gni99 1
      

      Push is processed:

      00000400:00000200:25.0:1594238425.096441:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:25.0:1594238425.096445:0:44689:0:(api-ni.c:2075:lnet_push_target_post()) posted push target ffff9064ad063000
      00000400:00000200:25.0:1594238425.096449:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x220f5
      00000400:00000200:25.0:1594238425.096452:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030
      00000400:00000200:25.0:1594238425.096453:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030
      00000400:00000200:25.0:1594238425.096454:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030
      00000400:00000200:25.0:1594238425.096455:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0
      00000400:00000200:25.0:1594238425.096456:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x22175
      00000400:00000200:25.0:1594238425.096458:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175 rc 1
      

      Peer is marked with FORCE_PING, so a ping is sent:

      00000400:00000200:25.0:1594238425.096459:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175
      00000400:00000200:25.0:1594238425.096462:0:44689:0:(lib-move.c:5068:LNetGet()) LNetGet -> 12345-10.16.100.10@o2ib10
      ...
      00000400:00000200:25.0:1594238425.096475:0:44689:0:(lib-move.c:1855:lnet_handle_send()) TRACE: 10.16.100.60@o2ib10(10.16.100.60@o2ib10:<?>) -> 10.16.100.10@o2ib10(10.16.100.10@o2ib10:10.16.100.10@o2ib10) <?> : GET try# 0
      00000800:00000200:25.0:1594238425.096477:0:44689:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.16.100.10@o2ib10
      00000800:00000200:25.0:1594238425.096480:0:44689:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9034adc34a80] -> 10.16.100.10@o2ib10 (2) version: 12
      00000800:00000200:25.0:1594238425.096481:0:44689:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (36)++
      00000800:00000200:25.0:1594238425.096482:0:44689:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff90345d3e1e00] (37)++
      00000800:00000200:25.0:1594238425.096485:0:44689:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (38)--
      00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99
      00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20375 rc 0
      

      Ping reply is processed:

      00000400:00000200:25.0:1594238425.096580:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
      00000400:00000200:25.0:1594238425.096581:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x201f5
      00000400:00000200:25.0:1594238425.096582:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030
      00000400:00000200:25.0:1594238425.096583:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030
      00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030
      00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0
      00000400:00000200:25.0:1594238425.096585:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x20175
      00000400:00000200:25.0:1594238425.096586:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175 rc 1
      00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175
      00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3180:lnet_peer_discovered()) peer 601@gni99
      00000400:00000200:25.0:1594238425.096588:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20115 rc 0
      00000400:00000200:25.0:1594238425.096589:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99
      

      lp_dc_error was never cleared, so it looks like discovery failed, and route is marked down:

      00000400:00000200:25.0:1594238425.096590:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113
      

      Attachments

        Activity

          People

            hornc Chris Horn
            hornc Chris Horn
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: