  1. Lustre
  2. LU-12122

Deadlock with check_routers_before_use and discovery

Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 2.12.0
    • None
    • 3

    Description

      I found this issue when testing Amir's new patches (see https://review.whamcloud.com/#/c/33651/9), but I believe the issue exists in current master.

      LNetNIInit() calls lnet_monitor_thr_start() -> lnet_router_post_mt_start() -> lnet_wait_known_routerstate()

      lnet_wait_known_routerstate() will wait indefinitely until all gateways have been discovered.

      However, the discovery thread is not started until after lnet_monitor_thr_start() returns. Thus, LNet never finishes starting.
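The ordering problem described above can be modeled outside the kernel. This is a minimal Python sketch, not LNet code: `start_lnet`, `discovery_first`, and `wait_timeout` are hypothetical names, the worker thread only stands in for the discovery thread, and a timeout replaces the indefinite wait so that the script terminates.

```python
import threading


def start_lnet(discovery_first: bool, wait_timeout: float = 0.2) -> bool:
    """Model the startup order; return True if the router-state wait completes."""
    routers_discovered = threading.Event()

    def discovery_thread() -> None:
        # Stand-in for the discovery thread: marks all gateways as discovered.
        routers_discovered.set()

    worker = threading.Thread(target=discovery_thread)

    if discovery_first:
        # Hypothetical alternative order: discovery is running before the wait.
        worker.start()

    # Stand-in for lnet_wait_known_routerstate(). The wait described in this
    # report has no timeout; one is used here only so the model can return.
    completed = routers_discovered.wait(timeout=wait_timeout)

    if not discovery_first:
        # Reported order: discovery starts only after the wait has returned.
        worker.start()

    worker.join()
    return completed


if __name__ == "__main__":
    print("reported order completes:", start_lnet(discovery_first=False))
    print("discovery first completes:", start_lnet(discovery_first=True))
```

In the reported order the wait can only end through the timeout, which the real code path does not have. That matches the indefinite hang seen on the two non-router nodes.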

      Logs slowly fill with:

      [7073564.123980] LNetError: 31952:0:(router.c:873:lnet_check_routers()) Failed to discover router 192.168.2.26@tcp4
      

      Reproduced on a simple three-node VM setup.

      The LNet configuration:

      sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01 cat /etc/modprobe.d/lnet.conf | dshbak -c
      ----------------
      sles15s01
      ----------------
      options lnet networks="tcp(eth0)"
      options lnet routes="tcp4 192.168.2.26@tcp"
      options lnet lnet_peer_discovery_disabled=0
      options lnet check_routers_before_use=1
      ----------------
      sles15c01
      ----------------
      options lnet ip2nets="tcp4(eth0) 192.168.*.*; tcp99(eth0) 192.168.*.*"
      options lnet routes="tcp 192.168.2.26@tcp4"
      options lnet lnet_peer_discovery_disabled=0
      options lnet check_routers_before_use=1
      ----------------
      sles15build01
      ----------------
      options lnet ip2nets="tcp(eth0) 192.168.*.*; tcp4(eth1) 192.168.*.*; tcp99(eth1) 192.168.*.*"
      options lnet forwarding=enabled
      options lnet lnet_peer_discovery_disabled=0
      sles15build01:/etc/modprobe.d #
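To make the route lines above easier to read: in the simple form used here, a `routes` value names the remote network first and the gateway NID second. The sketch below splits such a line into its parts. `parse_simple_route` is a hypothetical helper and handles only this two-field form, not hop counts, priorities, or multiple routes.

```python
def parse_simple_route(option_line: str) -> dict:
    """Split 'options lnet routes="<remote net> <gateway NID>"' into fields."""
    # Take everything after 'routes=' and drop the surrounding quotes.
    value = option_line.split("routes=", 1)[1].strip().strip('"')
    # Two whitespace-separated fields: remote network, then gateway NID.
    remote_net, gateway_nid = value.split()
    # A NID has the form <address>@<network>.
    gateway_addr, gateway_net = gateway_nid.split("@", 1)
    return {
        "remote_net": remote_net,
        "gateway_addr": gateway_addr,
        "gateway_net": gateway_net,
    }


if __name__ == "__main__":
    for line in (
        'options lnet routes="tcp4 192.168.2.26@tcp"',  # sles15s01
        'options lnet routes="tcp 192.168.2.26@tcp4"',  # sles15c01
    ):
        print(parse_simple_route(line))
```

Read this way, sles15s01 reaches tcp4 through 192.168.2.26 on tcp, and sles15c01 reaches tcp through the same address on tcp4.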
      

      Attempt to start LNet:

      sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01
      pdsh> modprobe lnet
      pdsh> lctl net up
      sles15build01: LNET configured
      

      LNet starts on the node acting as a router, but startup hangs indefinitely on the other two nodes.

          People

            wc-triage WC Triage
            hornc Chris Horn