lnet network NIs go down when they have no peers and check_routers_before_use=1

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0, Lustre 2.15.2
    • Lustre 2.15.0
    • lustre-2.15.0_3.llnl-3.t4.x86_64
      TOSS 4.4-5
    • 3

    Description

      When starting lnet, we observe NIs going down after about 110 sec. The nodes on which we've observed this issue are router nodes. We are setting check_routers_before_use=1 in our lnet module parameters. We do not see this issue with check_routers_before_use=0.
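
      One way to confirm which value a running node actually loaded is to read the parameter back from sysfs. The following is a minimal Python sketch, assuming the lnet module exports check_routers_before_use under /sys/module/lnet/parameters (this report does not state that); read_module_param and its sysfs_root argument are hypothetical names introduced for illustration.

```python
from pathlib import Path


def read_module_param(name, module="lnet", sysfs_root="/sys/module"):
    """Return a kernel module parameter as a stripped string.

    Returns None when the file cannot be read, for example because the
    module is not loaded or the parameter is not exported (assumption:
    exported parameters live at <sysfs_root>/<module>/parameters/<name>).
    """
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("check_routers_before_use")
    print("check_routers_before_use =", value)
```

      Running this on each router before and after changing the module options would show whether a node came up with the setting that was intended.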

      The network is
      o2ib18 <-> tcp129 <-> o2ib100

      The routers of interest are opal[187-190]
      opal[187,188] route between o2ib18 and tcp129
      opal[189,190] route between tcp129 and o2ib100
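
      The topology above can be sketched as a small graph to see which routers must be usable for traffic to cross from o2ib18 to o2ib100. This is only an illustrative Python model of the layout described in this report, not LNet's routing logic; ROUTERS and find_route are hypothetical names.

```python
from collections import deque

# Router -> the two LNet networks it bridges, as listed in the report.
ROUTERS = {
    "opal187": ("o2ib18", "tcp129"),
    "opal188": ("o2ib18", "tcp129"),
    "opal189": ("tcp129", "o2ib100"),
    "opal190": ("tcp129", "o2ib100"),
}


def find_route(src_net, dst_net, up_routers):
    """Breadth-first search over networks.

    Returns a list of (router, next_network) hops from src_net to dst_net
    using only routers in up_routers, or None if no path exists.
    """
    queue = deque([(src_net, [])])
    seen = {src_net}
    while queue:
        net, hops = queue.popleft()
        if net == dst_net:
            return hops
        for router in sorted(up_routers):
            nets = ROUTERS[router]
            if net not in nets:
                continue
            # The network on the other side of this router.
            other = nets[1] if nets[0] == net else nets[0]
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + [(router, other)]))
    return None


if __name__ == "__main__":
    print(find_route("o2ib18", "o2ib100", set(ROUTERS)))
    print(find_route("o2ib18", "o2ib100", {"opal187", "opal188"}))
```

      In this model, a client on o2ib18 needs at least one usable router from each pair to reach o2ib100, which matches the report: with the routers' NIs down on both hops, no path remains.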

      o2ib100 includes many non-opal nodes, including LNet routers and MDS and OSS nodes.

      The issue was first observed on the tcp network. However, stopping lnet on all nodes in o2ib18 then starting it on opal187 showed the same symptoms on the infiniband NI.

      When the NI status was down, traffic was unable to flow between compute nodes on o2ib18 and a filesystem on o2ib100. Also, pings don't work between nodes with down NIs.

      However, when we start opal[188,190] with check_routers_before_use not set and then start opal[187,189] with check_routers_before_use=1, opal[187,189] are able to ping and be pinged by opal[188,190], but can't ping each other or themselves.

      We noticed this when booting the opal cluster. All the non-opal nodes on o2ib100 were up and LNet was running on those non-opal nodes. The opal router nodes listed above were powered on first, and after they were up and LNet was started, the rest of the opal nodes (about 60 lustre clients) were booted. We found that the opal routers' NIs were down and the opal clients could not ping through the opal routers to the MDS and OSS nodes on o2ib100. This is a concern because this scenario occurs when we update operating system versions or recover from power outages.

        Activity

          [LU-16106] lnet network NIs go down when they have no peers and check_routers_before_use=1

          defazio Gian-Carlo Defazio added a comment -

          I tested patch set 3 and it looks good.

          Sorry for the late notification, the testing resources with the correct setup for that test have been unavailable lately.

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48529/
          Subject: LU-16106 lnet: allow direct messages regardless of peer NI status
          Project: fs/lustre-release
          Branch: b2_15
          Current Patch Set:
          Commit: 9ae1fc3e0e4507c242c5f379e6364ad270d865c0

          pjones Peter Jones added a comment -

          This fix has now landed for 2.16. We still need to track the port to b2_15 being merged and confirm that LLNL's testing is successful with the latest version

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48355/
          Subject: LU-16106 lnet: allow direct messages regardless of peer NI status
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3345a8a54e89c342a4ce2d8d4bcb04ee919bcd52

          defazio Gian-Carlo Defazio added a comment -

          We have not tested patch set 3.

          gerrit Gerrit Updater added a comment -

          "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/48529
          Subject: LU-16106 lnet: allow direct messages regardless of peer NI status
          Project: fs/lustre-release
          Branch: b2_15
          Current Patch Set: 1
          Commit: d747b9c24b9c8366f0551a7b790aad30b3a80786

          ofaaland Olaf Faaland added a comment -

          Gian,

          Have we tested with the final version of the patch (set 3)?

          thanks,

          defazio Gian-Carlo Defazio added a comment -

          The patch seems to solve the issues we were having. Starting the routers in any order works now. This is with check_routers_before_use=1, of course. The routers can ping each other, and compute nodes can get through the routers to the file system.

          gerrit Gerrit Updater added a comment -

          "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48355
          Subject: LU-16106 lnet: ignore peer ni down status if it was never up
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 13e7664eef06d081366c7be2e8b43c186f70a429

          defazio Gian-Carlo Defazio added a comment -

          Serguei,

          I've now seen the same issue on our local lustre 2.14 and 2.12 when all other router nodes are down (or have lnet down) and a single node starts lnet. I need to collect more data on what combinations of nodes, startup orders, and module parameters cause the problem. I'll post that info early next week.

          People

            ssmirnov Serguei Smirnov
            defazio Gian-Carlo Defazio