LU-14454: LNET routers added - then access issues with Lustre storage

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.3, Lustre 2.12.4
    • None
    • All CentOS 7.x. Hardware is either Dell or Lenovo. IB infrastructure is EDR IB with an MSB7800 switch. MLNX OFED is 4.7-1.0.0.1 on the LNET routers
    • 3

    Description

      I built 2 new LNET routers and added them to our LNET environment. The OS, LNET, and MLNX OFED versions are exactly the same as on the 2 existing LNET routers in this location. On the 2 Lustre filesystems we have in this physical location, I added LNET routes pointing to the 2 new LNET routers. I then tested one client in another of our data centers by adding the 2 LNET routes on that client to point to the new LNET routers. The client could read and write fine. The next day, various clients were having issues accessing the 2 Lustre filesystems on which I had previously set the LNET routes. We removed all the LNET routes to the 2 new LNET routers from the Lustre filesystems, and things started working again. We then removed the 2 new LNET routers from our LNET environment.

      The LNET routers are running LNET 2.12.4; the Lustre filesystems are running Lustre 2.12.3 and a very old version.

      We have not experienced this before and were wondering if there is a specific procedure we have to follow to add new LNET routers to our environment?

      The messages we were seeing on the Lustre filesystems were, for example:
      Feb 19 09:03:17 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages
      Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.216@o2ib1 added to recovery queue. Health = 900
      Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages
      Feb 19 09:25:47 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.217@o2ib1 added to recovery queue. Health = 900

      We were getting messages like the above for all 4 of the lnet routers, both the existing and the 2 new ones that were added.
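
To see which router NIDs are affected and how often, the recovery-queue messages can be tallied per peer. The following is a minimal sketch, not part of the original report: the function name `summarize_recovery_events` and the regular expression are illustrative assumptions based only on the message format quoted above, and other LNET versions may word these messages differently. Because the kernel also logs "Skipped N previous similar messages", the counts should be read as a lower bound.

```python
import re
from collections import Counter

# Pattern derived from the kernel messages quoted in this report (assumption:
# the wording is stable across the log being analysed).
RECOVERY_RE = re.compile(
    r"lpni (?P<nid>\S+) added to recovery queue\. Health = (?P<health>\d+)"
)


def summarize_recovery_events(lines):
    """Return (events per NID, lowest health value seen per NID)."""
    counts = Counter()
    min_health = {}
    for line in lines:
        match = RECOVERY_RE.search(line)
        if match is None:
            # Lines such as "Skipped 9 previous similar messages" are ignored.
            continue
        nid = match.group("nid")
        health = int(match.group("health"))
        counts[nid] += 1
        min_health[nid] = min(health, min_health.get(nid, health))
    return counts, min_health


# Sample input copied from the messages quoted above.
SAMPLE = [
    "Feb 19 09:03:17 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages",
    "Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.216@o2ib1 added to recovery queue. Health = 900",
    "Feb 19 09:25:47 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.217@o2ib1 added to recovery queue. Health = 900",
]

counts, min_health = summarize_recovery_events(SAMPLE)
for nid in sorted(counts):
    print(nid, counts[nid], min_health[nid])
```

Running the same function over the full syslog of each server (for example by passing an open file object as `lines`) would show whether all 4 router NIDs appear with similar frequency or whether the new routers dominate.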

      Also, the hardware configuration of the 2 new LNET routers is different. They have a dual-port ConnectX-4 card running in Ethernet mode at 10G, with the 2 ports LACP bonded, plus a CX5 card for rate-100 IB. The older LNET routers have a ConnectX-4 IB card at IB rate 100 and a traditional 10G Ethernet card whose two 10G ports are LACP bonded. Not sure if this matters, but I wanted to mention it.

      Attachments

        1. log.txt
          72.97 MB
        2. log1.txt
          21.88 MB

          People

            ssmirnov Serguei Smirnov
            mre64 Michael Ethier (Inactive)