Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.12.3, Lustre 2.12.4
-
None
-
All CentOS 7.x. Hardware is either Dell or Lenovo. The IB infrastructure is EDR IB with an MSB7800 switch. MLNX OFED is 4.7-1.0.0.1 on the LNET routers.
-
3
-
9223372036854775807
Description
I built 2 new LNET routers and added them to our LNET environment. Their OS, LNET, and MLNX OFED versions are exactly the same as those of the 2 other existing LNET routers in this location. I added LNET routes on the 2 Lustre filesystems we have in this physical location to point to the 2 new LNET routers. I then tested one client in another of our data centers by adding the 2 LNET routes on the client to point to the new LNET routers. The client could read and write fine. The next day, various clients were having issues accessing the 2 Lustre filesystems on which I had previously set the LNET routes. We ended up removing all the LNET routes to the 2 new LNET routers on the Lustre filesystems, and things started working again. We then removed the 2 new LNET routers from our LNET environment.
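To make the route-addition step concrete, here is a minimal sketch that only builds and prints the `lnetctl route add` commands, one per gateway, without running them. The remote network name `tcp1` is a hypothetical placeholder (the actual remote net is not stated in this ticket), and the gateway NIDs are simply the two that appear in the log excerpts in this ticket; they are not necessarily the NIDs of the new routers.

```python
def route_add_commands(remote_net, gateway_nids):
    """Build one `lnetctl route add` argument list per gateway NID."""
    return [
        ["lnetctl", "route", "add", "--net", remote_net, "--gateway", nid]
        for nid in gateway_nids
    ]


if __name__ == "__main__":
    # "tcp1" is an assumed remote network name, used for illustration only.
    # The gateway NIDs are copied from the log lines quoted in this ticket.
    gateways = ["10.242.46.216@o2ib1", "10.242.46.217@o2ib1"]
    for cmd in route_add_commands("tcp1", gateways):
        # Dry run: print the command instead of executing it.
        print(" ".join(cmd))
```

Printing the commands first makes it possible to review them, or to apply them on one node at a time, before changing routing on every server.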
The LNET routers are running LNet 2.12.4; the Lustre filesystems are running Lustre 2.12.3 and a very old version.
We have not experienced this before and are wondering if there is a specific procedure we have to follow to add new LNET routers to our environment.
The messages we were seeing on the Lustre filesystems were, for example:
Feb 19 09:03:17 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages
Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.216@o2ib1 added to recovery queue. Health = 900
Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages
Feb 19 09:25:47 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.217@o2ib1 added to recovery queue. Health = 900
We were getting messages like the above for all 4 of the lnet routers, both the existing and the 2 new ones that were added.
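To see at a glance which router NIDs were affected, log lines like the ones above can be tallied per peer NI. The following is a minimal sketch: the regular expression is written only against the message format quoted in this ticket, and the function name `summarize_recovery_events` is made up for illustration.

```python
import re

# Matches only the "added to recovery queue" lines in the format quoted above.
RECOVERY_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LNetError:.*lpni\s+(?P<nid>\S+)\s+added to recovery queue\.\s+"
    r"Health\s*=\s*(?P<health>\d+)"
)


def summarize_recovery_events(lines):
    """Return {nid: {"count", "min_health", "last_seen"}} for matching lines."""
    summary = {}
    for line in lines:
        m = RECOVERY_RE.search(line)
        if m is None:
            continue  # e.g. "Skipped 9 previous similar messages"
        health = int(m.group("health"))
        entry = summary.setdefault(
            m.group("nid"),
            {"count": 0, "min_health": health, "last_seen": None},
        )
        entry["count"] += 1
        entry["min_health"] = min(entry["min_health"], health)
        entry["last_seen"] = m.group("ts")
    return summary


SAMPLE = [
    "Feb 19 09:03:17 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages",
    "Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.216@o2ib1 added to recovery queue. Health = 900",
    "Feb 19 09:14:32 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 9 previous similar messages",
    "Feb 19 09:25:47 boslfs02mds01 kernel: LNetError: 6413:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.242.46.217@o2ib1 added to recovery queue. Health = 900",
]

if __name__ == "__main__":
    for nid, info in summarize_recovery_events(SAMPLE).items():
        print(nid, info)
```

Because the kernel rate-limits these messages ("Skipped 9 previous similar messages"), the per-NID counts undercount the real number of events; the summary is mainly useful for listing which NIDs appear and when each was last seen.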
Also, the hardware configuration of the 2 new LNET routers is different. They have a dual-port ConnectX-4 card running in Ethernet mode at 10G, with the 2 ports LACP bonded, plus a CX5 card for rate-100 IB. The older LNET routers have a ConnectX-4 IB card with IB rate 100 and a traditional 10G Ethernet card with two 10G ports that are LACP bonded. I am not sure if this matters, but I wanted to mention it.