[LU-14454] LNET routers added - then access issues with Lustre storage | Created: 19/Feb/21 | Updated: 05/May/21 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Michael Ethier (Inactive) | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
All CentOS 7.x. Hardware is either Dell or Lenovo. The IB infrastructure is EDR InfiniBand with an MSB7800 switch. MLNX OFED is 4.7-1.0.0.1 on the LNET routers. |
||
| Attachments: |
|
| Severity: | 3 |
| Description |
|
I built 2 new LNET routers and added them to our LNET environment. The OS/LNET/MLNX OFED software versions are exactly the same as on the 2 other existing LNET routers in this location. I added LNET routes on the 2 Lustre filesystems we have in this physical location to point to the 2 new LNET routers. I tested one client in another data center by adding the 2 LNET routes on the client to point to the new LNET routers, and that client could read and write fine. The next day we were having issues from various clients with access to the 2 Lustre filesystems I had set LNET routes on previously. We ended up removing all the LNET routes to the 2 new LNET routers on the Lustre filesystems and things started working again, so we removed the 2 new LNET routers from our LNET environment.

The LNET routers are running LNet 2.12.4; the Lustre filesystems are running Lustre 2.12.3 and a very old version, respectively. We have not experienced this before and were wondering if there is a specific procedure we have to follow to add new LNET routers in our environment?

The messages we were seeing on the Lustre FS were, for example:

We were getting messages like the above for all 4 of the LNET routers, both the existing ones and the 2 new ones that were added.

Also, the hardware configuration of the 2 new LNET routers is different. They have a dual-port ConnectX-4 card running in Ethernet mode at 10G with the 2 ports LACP bonded, plus a CX5 card for IB at rate 100. The older LNET routers have a ConnectX-4 IB card with IB rate 100 and a traditional 10G Ethernet card with 2 10G ports, LACP bonded. Not sure if this matters, but I wanted to mention it. |
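For reference, route changes like the ones described above are typically made with lnetctl; a minimal sketch, with hypothetical placeholder NIDs standing in for this site's actual addresses:

# on a client (tcp1 side), route to the IB net via a new router
lnetctl route add --net o2ib1 --gateway 10.10.10.5@tcp1
# on a storage server (o2ib1 side), route back to the client net via the same router
lnetctl route add --net tcp1 --gateway 10.0.0.5@o2ib1
# confirm the routes and gateway state
lnetctl route show -v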
| Comments |
| Comment by Michael Ethier (Inactive) [ 19/Feb/21 ] |
|
Actually, those recoveryq messages I mentioned above may not be related to the access problem I described. I can see those messages in /var/log/messages on one of the Lustre filesystems at much earlier times, weeks ago, before we experienced the access issue. |
| Comment by Peter Jones [ 19/Feb/21 ] |
|
Cyril, could you please assist with this one? Thanks. Peter |
| Comment by Michael Ethier (Inactive) [ 24/Feb/21 ] |
|
Hello, |
| Comment by Cyril Bordage [ 24/Feb/21 ] |
|
Hello Michael, sorry for the late answer; I had to take unexpected leave. Could you provide the outputs of the following commands from the servers, the routers and several clients?

lnetctl peer show -v 4
lnetctl net show -v 4
lnetctl global show
lnetctl route show
Thank you. Cyril. |
| Comment by Michael Ethier (Inactive) [ 25/Feb/21 ] |
|
Hi Cyril, no worries. Those 2 LNET routers with the issues do not have LNet active, so I can't run the commands on them. If we add routes to the Lustre storage and to some clients, that's when we run into issues with accessing the Lustre storage afterwards.

[root@boslnet03 ~]# lnetctl peer show -v 4
Thanks, |
| Comment by Cyril Bordage [ 26/Feb/21 ] |
|
Hello Michael, I do not follow "those 2 lnet routers with issues do not have their lnet active to I can't run the commands." Do you mean you cannot risk disturbing your working configuration by enabling them? It will be difficult to diagnose with so little information. To see what is going on I need all of the requested details, from the exact commands I provided. If you cannot touch your production environment, could you provide the commands you used to configure everything, along with the details of your network (NIDs of the servers, the routers and the clients) and all available information for the servers and clients? Thank you. Cyril.
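One convenient way to capture the full configuration being asked for here is LNet's YAML export (the output file name is just an example):

lnetctl export > lnet-config.yaml    # dumps configured nets, routes, peers and global settings as YAML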
|
| Comment by Michael Ethier (Inactive) [ 04/Mar/21 ] |
|
Hi Cyril, |
| Comment by Michael Ethier (Inactive) [ 11/Mar/21 ] |
|
Hi Cyril, here are your requested commands run on a working LNET router:

[root@boslnet01 ~]# lnetctl net show -v 4
Your commands on a non-working lnet router:
Your commands on a Lustre storage node (MDS):
|
| Comment by Michael Ethier (Inactive) [ 11/Mar/21 ] |
|
Details of existing working lnet router:
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
p1p1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
p1p2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
p3p1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
p3p2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500

Details of one of our Lustre storage servers:

[root@boslfs02mds01 ~]# rpm -qa | grep lustre

em1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
em2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
em3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
em4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
idrac: flags=67<UP,BROADCAST,RUNNING> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

Details of an LNET router that doesn't work:

[root@boslnet03 ~]# dkms status

enp0s20f0u1u6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
p1p1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
p1p2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500 |
| Comment by Serguei Smirnov [ 18/Mar/21 ] |
|
Michael, would it be possible to set up a live debugging session so that we can go through the procedure of adding a new router together? I haven't seen anything wrong in the logs you provided, but the errors mentioned in the description could be explained by mis-configuration. The procedure should be roughly as follows (see the sketch after this list):

1) Set up the router node: configure LNet, start LNet, verify by dumping "lnetctl net show", and by "lnetctl ping"-ing to and from peers on both nets (tcp1, o2ib1) multiple times.
2) Add the route on the server and on the client, listing the new router as the gateway to use to reach the respective nets.
3) Verify by "lnetctl ping"-ing multiple times across the new router (a tcp1 client to server and back).
4) Verify by "lnetctl ping"-ing multiple times across the new router using a client on tcp2 (from the logs provided, the new router handles only tcp1, so let's verify that tcp2 is still good).
5) If all looks good so far, try mounting the filesystem. |
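A minimal command-level sketch of steps 1-3, with hypothetical NIDs standing in for the real addresses (the exact nets and gateway NIDs depend on the site configuration):

# step 1: on the new router, bring LNet up and verify both nets are configured
modprobe lnet
lnetctl lnet configure --all      # picks up nets from the module parameters
lnetctl net show

# still step 1: ping the router from peers on both nets, and from the router to peers
lnetctl ping 10.10.10.5@tcp1      # hypothetical router NID on the tcp1 side
lnetctl ping 10.0.0.5@o2ib1       # hypothetical router NID on the o2ib1 side

# step 2: add the route on a client and on a server, pointing at the new gateway
lnetctl route add --net o2ib1 --gateway 10.10.10.5@tcp1    # on the client
lnetctl route add --net tcp1 --gateway 10.0.0.5@o2ib1      # on the server

# step 3: from the client, ping a server NID across the new router several times
lnetctl ping 10.0.0.21@o2ib1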
| Comment by Michael Ethier (Inactive) [ 23/Mar/21 ] |
|
Hi, I'm attaching 2 log files per Serguei's request. Thanks, Mike. log.txt |
| Comment by Serguei Smirnov [ 01/Apr/21 ] |
|
During today's call we found out that the new router's IP may need to be added to the access control list for a group of clients. Regular ping from the client to the router was going through, but lnetctl ping was not. Because lnetctl ping was part of the procedure we used earlier, the failed lnetctl ping we're seeing now may not explain the behaviour we were seeing before. We'll proceed once the ACL issue is out of the way. |
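Worth noting for anyone hitting something similar: a regular ping only tests ICMP reachability, whereas lnetctl ping exercises the LNet protocol itself (the socklnd uses TCP port 988 by default), so a firewall or ACL can pass one and block the other. A quick comparison, using a hypothetical router address:

ping 10.10.10.5                  # ICMP reachability only
lnetctl ping 10.10.10.5@tcp1     # LNet-level reachability, what the routers actually need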
| Comment by Michael Ethier (Inactive) [ 01/Apr/21 ] |
|
I verified that from the client we can now lnetctl ping the 2 new LNET routers. There were ACLs blocking access.
|
| Comment by Michael Ethier (Inactive) [ 02/Apr/21 ] |
|
It has been verified that network ACLs were causing the issue. We have successfully added a new LNET router and Lustre FS access seems to be fine now. This ticket can be closed. Thanks to Serguei for all his help, much appreciated. |
| Comment by Peter Jones [ 03/Apr/21 ] |
|
Great - thanks for the update |
| Comment by Serguei Smirnov [ 03/Apr/21 ] |
|
Michael reported later yesterday via e-mail that the clients which didn't get the route to the new gateway set up had issues with accessing the FS. This should not have happened unless asymmetric routes are configured to be dropped on the clients. |
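For reference, the dropping of asymmetrically routed messages is governed by the lnet_drop_asym_route module parameter, exposed in newer releases as a lnetctl global setting; whether a given 2.12.x build has it is an assumption worth verifying. A quick check on a client:

cat /sys/module/lnet/parameters/lnet_drop_asym_route   # 0 = accept asym routes, 1 = drop (if present in this build)
lnetctl global show                                    # newer releases also report drop_asym_route here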
| Comment by Serguei Smirnov [ 05/May/21 ] |
|
Michael recently reported via email: "I wanted to let you know we finally added those 2 LNET routers we were working on previously, globally, to our cluster in Holyoke/Boston, and they are now running in production. It appears the procedure requires that you add the LNET routes across all the clients that are mounting the storage on the other side of the routers, and then add LNET routes to the storage side after that." Michael believes that the ticket can be closed. |
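Sketched with the same hypothetical NIDs as above, the ordering that worked amounts to:

# 1) first, on every client that mounts the storage through the routers
lnetctl route add --net o2ib1 --gateway 10.10.10.5@tcp1
# 2) only after all clients have the route, on the storage servers
lnetctl route add --net tcp1 --gateway 10.0.0.5@o2ib1

Adding the storage-side route first would let the servers start replying through the new gateway while some clients still have no route involving it, which is consistent with the asymmetric-route symptom Serguei described above.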