[LU-2950] Port LNET routes_config change done by Fujitsu in latest lustre Created: 12/Mar/13  Updated: 07/Aug/14  Due: 20/Mar/13  Resolved: 01/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.4.0

Type: Improvement Priority: Major
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: Fujitsu, lnet

Attachments: File legacy_routes_test     File new_routes_test    
Issue Links:
Related
is related to LUDOC-147 Update Lustre Manual for Port LNET ro... Resolved
Rank (Obsolete): 7082

 Description   

Port a change made by Fujitsu where they added a new lnet module param: routes_config, which can specify a file that contains a set of routes to be used by lnet



 Comments   
Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ]

Please be aware of: LU-1071 able to handle many lnet routes

I really don't like the fact of reading a file into the kernel space but I guess it is seen as a necessary evil at this point. Please safeguard all file actions as much as you can.

Comment by Amir Shehata (Inactive) [ 15/Mar/13 ]

We took Issac's suggestion and we created a shell script that reads a routes config file then uses lctl to configure lnet with the routes. This avoids opening files in kernel space.

Comment by Amir Shehata (Inactive) [ 25/Mar/13 ]

test files

Comment by Amir Shehata (Inactive) [ 25/Mar/13 ]

We created two scripts. One to convert from the legacy syntax to the new syntax, and the second to take a file of the new syntax and configure lustre using lctl.

The legacy syntax (which is currently being used in the lustre.conf file):
<network> [<hop>] <gateway>@<exit network>[:<priority>]

The new syntax (which is used in the new scripts):
<network>:

{ gateway: <gateway>@<exit network> [hope: <hop>] [priority: <prioirty>] }

I would think the intention is to move the syntax to the new form in the future, so there isn't three different syntax for configuring routes.

Below is sample test runs. I have also attached two files to this Bug that were used for testing the scripts.

[ashehata@localhost scripts]$ ~/lustre-master/lustre/scripts/lustre_routes_conversion legacy_routes_test new_routes_test
tcp1 23 192.168.213.1@tcp:1 tcp5 34 193.30.4.3@tcp:4
tcp2 54 10.1.3.2@tcp
tcp3 10.3.4.3@tcp:3
tcp4 10.3.3.4@tcp
lustre_routes_conversion: converted routes written to test9
[ashehata@localhost scripts]$ ~/lustre-master/lustre/scripts/lustre_routes_config --dry-run --verbose new_routes_test
lctl --net tcp1 add_route 192.168.213.1@tcp 23 1
lctl --net tcp5 add_route 193.30.4.3@tcp 34 4
lctl --net tcp2 add_route 10.1.3.2@tcp 54 4
lctl --net tcp3 add_route 10.3.4.3@tcp 1 3
lctl --net tcp4 add_route 10.3.3.4@tcp 1 3

Comment by Keith Mannthey (Inactive) [ 26/Mar/13 ]

Can you submit your scripts into gerrit?

Comment by Amir Shehata (Inactive) [ 26/Mar/13 ]

The code inspection is @ http://review.whamcloud.com/#change,5757

Comment by Hiroya Nozaki [ 04/Apr/13 ]

Hi, Amir. Nice to meet you.

I heartily agree your implementation because I also think reading a file into kernel space can be a kind of dangerous.
But running the new script means that the timing to read a configuration file should be changed. That's why I wonder that lnet configuration will be the same as expected when "lustre_routes_config" or something like that runs on a router node after "lctl network up". Because LNetNIInit() checks "im_a_router" there and, I guess, it may have some effects on interpreting configuration.
Actually, I'm not so sure that it can be problem but I'll appreciate if you answer my question, or consider the case if you've never considered it.

Thank you

Comment by Amir Shehata (Inactive) [ 12/Apr/13 ]

Hi Hiroya,

My apologies for the delay, for some reason I wasn't getting any notifications on Jira of comments being added. I will look into that.

I have looked at the case that you have indicated; and you are absolutely right. The im_a_router flag will not be set when initially LNetNIInit() is called, and that will cause the "forwarding" parameter not to be checked, resulting in an lnet router node not behaving as expected. Thanks for pointing this out.

There are a couple of way to approach this:

1. Is to ensure that for a router node, in the luster.conf you enter a routes module param entry that corresponds to the NID of the router. This way when the routes param is being parsed a router node will set the "im_a_router" param properly and thus allocate the pools. The script: "lustre_routes_config" can be used later on to add extra routing entries as desired.
This solution would be temporary until a feature we have in the pipe, DLC, lands. This will allow the modifications of lnet configurations dynamically.

2. Is to modify the patch to ensure that when routes are added, if a route NID is added which corresponds to the local NID, then im_a_route is set, and immediately the pools are created.

Please let me know your thoughts on these approaches.

Thanks again for catching this case.

amir

Comment by Amir Shehata (Inactive) [ 12/Apr/13 ]

Hi Hiroya,

A note to highlight:

The DLC (Dynamic Lnet Configuration) will be backwards compatible. (IE, existing configurations will not need to be changed, when DLC is introduced)

thanks
amir

Comment by Amir Shehata (Inactive) [ 24/Apr/13 ]

I have updated review:
http://review.whamcloud.com/#change,5757
to add the man pages and
http://review.whamcloud.com/#change,5965
for the testing portion
If you guys can review it, that would be great.

Comment by Ryan Haasken [ 07/Aug/14 ]

Since this change added the ability to configure LNet routes from a file, and updated the lnet init script to load that config file at start time, I think the reload and probe actions should be updated so that they will reload the configuration from the file, if necessary.

From the LSB, this is what the reload action should do:

cause the configuration of the service to be reloaded without actually stopping and restarting the service

Similarly probe should do the following (according to SUSE, at least):

probe — Probes for the necessity of a reload. Depending on the service, prints "reload" or "restart" if a reload is required. Nothing is printed if a reload is not required.

I am curious about this because I am working on porting at least the lnet init script to SLES. Amir, am I right to think that the lnet init script should call lustre_routes_config during the reload operation?

Generated at Sat Feb 10 01:29:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.