[LU-10825] Configuring multi-rail with a large number of nodes Created: 19/Mar/18  Updated: 20/Aug/18  Resolved: 02/May/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: Lustre 2.12.0, Lustre 2.10.5

Type: Improvement Priority: Minor
Reporter: Taizeng Wu Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

 Description   

We are preparing to deploy Lustre with multi-rail, but I don't know how to enable dynamic discovery.

We use lustre-2.10.3, and it seems dynamic discovery is only implemented in version 2.11.

We have about 2 MGS/MDS, 6 OSS and 512 client nodes. How do we configure static multi-rail with such a large number of nodes?



 Comments   
Comment by Amir Shehata (Inactive) [ 19/Mar/18 ]

Can you explain your MR deployment to me? Do all nodes have multiple interfaces, or only the clients, or only the servers?

Comment by Taizeng Wu [ 20/Mar/18 ]

All nodes have two interfaces (one Mellanox FDR card with two ports).

I am trying to configure the remote peers as follows: on the MDS/MGS nodes, the peers include the OSS and client nodes; on the OSS nodes, the MDS/MGS and client nodes; on the client nodes, the MDS/MGS and OSS nodes. Is this configuration correct?
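The peer scheme described above can be sketched in a few lines of Python to see how many peer entries each node type would carry. This is only an illustration: the role names and the `peers_to_configure` and `peer_entry_count` helpers are hypothetical, and the node counts are the approximate ones from the description.

```python
# Approximate node counts from the ticket description; role names are hypothetical.
ROLE_COUNTS = {"mds_mgs": 2, "oss": 6, "client": 512}


def peers_to_configure(role, counts=ROLE_COUNTS):
    """Return {remote_role: node_count} that a node of `role` lists as peers.

    Mirrors the scheme in the comment: every node lists the nodes of the
    other two roles, and none of its own role.
    """
    if role not in counts:
        raise ValueError(f"unknown role: {role}")
    return {r: n for r, n in counts.items() if r != role}


def peer_entry_count(role, counts=ROLE_COUNTS):
    """Total number of remote peers a node of `role` has to define."""
    return sum(peers_to_configure(role, counts).values())


if __name__ == "__main__":
    for role in ROLE_COUNTS:
        print(role, peers_to_configure(role), peer_entry_count(role))
```

With these counts, a client carries only 8 peer entries, while each server carries more than 500, which is why the server side is the part worth scripting.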

Then I run mkfs or mount Lustre using the MR primary NID.

When I bring down an interface on a server node, the Lustre filesystem sometimes hangs.

This issue may be caused by ARP ( http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node).

I followed the ARP flux guide, but the Lustre filesystem still sometimes hangs when an interface is brought down (dmesg reports "Request set has failed due to network error" for the node whose interface was brought down).

Servers OS Version: RHEL 7.4

Clients OS Version: RHEL 6.8

 

Comment by Amir Shehata (Inactive) [ 20/Mar/18 ]

I'm currently working on a patch to make it easier to configure large systems without Dynamic Discovery.

But for now you'll need to configure the servers to know about the clients' interfaces, and the clients to know about the servers' interfaces. And since the OSS and MGS nodes communicate with each other, you'll need to configure them to know about each other's interfaces as well.
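Until such a patch is available, one way to cope with the node count is to generate the static peer definitions with a script. The following is a minimal sketch, not a tested deployment tool. It emits `lnetctl peer add --prim_nid ... --nid ...` commands; the addressing plan (10.0.x.x for the first port, 10.1.x.x for the second, both on `o2ib`), the role names, and all helper functions are assumptions made for illustration and must be replaced with the site's real inventory.

```python
import ipaddress

# Approximate node counts from the ticket; role names are hypothetical.
ROLE_COUNTS = {"mds_mgs": 2, "oss": 6, "client": 512}


def node_nids(index, base_a="10.0.0.0", base_b="10.1.0.0", net="o2ib"):
    """Return the two NIDs of node `index`, one per IB port (assumed addressing)."""
    port_a = ipaddress.IPv4Address(base_a) + index
    port_b = ipaddress.IPv4Address(base_b) + index
    return [f"{port_a}@{net}", f"{port_b}@{net}"]


def build_inventory(counts=ROLE_COUNTS):
    """Assign consecutive host indexes (starting at 1) to every node of every role."""
    inventory, index = {}, 1
    for role, count in counts.items():
        inventory[role] = [node_nids(i) for i in range(index, index + count)]
        index += count
    return inventory


def peer_add_command(nids):
    """Build one lnetctl command: first NID is the primary, the rest are added to it."""
    if len(nids) < 2:
        raise ValueError("a multi-rail peer needs at least two NIDs")
    primary, *others = nids
    return f"lnetctl peer add --prim_nid {primary} --nid {','.join(others)}"


def commands_for(local_role, inventory):
    """Commands to run on a node of `local_role`: one per node of every other role."""
    return [
        peer_add_command(nids)
        for role, nodes in inventory.items()
        if role != local_role
        for nids in nodes
    ]


if __name__ == "__main__":
    inventory = build_inventory()
    for command in commands_for("client", inventory):
        print(command)
```

The output for a given role can be reviewed and then run on every node of that role, for example through a parallel shell. If the servers of the same role also need to know about each other's interfaces, the `role != local_role` filter would have to be relaxed accordingly.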

MR doesn't handle interface-down cases. If you intentionally (or unintentionally) bring down an interface, it will interfere with file system operations, as you've seen. We're currently working on a feature, LNet Health, which will be able to handle this kind of interface failure.

Comment by Gerrit Updater [ 27/Mar/18 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31785
Subject: LU-10825 libcfs: generate ip addresses
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eb88bfc56451ad790445daf1d5be303915b596b9

Comment by Gerrit Updater [ 27/Mar/18 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31786
Subject: LU-10825 lnet: add ip2nets syntax handling for peer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3edcd387af4028e93f5d4df9caed9a36539ccbf6

Comment by Gerrit Updater [ 02/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31785/
Subject: LU-10825 libcfs: generate ip addresses
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c5f788397213aa41356df1f96f7ade58653973a

Comment by Gerrit Updater [ 02/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31786/
Subject: LU-10825 lnet: add ip2nets syntax handling for peer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 70c95457f6836a9c0a9e95ae0c4bdd20f99a8747

Comment by Peter Jones [ 02/May/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 02/May/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32249
Subject: LU-10825 libcfs: generate ip addresses
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: a22011c6d2cd804413b5f7e8353b687fb742a495

Comment by Gerrit Updater [ 02/May/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32250
Subject: LU-10825 lnet: add ip2nets syntax handling for peer
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: fc78b0ee95a2ee85121e84e5d104f5c268aae26f

Comment by Gerrit Updater [ 01/Aug/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32249/
Subject: LU-10825 libcfs: generate ip addresses
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: a48dc3fd0f738b545571a6d2cfdeb337f2d3243b

Comment by Gerrit Updater [ 01/Aug/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32250/
Subject: LU-10825 lnet: add ip2nets syntax handling for peer
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: e124f39b6b4dd56780ba4490b81dca32ab08575c
