[LU-10825] Configuring multi-rail with a large number of nodes Created: 19/Mar/18 Updated: 20/Aug/18 Resolved: 02/May/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.5 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Taizeng Wu | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Recently we have been preparing to deploy Lustre with multi-rail, but I don't know how to enable dynamic discovery. We use lustre-2.10.3, and it seems dynamic discovery was only implemented in version 2.11. We have 2 MGS/MDS nodes, 6 OSS nodes and about 512 client nodes. How do we configure static multi-rail with such a large number of nodes? |
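For reference, dynamic peer discovery only exists from Lustre 2.11 onward, so it cannot be enabled on 2.10.3. On 2.11+ it is toggled with lnetctl or the lnet module parameter; a minimal sketch follows (these are the generally documented settings, not something specific to this ticket):

    # Lustre 2.11+ only: toggle dynamic peer discovery at runtime
    lnetctl set discovery 1      # enable (the default)
    lnetctl set discovery 0      # disable, forcing static peer configuration
    lnetctl discovery show       # query the current setting

    # Persistent equivalent via the lnet module option,
    # e.g. in /etc/modprobe.d/lustre.conf:
    #   options lnet lnet_peer_discovery_disabled=1

On 2.10.x the multi-rail peers must instead be configured statically, as discussed in the comments below.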
| Comments |
| Comment by Amir Shehata (Inactive) [ 19/Mar/18 ] |
|
Can you explain to me your MR deployment? Do all nodes have multiple interfaces? or the clients only? servers only? |
| Comment by Taizeng Wu [ 20/Mar/18 ] |
|
All nodes have two interfaces (one Mellanox FDR card with two ports). I am trying to configure the remote peers on the MDS/MGS nodes to include the OSS and client interfaces, the remote peers on the OSS nodes to include the MDS/MGS and client interfaces, and the remote peers on the client nodes to include the MDS/MGS and OSS interfaces. Is this configuration correct? I then run mkfs or mount Lustre using the MR primary NID. When I bring an interface down on a server node, the Lustre filesystem sometimes hangs. This issue may be caused by ARP (http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node). I followed the ARP flux guide, but the Lustre filesystem still hangs sometimes when an interface is brought down (dmesg reports "Request set has failed due to network error" for the node whose interface was brought down). Servers OS version: RHEL 7.4; clients OS version: RHEL 6.8
|
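The ARP flux mitigation referenced above comes down to a handful of sysctl settings on the multi-homed nodes. A sketch of the kind of configuration the wiki guide describes (interface names ib0/ib1 and the exact values are assumptions here; the wiki page is authoritative):

    # Candidate /etc/sysctl.conf entries for a two-port MR node (ib0, ib1)
    # Only answer ARP requests for addresses configured on the receiving interface
    net.ipv4.conf.all.arp_ignore = 1
    # Always use the outgoing interface's own address as the ARP source
    net.ipv4.conf.all.arp_announce = 2
    # Relax reverse-path filtering, which can otherwise drop traffic on multi-homed nodes
    net.ipv4.conf.ib0.rp_filter = 0
    net.ipv4.conf.ib1.rp_filter = 0

    # Apply without a reboot
    sysctl -p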
| Comment by Amir Shehata (Inactive) [ 20/Mar/18 ] |
|
I'm currently working on a patch to make it easier to configure large systems without Dynamic Discovery. But for now you'll need to configure the servers to know about the clients' interfaces, and the clients to know about the servers' interfaces. And since the OSS and MGS/MDS nodes communicate with each other, you'll need to configure them to know about each other's interfaces as well. MR doesn't handle interface-down cases: if you intentionally (or unintentionally) bring down an interface, it will interfere with filesystem operations, as you've seen. We're currently working on a feature, LNet Health, which will be able to handle this type of interface failure. |
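As an illustration of the static configuration described above, a minimal lnetctl sketch (all NIDs, interface names and file paths are hypothetical; every node needs peer entries for each node it talks to):

    # On every node: put both ports of the FDR card on the same LNet network
    lnetctl lnet configure
    lnetctl net add --net o2ib --if ib0,ib1

    # Example on a client: declare one multi-rail server peer
    # (primary NID first, then all NIDs belonging to that peer)
    lnetctl peer add --prim_nid 10.0.0.1@o2ib --nid 10.0.0.1@o2ib,10.0.1.1@o2ib

    # For several hundred nodes, build the peer list once, export it as YAML,
    # then distribute and import it rather than repeating the commands per node
    lnetctl export > /tmp/lnet-peers.yaml
    lnetctl import /tmp/lnet-peers.yaml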
| Comment by Gerrit Updater [ 27/Mar/18 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31785 |
| Comment by Gerrit Updater [ 27/Mar/18 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31786 |
| Comment by Gerrit Updater [ 02/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31785/ |
| Comment by Gerrit Updater [ 02/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31786/ |
| Comment by Peter Jones [ 02/May/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 02/May/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32249 |
| Comment by Gerrit Updater [ 02/May/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32250 |
| Comment by Gerrit Updater [ 01/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32249/ |
| Comment by Gerrit Updater [ 01/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32250/ |