Details
- Bug
- Resolution: Cannot Reproduce
- Blocker
- None
- Lustre 2.15.0
- None
- 3
Description
The following commit has caused a serious regression on master: clients are unable to mount a filesystem under certain LNet configurations (namely routed ones):
commit 024f9303bc6f32a3113357c864765c4f9c93ed03
Author: Amir Shehata <ashehata@whamcloud.com>
Date: Wed May 5 11:35:06 2021 -0700

    LU-14668 lnet: Lock primary NID logic
I believe this should be reverted and the patches for LU-14668 (which is still open) should be re-worked.
Some additional detail on the bug:
The aforementioned commit will break any routed configuration where the clients mount the filesystem using non-primary NIDs. For example:
MGS
10.16.100.52@o2ib 10.16.100.53@o2ib 10.16.100.52@o2ib10 10.16.100.53@o2ib10
Clients have routes to the o2ib10 network, so they mount using something like:
mount -t lustre 10.16.100.52@o2ib10,10.16.100.53@o2ib10:/lustre ...
LNetPrimaryNID() on the client returns 10.16.100.52@o2ib10 as the primary NID (because of https://review.whamcloud.com/43563/ ), so the client sets up its ptlrpc connection using this NID. Incoming messages from the MGS, however, carry the actual primary NID, 10.16.100.52@o2ib. The two do not match, so the incoming messages are dropped and the client cannot mount.
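The mismatch described above can be illustrated with a toy model. This is only a sketch of the matching logic as described in this ticket, not LNet or ptlrpc code: `Peer`, `locked_primary_nid`, and `accept_incoming` are hypothetical names, and the assumption that the first NID in the tuple is the real primary comes from the MGS NID list given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    """Toy stand-in for an LNet peer (hypothetical, not a real LNet type)."""
    nids: tuple

    @property
    def primary(self):
        # Assumption: the first NID listed is the peer's actual primary NID.
        return self.nids[0]


def locked_primary_nid(mount_nid):
    # Models the behaviour described in the ticket: the NID used for the
    # initial connection is treated as the primary NID.
    return mount_nid


def accept_incoming(expected_nid, source_primary_nid):
    # The connection is keyed on a single NID; a message whose source
    # primary NID differs does not match and is dropped.
    return expected_nid == source_primary_nid


mgs = Peer(nids=("10.16.100.52@o2ib", "10.16.100.52@o2ib10"))

# Routed client: mounts via the o2ib10 NID, which is not the MGS primary.
routed_ok = accept_incoming(locked_primary_nid("10.16.100.52@o2ib10"), mgs.primary)

# Direct client: mounts via the actual primary NID.
direct_ok = accept_incoming(locked_primary_nid("10.16.100.52@o2ib"), mgs.primary)

print(routed_ok, direct_ok)
```

In this model the routed client's messages never match, while the direct client's do, which mirrors the mount failure reported here.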
walleye-p5:~ # !grep
grep lustre /etc/fstab
10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 /lus/kjcf05 lustre rw,flock,lazystatfs,noauto 0 0
walleye-p5:~ # mount /lus/kjcf05
mount.lustre: mount 10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 at /lus/kjcf05 failed: Input/output error
Is the MGS running?
walleye-p5:~ #
If I revert https://review.whamcloud.com/43563 then I'm able to mount:
walleye-p5:~ # mount /lus/kjcf05
walleye-p5:~ # lfs check servers
kjcf05-OST0000-osc-ffff8888361cd000 active.
kjcf05-OST0001-osc-ffff8888361cd000 active.
kjcf05-OST0002-osc-ffff8888361cd000 active.
kjcf05-OST0003-osc-ffff8888361cd000 active.
kjcf05-MDT0000-mdc-ffff8888361cd000 active.
kjcf05-MDT0001-mdc-ffff8888361cd000 active.
MGC10.16.100.52@o2ib10 active.
walleye-p5:~ #
I think the regression is not strictly limited to routed configurations; it affects any client mount where the client's initial connection attempt goes to a non-primary NID. This would be typical for routed clients. It is less common with direct connect, but it is possible there too (for example, with multi-homed servers).
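To check whether a given mount would hit this case, one could compare the NIDs in the mount device string against the servers' primary NIDs. The sketch below is a hypothetical helper, not part of Lustre. It assumes IPv4-style NIDs (no colons inside a NID), that the device string ends in `:/fsname`, and that the set of primary NIDs is supplied by hand; treating 10.16.100.53@o2ib as a primary NID is an assumption based on the MGS NID list above.

```python
def parse_mount_spec(spec):
    """Split 'nid,nid:nid,nid:/fsname' into NID groups and the fsname.

    Assumes IPv4-style NIDs, so ':' only appears as a separator.
    """
    nodes, sep, fsname = spec.rpartition(":/")
    if not sep or not nodes:
        raise ValueError(f"not a Lustre mount device string: {spec!r}")
    groups = [group.split(",") for group in nodes.split(":")]
    return groups, fsname


def non_primary_nids(spec, known_primaries):
    """Return the NIDs in the mount string that are not known primary NIDs."""
    groups, _ = parse_mount_spec(spec)
    return [nid for group in groups for nid in group if nid not in known_primaries]


# Assumed primary NIDs of the two MGS nodes (supplied manually).
primaries = {"10.16.100.52@o2ib", "10.16.100.53@o2ib"}

routed = non_primary_nids(
    "10.16.100.52@o2ib10,10.16.100.53@o2ib10:/lustre", primaries)
direct = non_primary_nids("10.16.100.52@o2ib:/lustre", primaries)

print(routed)  # NIDs that would trigger the mismatch
print(direct)  # empty list: mount uses a primary NID
```

A non-empty result means the initial connection attempt may go to a non-primary NID, which is the condition described above.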
Issue Links
- is related to LU-14668 LNet: do discovery in the background (Resolved)