Lustre / LU-15169

Regression in "024f9303bc LU-14668 lnet: Lock primary NID logic" breaks client mounts

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Severity: 3

    Description

      This commit has caused a serious regression on master where clients are unable to mount a filesystem under certain LNet configurations (namely routed ones):

      commit 024f9303bc6f32a3113357c864765c4f9c93ed03
      Author: Amir Shehata <ashehata@whamcloud.com>
      Date:   Wed May 5 11:35:06 2021 -0700
      
          LU-14668 lnet: Lock primary NID logic
      

      I believe this should be reverted and the patches for LU-14668 (which is still open) should be re-worked.

      Some additional detail on the bug:

      The aforementioned commit will break any routed configuration where the clients mount the filesystem using non-primary NIDs. For example:

      MGS

      10.16.100.52@o2ib
      10.16.100.53@o2ib
      10.16.100.52@o2ib10
      10.16.100.53@o2ib10
      

      Clients have routes to the o2ib10 network, so they mount using something like:

      mount -t lustre 10.16.100.52@o2ib10,10.16.100.53@o2ib10:/lustre ...
      

      LNetPrimaryNID() on the client returns 10.16.100.52@o2ib10 as the primary NID (because of https://review.whamcloud.com/43563/ ), so the client sets up its ptlrpc connection using this NID. But incoming messages from the MGS carry the actual primary NID, 10.16.100.52@o2ib. The two NIDs do not match, so the incoming messages are dropped and the client cannot mount.

      walleye-p5:~ # !grep
      grep lustre /etc/fstab
      10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 /lus/kjcf05 lustre rw,flock,lazystatfs,noauto 0 0
      walleye-p5:~ # mount /lus/kjcf05
      mount.lustre: mount 10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 at /lus/kjcf05 failed: Input/output error
      Is the MGS running?
      walleye-p5:~ #
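
The NID mismatch described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not LNet or ptlrpc code: the function names are hypothetical, and the "reverted" behavior (using the peer's real primary NID) is an assumption inferred from the fact that reverting the patch lets the mount succeed.

```python
# Toy model of the NID mismatch; none of these names are LNet or ptlrpc APIs.

MGS_NIDS = [
    "10.16.100.52@o2ib",    # actual primary NID of the MGS node
    "10.16.100.52@o2ib10",  # non-primary NID used by routed clients
]

def primary_nid_locked(mount_nid, peer_nids):
    """Assumed behavior with the commit: the NID given at mount time is
    kept as the primary NID, whatever the peer's real NID list says."""
    return mount_nid

def primary_nid_discovered(mount_nid, peer_nids):
    """Assumed behavior with the commit reverted: the peer's real primary
    NID (first in its list) is used when the mount NID belongs to it."""
    return peer_nids[0] if mount_nid in peer_nids else mount_nid

def message_accepted(connection_nid, source_nid):
    """A message is matched only if its source NID equals the connection NID."""
    return connection_nid == source_nid

mount_nid = "10.16.100.52@o2ib10"
reply_source = MGS_NIDS[0]  # messages from the MGS carry its actual primary NID

locked = primary_nid_locked(mount_nid, MGS_NIDS)
discovered = primary_nid_discovered(mount_nid, MGS_NIDS)

print(message_accepted(locked, reply_source))      # False: message dropped
print(message_accepted(discovered, reply_source))  # True: message matched
```

In this model the connection keyed on 10.16.100.52@o2ib10 never sees a message it recognizes, which corresponds to the "Is the MGS running?" failure in the transcript.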
      

      If I revert https://review.whamcloud.com/43563 then I'm able to mount:

      walleye-p5:~ # mount /lus/kjcf05
      walleye-p5:~ # lfs check servers
      kjcf05-OST0000-osc-ffff8888361cd000 active.
      kjcf05-OST0001-osc-ffff8888361cd000 active.
      kjcf05-OST0002-osc-ffff8888361cd000 active.
      kjcf05-OST0003-osc-ffff8888361cd000 active.
      kjcf05-MDT0000-mdc-ffff8888361cd000 active.
      kjcf05-MDT0001-mdc-ffff8888361cd000 active.
      MGC10.16.100.52@o2ib10 active.
      walleye-p5:~ #
      

      I think the regression isn't strictly limited to routed configurations; it affects any client mount where the client's initial connection attempt goes to a non-primary NID. This would be typical for routed clients. It is less common with direct connect, but it is possible there too (for example, with multi-homed servers).
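
As a rough way to reason about which mounts are exposed, the condition above can be written as a check on the mount device string. This is a hedged sketch: the helper names are hypothetical, it handles only IPv4-style NIDs, and the set of primary NIDs is an assumption taken from the MGS example earlier in this report (the o2ib NIDs are treated as the primaries).

```python
def parse_mgs_nids(device):
    """Split 'nid,nid:nid,nid:/fsname' into a flat list of MGS NIDs.
    Hypothetical helper; IPv4-style NIDs only."""
    mgs_part, sep, _fsname = device.rpartition(":/")
    if not sep or not mgs_part:
        raise ValueError(f"not a Lustre mount device: {device!r}")
    return [nid
            for group in mgs_part.split(":")   # failover MGS nodes
            for nid in group.split(",")        # NIDs of one node
            if nid]

def non_primary_mount_nids(device, primary_nids):
    """Return the mount NIDs that are not a known primary NID."""
    return [nid for nid in parse_mgs_nids(device) if nid not in primary_nids]

# Assumption: the o2ib NIDs are the primary NIDs of the two MGS nodes.
PRIMARY_NIDS = {"10.16.100.52@o2ib", "10.16.100.53@o2ib"}

routed = "10.16.100.52@o2ib10,10.16.100.53@o2ib10:/lustre"
direct = "10.16.100.52@o2ib,10.16.100.53@o2ib:/lustre"

print(non_primary_mount_nids(routed, PRIMARY_NIDS))
print(non_primary_mount_nids(direct, PRIMARY_NIDS))
```

A non-empty result means the client's initial connection may target a non-primary NID, which is the case this regression affects; an empty result means the mount uses only primary NIDs.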

            People

              hornc Chris Horn