Lustre / LU-17544

With lock_prim_nid=1 it seems possible for an unreachable NID to become the primary NID

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.4
    • None
    • server 2.12.9

      server side
      lnet networks=o2ib2(op0),tcp1(ens6)

      client side
      lnet networks=o2ib2(op0)

    • 2

    Description

      After upgrading the client from 2.15.3 to 2.15.4, we see that for 2 HA pairs

      half of the OSTs cannot be accessed.

      They do not show up in lfs df,

      but they do show up as UP in lctl dl.

       

      In the lnetctl peer output we see:

          - primary nid: 10.84.200.32@tcp1   <<<<<<<
            Multi-Rail: True
            peer ni:
              - nid: 10.85.200.32@o2ib2
                state: NA
              - nid: 10.84.200.32@tcp1
                state: NA
          - primary nid: 10.85.200.33@o2ib2 
            Multi-Rail: True
            peer ni:
              - nid: 10.85.200.33@o2ib2
                state: NA
              - nid: 10.84.200.33@tcp1
                state: NA

      The client cannot reach the tcp1 network of the server, yet that NID is selected as the primary NID.
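      A minimal sketch of how this condition could be spotted mechanically. It assumes the peer listing has the text layout shown above and that the set of locally configured networks (here only o2ib2) is supplied by hand. The function name is hypothetical, and "network not configured locally" is only a heuristic for "unreachable", since LNet routing could make a remote network reachable.

```python
import re


def unreachable_primaries(peer_text, local_nets):
    """Return primary NIDs whose network is not in local_nets.

    Heuristic only: a network missing from local_nets is treated as
    unreachable, which ignores any LNet routes that may exist.
    """
    flagged = []
    for line in peer_text.splitlines():
        # Match only the "- primary nid: <nid>" lines, not "- nid:" entries.
        m = re.match(r"\s*-\s*primary nid:\s*(\S+)", line)
        if m:
            nid = m.group(1)
            net = nid.rsplit("@", 1)[-1]  # text after the last '@'
            if net not in local_nets:
                flagged.append(nid)
    return flagged


# Peer listing copied from the report above.
sample = """\
- primary nid: 10.84.200.32@tcp1
  Multi-Rail: True
  peer ni:
    - nid: 10.85.200.32@o2ib2
      state: NA
    - nid: 10.84.200.32@tcp1
      state: NA
- primary nid: 10.85.200.33@o2ib2
  Multi-Rail: True
  peer ni:
    - nid: 10.85.200.33@o2ib2
      state: NA
    - nid: 10.84.200.33@tcp1
      state: NA
"""

# The client only has o2ib2 configured, so the tcp1 primary is flagged.
print(unreachable_primaries(sample, {"o2ib2"}))
```

      With the listing above, only the first peer is flagged; the second peer's primary NID is on o2ib2, which the client has configured.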

       

      I can either delete the NID with lnetctl peer del to make lfs df show all OSTs,

      or I can use

      lnet lock_prim_nid=0

      to make it work.

       

      That points towards LU-14668; I would also verify that with a git bisect to commit 6cfc8e55a2e77c9c91b81a8842e2cbd886025298.

       

      It seems strange that an unreachable NID can be the primary NID. Is that intended?

       

       

            People

              pjones Peter Jones
              hberger Holger Berger