LU-11888

Unreachable client NID confusing Lustre 2.12


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6

    Description

      Just wanted to report this, although it is probably not critical. While testing 2.12.0 on an IB-only setup (o2ib with routers), we mistakenly set up a client with two NIDs, one of them on tcp0:

      [root@sh-06-33 ~]# lctl list_nids
      10.10.6.33@tcp
      10.8.6.33@o2ib6
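
      A check along these lines might catch a stray NID like this before the client mounts. This is only a minimal sketch: it parses captured `lctl list_nids` output (without the shell prompt) and flags any NID whose network is not in an expected set. The function name `unexpected_nids` and the `allowed_nets` parameter are hypothetical and not part of Lustre.

```python
def unexpected_nids(list_nids_output, allowed_nets):
    """Return NIDs from captured `lctl list_nids` output whose LNet
    network is not in allowed_nets (hypothetical helper, not a Lustre API)."""
    flagged = []
    for line in list_nids_output.splitlines():
        nid = line.strip()
        if not nid or "@" not in nid:
            continue  # skip blank lines and anything that is not addr@net
        _addr, net = nid.rsplit("@", 1)
        if net not in allowed_nets:
            flagged.append(nid)
    return flagged


# Output copied from the client above; only o2ib6 was intended.
sample = """10.10.6.33@tcp
10.8.6.33@o2ib6
"""
print(unexpected_nids(sample, {"o2ib6"}))
```

      With the client output above and `o2ib6` as the only expected network, the tcp NID is the one that gets flagged.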
      

      This confused the Lustre servers a LOT:

      [663974.083382] LNetError: 124939:0:(peer.c:2480:lnet_peer_merge_data()) Error deleting NID 10.10.6.33@tcp from peer 10.10.6.33@tcp: -16
      [663974.095393] Lustre: MGS: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
      [663981.577418] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) no route to 10.10.6.33@tcp from <?>
      [663981.588032] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) Skipped 6171721 previous similar messages
      [663981.599239] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548442345/real 1548442345]  req@ffff9149c37fb600 x1623580872768976/t0(0) o104->fir-MDT0000@10.10.6.33@tcp:15/16 lens 296/224 e 0 to 1 dl 1548442356 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
      [663981.626855] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6167666 previous similar messages
      [664132.508056] LustreError: 127396:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.10.6.33@tcp) failed to reply to blocking AST (req@ffff9149c37fb600 x1623580872768976 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 109 pid: 125204 timeout: 664264 lvb_type: 0
      [664132.550562] LustreError: 138-a: fir-MDT0000: A client on nid 10.10.6.33@tcp was evicted due to a lock blocking callback time out: rc -110
      [664132.563014] LustreError: 125084:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.10.6.33@tcp  ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 110 pid: 125204 timeout: 0 lvb_type: 0
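
      For reference, the return codes in these messages are negated Linux errno values: -16 is EBUSY and -110 is ETIMEDOUT. The sketch below pulls them out of two of the log lines above. The helper name `decode_return_codes` and the regular expression are illustrative assumptions, and the errno table is deliberately limited to the two values that appear in this log.

```python
import re

# Linux errno values only, hard-coded so the result does not depend on the
# platform running this script. Limited to the codes seen in this log.
LINUX_ERRNO = {16: "EBUSY", 110: "ETIMEDOUT"}

# Matches "rc -110" inside a message or a ": -16" suffix.
RC_PATTERN = re.compile(r"(?:\brc |: )(-\d+)\b")


def decode_return_codes(log_text):
    """Map each negative return code found in log_text to a Linux errno name."""
    found = {}
    for match in RC_PATTERN.finditer(log_text):
        rc = int(match.group(1))
        found[rc] = LINUX_ERRNO.get(-rc, "unknown")
    return found


# Two lines copied verbatim from the server log above.
sample_log = (
    "[663974.083382] LNetError: 124939:0:(peer.c:2480:lnet_peer_merge_data()) "
    "Error deleting NID 10.10.6.33@tcp from peer 10.10.6.33@tcp: -16\n"
    "[664132.550562] LustreError: 138-a: fir-MDT0000: A client on nid "
    "10.10.6.33@tcp was evicted due to a lock blocking callback time out: rc -110\n"
)
print(decode_return_codes(sample_log))
```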
      

      But the main problem is that, even after the client was rebooted without the tcp0 NID, the server was still logging things like:

      [664150.993807] Lustre: fir-MDT0000: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
      

      Re: this last line, it was logged after the client had rebooted. It looks like the server only prints the first client NID, but in this case it remembered the client's previous tcp0 NID, which is weird...

      The servers are using o2ib only:

      [root@fir-md1-s1 ~]# lctl list_nids
      10.0.10.51@o2ib7
      [root@fir-md1-s1 ~]# lctl route_list
      net              o2ib4 hops 4294967295 gw                10.0.10.210@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.209@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.211@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.212@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.202@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.204@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.201@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.203@o2ib7 up pri 0
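
      The "no route to 10.10.6.33@tcp" errors are consistent with this configuration: the server has no local tcp network and no route to one. The sketch below illustrates that check on captured command output (without the shell prompts). It is a simplification and an assumption about how to read the output, not how LNet itself selects a path: it only compares network names, and `reachable_nets` and `has_path_to` are hypothetical helper names.

```python
def reachable_nets(list_nids_output, route_list_output):
    """Collect LNet network names that are local (from `lctl list_nids`)
    or routed (from `lctl route_list`). Hypothetical helper."""
    nets = set()
    for line in list_nids_output.splitlines():
        nid = line.strip()
        if "@" in nid:
            nets.add(nid.rsplit("@", 1)[1])  # local network of this NID
    for line in route_list_output.splitlines():
        fields = line.split()
        # Route lines look like: net <name> hops <n> gw <nid> up pri <p>
        if len(fields) >= 2 and fields[0] == "net":
            nets.add(fields[1])
    return nets


def has_path_to(nid, nets):
    """True if the NID's network name is among the reachable networks."""
    return nid.rsplit("@", 1)[1] in nets


# Output copied from the server above (one route line per remote net).
server_nids = "10.0.10.51@o2ib7\n"
server_routes = """net              o2ib4 hops 4294967295 gw                10.0.10.210@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.202@o2ib7 up pri 0
"""
nets = reachable_nets(server_nids, server_routes)
print(sorted(nets))
print(has_path_to("10.10.6.33@tcp", nets))
print(has_path_to("10.8.6.33@o2ib6", nets))
```

      Under these assumptions the client's o2ib6 NID is reachable through the routers, while its tcp NID is not reachable at all.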
      

      We are wondering how this is even possible. To fix it in a timely manner, we had to restart the Lustre servers.

      Stephane


            People

              Assignee: sharmaso Sonia Sharma (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 6
