[LU-12274] Clients aren't connecting to OST defined failover.node

Details

    • Question/Request
    • Resolution: Unresolved
    • Blocker
    • None
    • Lustre 2.10.2
    • None
    • Servers: Lustre-2.10.2, Kernel: 3.10.0-693.5.2.el7_lustre.x86_64
      Clients: Lustre-2.10.3, Kernel: 3.10.0-693.21.1.el7.x86_64
      Client/Server OS: CentOS Linux release 7.4.1708

    Description

      I ran tunefs.lustre and successfully changed the failover NIDs to what they should be, but the clients are still not connecting to the configured failover node. The problem affects several OSTs, but whatever fixes one should fix them all.

      I'm assuming I forgot a step when I ran tunefs.lustre.

      tunefs.lustre --erase-param failover.node --param failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 /dev/mapper/mpathg

      The OST OST0017 is mounted on 172.17.1.103 with the following parameters:

      [root@apslstr03 ~]# tunefs.lustre --dryrun /dev/mapper/mpathg
      checking for existing Lustre data: found
      Reading CONFIGS/mountdata

         Read previous values:
      Target:     lustrefc-OST0017
      Index:      23
      Lustre FS:  lustrefc
      Mount type: ldiskfs
      Flags:      0x2
                    (OST )
      Persistent mount opts: ,errors=remount-ro
      Parameters:  failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1 mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1

         Permanent disk data:
      Target:     lustrefc-OST0017
      Index:      23
      Lustre FS:  lustrefc
      Mount type: ldiskfs
      Flags:      0x2
                    (OST )
      Persistent mount opts: ,errors=remount-ro
      Parameters:  failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1 mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1

      exiting before disk write.
      [root@apslstr03 ~]#
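To compare the on-disk settings across several OSTs, the `Parameters:` line from the dry-run output can be split into its keys and NID lists. The following is a minimal sketch, not part of any Lustre tooling: the function name `parse_parameters` is hypothetical, and it assumes the line has the same `key=value` layout as the output shown above.

```python
from collections import defaultdict


def parse_parameters(line):
    """Split a 'Parameters:' line into {key: [[nid, ...], ...]}.

    Hypothetical helper: assumes space-separated key=value tokens,
    with comma-separated NIDs in each value, as in the output above.
    """
    _, _, rest = line.partition("Parameters:")
    params = defaultdict(list)
    for token in rest.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip anything that is not key=value
        params[key].append(value.split(","))
    return dict(params)


# The Parameters line reported for lustrefc-OST0017.
line = (
    "Parameters:  failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 "
    "mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1 "
    "mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1"
)
params = parse_parameters(line)
print(params["failover.node"])
```

A repeated key such as `mgsnode` yields one list entry per occurrence, so both MGS nodes are preserved.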

      However, the clients are still logging errors like the following, which show requests for OST0017 going to 172.17.1.106@o2ib rather than 172.17.1.103@o2ib:

      May  8 11:43:33 localhost kernel: Lustre: 2028:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1557333772/real 0] req@ffff880bd9296f00 x1632920191594624/t0(0) o8->lustrefc-OST0017-osc-ffff8817ef372000@172.17.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1557333813 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      May  8 11:43:33 localhost kernel: Lustre: 2028:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 65 previous similar messages
      May  8 11:45:26 localhost kernel: LNet: 1994:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.17.1.106@o2ib: 3 seconds
      May  8 11:45:26 localhost kernel: LNet: 1994:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 39 previous similar messages
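The mismatch can be checked mechanically by pulling the target name and peer NID out of the client log line and comparing the NID with the configured `failover.node` values. This is a rough sketch under stated assumptions: the helper name `request_peer` is hypothetical, and the regular expressions are inferred from the single `ptlrpc_expire_one_request()` line quoted above, not from a documented log format.

```python
import re

# Matches the "->import@nid:" part of the request description, e.g.
# "o8->lustrefc-OST0017-osc-ffff8817ef372000@172.17.1.106@o2ib:28/4".
PEER_RE = re.compile(r"->(?P<imp>[^@\s]+)@(?P<nid>[^:\s]+):")
# Pulls "fsname-OSTxxxx" (or MDT) from the front of the import name.
TARGET_RE = re.compile(r"^(?P<target>.+?-(?:OST|MDT)[0-9a-fA-F]{4})")


def request_peer(log_line):
    """Return (target, nid) from a client log line, or None if absent."""
    peer = PEER_RE.search(log_line)
    if peer is None:
        return None
    target = TARGET_RE.match(peer.group("imp"))
    name = target.group("target") if target else peer.group("imp")
    return name, peer.group("nid")


log_line = (
    "May  8 11:43:33 localhost kernel: Lustre: "
    "2028:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent "
    "has timed out for sent delay: [sent 1557333772/real 0] "
    "req@ffff880bd9296f00 x1632920191594624/t0(0) "
    "o8->lustrefc-OST0017-osc-ffff8817ef372000@172.17.1.106@o2ib:28/4 "
    "lens 520/544 e 0 to 1 dl 1557333813 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1"
)

# NIDs from the failover.node parameter written with tunefs.lustre.
configured = {"172.17.1.103@o2ib", "172.16.1.103@tcp1"}

result = request_peer(log_line)
target, nid = result
print(target, nid, "configured" if nid in configured else "NOT configured")
```

For the line above, the extracted NID is not among the configured failover NIDs, which matches the behavior described in this ticket.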

        Activity

          rs1 Roger Sersted made changes -
          Attachment New: lustrefc-client.txt [ 32558 ]
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Sebastien Buisson [ sebastien ]
          rs1 Roger Sersted created issue -

          People

            sebastien Sebastien Buisson
            rs1 Roger Sersted