LU-14589

LNet routers don't reset peers after they reboot


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.14.0
    • None
    • 3

    Description

      We are now using 2.14 LNet routers with 2.12.6 servers and a mix of 2.13 and 2.14 clients, and the routers appear to be less resilient to client node failures.

      Two peers (clients) running Lustre 2.13 were in a bad state (RDMA timeout, but no apparent IB fabric problem), and our 2.14 routers held a large number of refs for them. These are peers 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2 below. We rebooted both clients, but even after the reboot they could not mount the filesystem, likely because the routers still held the old references.

      This is the peer table on one of the routers after the reboot of 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2:

      [root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
      nid                      refs state  last   max   rtr   min    tx   min queue
      10.50.7.38@o2ib2            3    up    -1     8     6    -8     8   -39 0
      10.50.15.11@o2ib2          17    up    -1     8    -8    -8     8   -43 0
      10.50.5.55@o2ib2           14    up    -1     8    -5    -8     8   -16 0
      10.50.14.13@o2ib2          17    up    -1     8    -8    -8     8   -19 0
      10.50.5.68@o2ib2           17    up    -1     8    -8    -8     8   -27 0
      10.50.5.47@o2ib2           17    up    -1     8    -8    -8     8   -45 0
      10.50.5.60@o2ib2           17    up    -1     8    -8    -8     8   -61 0
      10.50.16.6@o2ib2            4    up    -1     8     5    -8     8   -53 0
      10.50.4.41@o2ib2            7    up    -1     8     2    -8     8   -15 0
      10.50.14.10@o2ib2           8    up    -1     8     1    -8     8   -21 0
      10.50.12.14@o2ib2        29128    up    -1     8 -29119 -29119     8   -82 0    <<<<
      10.50.1.60@o2ib2            8    up    -1     8     1    -8     8   -20 0
      10.50.13.4@o2ib2            3    up    -1     8     6    -8     8   -20 0
      10.50.14.15@o2ib2           7    up    -1     8     2    -8     8   -28 0
      10.50.1.18@o2ib2            3    up    -1     8     6    -8     8   -13 0
      10.50.15.5@o2ib2           17    up    -1     8    -8    -8     8   -20 0
      10.50.7.3@o2ib2             3    up    -1     8     6    -8     8   -32 0
      10.50.13.14@o2ib2           4    up    -1     8     5    -8     8   -37 0
      10.50.5.67@o2ib2            4    up    -1     8     5    -8     8   -28 0
      10.50.10.41@o2ib2           3    up    -1     8     6    -8     8   -18 0
      10.50.0.64@o2ib2            5    up    -1     8     4   -24     8   -75 0
      10.50.5.38@o2ib2            3    up    -1     8     6    -8     8   -33 0
      10.50.13.11@o2ib2          15    up    -1     8    -6    -8     8   -28 0
      10.50.5.9@o2ib2             4    up    -1     8     5    -8     8   -20 0
      10.50.12.13@o2ib2           4    up    -1     8     5    -8     8   -39 0
      10.50.5.43@o2ib2            4    up    -1     8     5    -8     8   -24 0
      10.50.1.59@o2ib2            6    up    -1     8     3    -8     8   -20 0
      10.50.12.5@o2ib2            3    up    -1     8     6   -40     8   -52 0
      10.50.16.2@o2ib2           16    up    -1     8    -7    -8     8   -25 0
      10.50.4.8@o2ib2            12    up    -1     8    -3   -40     8   -37 0
      10.50.15.9@o2ib2            8    up    -1     8     1    -8     8   -27 0
      10.50.5.53@o2ib2           17    up    -1     8    -8    -8     8   -26 0
      10.50.5.32@o2ib2            3    up    -1     8     6    -8     8   -28 0
      10.50.13.13@o2ib2           6    up    -1     8     3    -8     8   -41 0
      10.0.2.114@o2ib5          203    up    -1     8     8    -8  -194 -1877 15190572
      10.50.12.15@o2ib2        36367    up    -1     8 -36358 -36358     8   -76 0      <<<
      10.50.10.6@o2ib2            3    up    -1     8     6    -8     8   -23 0
      10.50.15.1@o2ib2            3    up    -1     8     6    -8     8   -19 0
      10.50.8.44@o2ib2            3    up    -1     8     6    -8     8   -10 0
      

      The routers were shown as up from both peers, but the filesystem couldn't be mounted.
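The two stuck peers can also be picked out of the peer table programmatically. The following is a minimal sketch, not part of any Lustre tool: the names `Peer`, `parse_peers`, `stuck_peers` and `refs_from_credits`, and the threshold of 1000 refs, are assumptions made for illustration. It skips the header explicitly; the awk filter above prints the header because `refs` is not numeric and is therefore compared with 2 as a string. The sample rows are copied from the output above.

```python
from typing import List, NamedTuple


class Peer(NamedTuple):
    nid: str
    refs: int
    max_credits: int
    rtr: int
    tx: int


def parse_peers(text: str) -> List[Peer]:
    """Parse the text of /sys/kernel/debug/lnet/peers into Peer rows."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompts and the header (refs column not numeric).
        if len(fields) < 10 or not fields[1].lstrip("-").isdigit():
            continue
        # Columns: nid refs state last max rtr min tx min queue
        # Anything after the tenth column (such as a "<<<<" marker) is ignored.
        peers.append(Peer(fields[0], int(fields[1]), int(fields[4]),
                          int(fields[5]), int(fields[7])))
    return peers


def stuck_peers(peers: List[Peer], refs_threshold: int = 1000) -> List[Peer]:
    """Return peers whose ref count exceeds an (arbitrary) threshold."""
    return [p for p in peers if p.refs > refs_threshold]


def refs_from_credits(p: Peer) -> int:
    """Relation observed in this table only, not a documented LNet invariant."""
    return 1 + (p.max_credits - p.rtr) + (p.max_credits - p.tx)


SAMPLE = """\
nid                      refs state  last   max   rtr   min    tx   min queue
10.50.7.38@o2ib2            3    up    -1     8     6    -8     8   -39 0
10.50.12.14@o2ib2        29128    up    -1     8 -29119 -29119     8   -82 0    <<<<
10.0.2.114@o2ib5          203    up    -1     8     8    -8  -194 -1877 15190572
10.50.12.15@o2ib2        36367    up    -1     8 -36358 -36358     8   -76 0      <<<
"""

peers = parse_peers(SAMPLE)
for p in stuck_peers(peers):
    print(p.nid, p.refs, p.rtr)
```

In every row of the table above, refs equals 1 + (max - rtr) + (max - tx), so the huge ref counts on the two rebooted clients track their outstanding router credits. This is only an observation from this output, not a statement about how LNet accounts for refs.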

      We manually deleted both peers from the routers:

      [root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.15@o2ib2
      [root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.14@o2ib2
      

      The situation was much better after that in terms of peers:

      [root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
      nid                      refs state  last   max   rtr   min    tx   min queue
      [root@sh02-oak02 ~]# 
      

      And we were able to mount the filesystem on these two peers without problem.
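When several peers need the same workaround, the deletion commands can be generated rather than typed. This is a minimal dry-run sketch: the helper name `peer_del_commands` and the IPv4-only NID pattern are assumptions for illustration, and the only command it emits is the `lnetctl peer del --prim_nid` form used above. It prints the commands without running them.

```python
import re
from typing import Iterable, List

# Assumption: only IPv4 NIDs such as 10.50.12.14@o2ib2 are expected here.
NID_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}@[a-z0-9]+$")


def peer_del_commands(nids: Iterable[str]) -> List[List[str]]:
    """Build one 'lnetctl peer del' argument vector per NID, without executing it."""
    commands = []
    for nid in nids:
        if not NID_RE.match(nid):
            raise ValueError(f"not an IPv4 NID: {nid!r}")
        commands.append(["lnetctl", "peer", "del", "--prim_nid", nid])
    return commands


# Dry run: print the commands so they can be reviewed before running them on a router.
for argv in peer_del_commands(["10.50.12.15@o2ib2", "10.50.12.14@o2ib2"]):
    print(" ".join(argv))
```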

      However, the refs have not been cleared in the nis file:

      [root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis 
      nid                      status alive refs peer  rtr   max    tx   min
      0@lo                         up     0    2    0    0     0     0     0
      10.50.0.132@o2ib2            up     0 65815    8    0   256   256   195
      10.0.2.215@o2ib5             up     0  225    8    0   256   248   179
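As a rough cross-check, the refs left on the o2ib2 interface can be compared with the refs the two deleted peers held in the earlier peer table. This is a minimal sketch using only numbers from the outputs in this report; `parse_ni_refs` is a hypothetical helper, and the comparison is a numerical observation, not a diagnosis, since the two outputs were captured at different times.

```python
from typing import Dict


def parse_ni_refs(text: str) -> Dict[str, int]:
    """Map each NID in /sys/kernel/debug/lnet/nis output to its refs column."""
    refs = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: nid status alive refs peer rtr max tx min
        if len(fields) == 9 and fields[3].isdigit():
            refs[fields[0]] = int(fields[3])
    return refs


NIS = """\
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
10.50.0.132@o2ib2            up     0 65815    8    0   256   256   195
10.0.2.215@o2ib5             up     0  225    8    0   256   248   179
"""

# Refs held by the two clients in the peer table before they were deleted.
deleted_peer_refs = {"10.50.12.14@o2ib2": 29128, "10.50.12.15@o2ib2": 36367}

ni_refs = parse_ni_refs(NIS)
peer_total = sum(deleted_peer_refs.values())
remainder = ni_refs["10.50.0.132@o2ib2"] - peer_total

print(peer_total)                    # 65495
print(ni_refs["10.50.0.132@o2ib2"])  # 65815
print(remainder)                     # 320
```

The NI ref count is of the same magnitude as the sum of the refs held by the two deleted peers, which is consistent with those references not having been released on the interface.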
      

       

      Our config:

      Routers:

      [root@sh02-oak02 ~]# lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 50
          health_sensitivity: 0
          recovery_interval: 1
          router_sensitivity: 100
          lnd_timeout: 49
          response_tracking: 3
          recovery_limit: 0
      

      Clients:

      [root@sh02-12n14 ~]# lnetctl global show -v 3
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 50
          health_sensitivity: 0
          recovery_interval: 1
          router_sensitivity: 100
          lnd_timeout: 49
          response_tracking: 3
          recovery_limit: 0
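To confirm that routers and clients run with the same global LNet settings, the two outputs can be compared mechanically. This is a minimal sketch that assumes the flat `key: value` layout shown above with integer values only; `parse_global` and `diff_settings` are hypothetical helpers, not part of lnetctl.

```python
from typing import Dict, Tuple


def parse_global(text: str) -> Dict[str, int]:
    """Parse 'lnetctl global show' output into a dict of integer settings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # The 'global:' heading has no value and is skipped.
        if sep and value.strip():
            settings[key] = int(value)
    return settings


def diff_settings(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, Tuple]:
    """Return {key: (value_in_a, value_in_b)} for every setting that differs."""
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


ROUTER = """\
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0
"""

# The client output above is identical to the router output.
CLIENT = ROUTER

differences = diff_settings(parse_global(ROUTER), parse_global(CLIENT))
print(differences)  # {}
```

An empty result means both sides report the same values, as is the case for the two outputs in this report.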
      

      Attachments

        1. sh02-fir02.dk-lnet.log.gz
          4.27 MB
        2. sh02-fir02.netshow.txt
          4 kB
        3. sh02-fir02.peers.txt
          63 kB
        4. sh02-fir02.peershow.all.txt
          906 kB
        6. sh02-oak02_peers.txt
          63 kB

        Activity

          People

            ashehata Amir Shehata (Inactive)
            sthiell Stephane Thiell
