Lustre / LU-12451

Out of router peer credits with DoM?

Details

    • Type: Question/Request
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6

    Description

      Hello,

      We have recently been having an issue with the LNet routers for Fir. I'm still in the process of understanding what's wrong, so I'm opening this as a question rather than a bug.

      Context: our Fir Lustre storage system (EDR, o2ib7) is connected to Sherlock's two IB fabrics, the old FDR fabric (o2ib6) and the new EDR fabric (o2ib4), through 4 LNet routers on each fabric (8 routers total).
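      For reference, the clients reach Fir through LNet routes declared via the lnet module options. A rough sketch of what this looks like on an o2ib6 (FDR) client; the NIDs below are placeholders for illustration, not our exact configuration:

      # /etc/modprobe.d/lnet.conf on an o2ib6 client (placeholder NIDs)
      # the o2ib7 route range expands to the four sh-rtr-fir-1-[1-4] routers
      options lnet networks="o2ib6(ib0)" routes="o2ib7 10.9.0.[25-28]@o2ib6"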

      The EDR/EDR routers seem fine.

      However, on the FDR/EDR side, we found 3 out of 4 routers in the following state, out of rtr credits to the Fir servers and no longer able to answer lctl ping:

      command: egrep 'nid|o2ib7' /sys/kernel/debug/lnet/peers
      ---------------
      sh-rtr-fir-1-4
      ---------------
      nid                      refs state  last   max   rtr   min    tx   min queue
      10.0.10.3@o2ib7             1    up   159     8     8     8     8     7 0
      10.0.10.105@o2ib7         535    up    47     8  -526  -526     8    -4 0
      10.0.10.102@o2ib7         172    up    47     8  -163  -341     8   -10 0
      10.0.10.107@o2ib7         481    up    47     8  -472  -472     8   -16 0
      10.0.10.52@o2ib7           65    up    46     8   -56  -152     8  -101 0
      10.0.10.104@o2ib7         537    up    47     8  -528  -528     8    -4 0
      10.0.10.101@o2ib7         473    up    47     8  -464  -464     8    -5 0
      10.0.10.106@o2ib7         473    up    47     8  -464  -464     8    -8 0
      10.0.10.51@o2ib7          218    up    30     8  -209  -209     8 -1311 0
      10.0.10.103@o2ib7         201    up    47     8  -192  -344     8    -6 0
      10.0.10.108@o2ib7         537    up    47     8  -528  -528     8    -1 0
      ---------------
      

      10.0.10.51 and 10.0.10.52 are Fir's MDS and 10.0.10.[101-108] are Fir's OSS.
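      For reference, the credit tunables behind the max/rtr/tx columns above can be checked on the routers like this (a sketch; paths assume the stock lnet/ko2iblnd module parameters on 2.12):

      # per-peer credits on the o2ib LND (this is where the "max" of 8 comes from)
      cat /sys/module/ko2iblnd/parameters/peer_credits
      cat /sys/module/ko2iblnd/parameters/peer_credits_hiw
      cat /sys/module/ko2iblnd/parameters/peer_buffer_credits
      # router buffer pools used for forwarded messages
      cat /sys/module/lnet/parameters/tiny_router_buffers
      cat /sys/module/lnet/parameters/small_router_buffers
      cat /sys/module/lnet/parameters/large_router_buffers
      # same tunables through lnetctl
      lnetctl net show -v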

      This is currently impacting sh-rtr-fir-1-[2-4], so we shut them down. If I put them back online, the same thing happens again. sh-rtr-fir-1-1 is the only router that seems to be working right now. Because a high number of jobs working with small files are currently running on Sherlock, this router is in the following state:

      [root@sh-rtr-fir-1-1 ~]# egrep 'nid|o2ib7' /sys/kernel/debug/lnet/peers
      nid                      refs state  last   max   rtr   min    tx   min queue
      10.0.10.3@o2ib7             1    up   137     8     8     8     8     7 0
      10.0.10.105@o2ib7           3    up    68     8     6    -8     8   -58 0
      10.0.10.102@o2ib7           4    up    70     8     5    -8     8   -69 0
      10.0.10.107@o2ib7           1    up    70     8     8   -16     8   -95 0
      10.0.10.52@o2ib7          112    up    58     8     8    -8  -103  -440 60368
      10.0.10.104@o2ib7           2    up    70     8     7    -8     8   -34 0
      10.0.10.101@o2ib7           2    up    67     8     7   -16     8   -47 0
      10.0.10.106@o2ib7           1    up    69     8     8   -16     8   -68 0
      10.0.10.51@o2ib7            1    up    70     8     8    -8     8  -869 0
      10.0.10.103@o2ib7           6    up    68     8     3    -8     8   -39 0
      10.0.10.108@o2ib7           1    up    67     8     8    -8     8   -72 0
      

      It's running just fine, even though it is out of tx peer credits for MDS 10.0.10.52 (fir-md1-s2), so there is likely some added latency here.
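      To keep an eye on this, I'm sampling the peers file and logging any Fir peer whose rtr or tx credits go negative (simple sketch, log path is arbitrary):

      # columns are: nid refs state last max rtr min tx min queue
      while true; do
          awk '/o2ib7/ && ($6 < 0 || $8 < 0) { print strftime("%F %T"), $0 }' \
              /sys/kernel/debug/lnet/peers >> /root/fir-peer-credits.log
          sleep 60
      done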

      The LNet nis stats seem OK:

      [root@sh-rtr-fir-1-1 ~]# cat /sys/kernel/debug/lnet/nis
      nid                      status alive refs peer  rtr   max    tx   min
      0@lo                       down     0    2    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      10.8.0.26@o2ib6              up     0   63    8    0   128   127    76
      10.8.0.26@o2ib6              up     0   67    8    0   128   128    78
      10.0.10.201@o2ib7            up     0  134    8    0   128   120    88
      10.0.10.201@o2ib7            up     0    4    8    0   128   128   101
      

      The running jobs are MATLAB jobs working with small files that all fit within our 128 KB DoM stripe:

      [root@sh-109-03 ~]# ls /fir/users/jiangjq/cache/Efficiency/RandomDeform_trainset_01/batch95080 -l
      total 18334
      -rw-r--r-- 1 jiangjq jonfan   28719 Jun 18 10:44 GAN95080.txt
      -rw-r--r-- 1 jiangjq jonfan    1032 Jun 18 00:06 GenNode95080.bash
      -rw-r--r-- 1 jiangjq jonfan     322 Jun 18 00:28 ret8I3O0.mat
      -rw-r--r-- 1 jiangjq jonfan    7029 Jun 18 00:27 ret8I3O100.mat
      -rw-r--r-- 1 jiangjq jonfan    6937 Jun 18 00:27 ret8I3O101.mat
      -rw-r--r-- 1 jiangjq jonfan    7036 Jun 18 00:27 ret8I3O102.mat
      -rw-r--r-- 1 jiangjq jonfan    7061 Jun 18 00:27 ret8I3O103.mat
      -rw-r--r-- 1 jiangjq jonfan    7037 Jun 18 00:27 ret8I3O104.mat
      -rw-r--r-- 1 jiangjq jonfan    6702 Jun 18 00:27 ret8I3O105.mat
      -rw-r--r-- 1 jiangjq jonfan    6843 Jun 18 00:27 ret8I3O106.mat
      -rw-r--r-- 1 jiangjq jonfan    7038 Jun 18 00:27 ret8I3O107.mat
      -rw-r--r-- 1 jiangjq jonfan    7020 Jun 18 00:27 ret8I3O108.mat
      -rw-r--r-- 1 jiangjq jonfan    7022 Jun 18 00:27 ret8I3O109.mat
      -rw-r--r-- 1 jiangjq jonfan   60976 Jun 18 00:25 ret8I3O10.mat
      ...
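      The DoM layout on these files can be double-checked with lfs getstripe; a sketch (the setstripe line is only an illustration of a 128k DoM layout, not necessarily our exact one):

      # show the composite layout; the first component should be a 128k MDT (DoM) component
      lfs getstripe /fir/users/jiangjq/cache/Efficiency/RandomDeform_trainset_01/batch95080/GAN95080.txt
      # a DoM layout of that kind would be created with something like:
      lfs setstripe -E 128K -L mdt -E -1 -c 1 <dir>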
      

      strace on the job processes shows very tiny reads (this is awful):

      [pid 100163] read(125, ".\0\0\0\1\0\0\0001\366\273\362\2\0\0\0\0*\6@\321\177\0\0\211\235\4\0E\315\3\0"..., 46) = 46
      [pid 418362] read(100, "3\0\0\0\1\0\0\0002\366\273\362\2\0\0\0P\251\3\260\321\177\0\0\211\235\4\0E\315\3\0"..., 51) = 51
      [pid 108791] read(114, "A\0\0\0\1\0\0\0003\366\273\362\2\0\0\0PG\6L\321\177\0\0\211\235\4\0E\315\3\0"..., 65) = 65
      [pid 242206] read(127, "4\0\0\0\1\0\0\0004\366\273\362\2\0\0\0\240\304\305\264\321\177\0\0\211\235\4\0E\315\3\0"..., 52) = 52
      [pid 455676] read(90, "5\0\0\0\1\0\0\0005\366\273\362\2\0\0\0\300(\t$\321\177\0\0\211\235\4\0E\315\3\0"..., 53) = 53
      [pid 345394] read(28, ".\0\0\0\1\0\0\0006\366\273\362\2\0\0\0\1\0\0\0\0\0\0\0\211\235\4\0E\315\3\0"..., 46) = 46
      [pid 205130] read(20, "0\0\0\0\1\0\0\0007\366\273\362\2\0\0\0\260\31\6\254\320\177\0\0\211\235\4\0E\315\3\0"..., 48) = 48
      [pid 248245] read(139, ".\0\0\0\1\0\0\0008\366\273\362\2\0\0\0\0*\6@\321\177\0\0\211\235\4\0E\315\3\0"..., 46) = 46
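      To quantify this, the read sizes can be summed from a captured strace log (quick sketch; strace.out stands in for wherever the strace output was saved):

      # count reads and compute the average read size from the saved strace output
      awk -F'= ' '/ read\(/ { n++; sum += $NF } END { if (n) printf "%d reads, avg %.1f bytes\n", n, sum/n }' strace.out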
      

      So this is generating a high load on the MDS with DoM. We've ordered two more MDS nodes so that we can run 1 MDT per MDS on Fir instead of 2 MDTs per MDS, which I think would help with DoM, but they are not installed yet.

      While I understand the RPC load with DoM, I cannot explain why 3 out of 4 routers are running out of rtr credits. I'm still investigating at the IB fabric level. Any suggestions on how we can troubleshoot this would be appreciated. Thanks!
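      In the meantime, here is what I'm planning to look at next on the routers, roughly (assuming these lnet debugfs files are present on this 2.12 build):

      # router buffer pool usage and low-water marks
      cat /sys/kernel/debug/lnet/buffers
      # route and router state as seen from a router
      cat /sys/kernel/debug/lnet/routes
      cat /sys/kernel/debug/lnet/routers
      # routing status and buffer configuration through lnetctl
      lnetctl routing show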

People

    Assignee: Amir Shehata (Inactive)
    Reporter: Stephane Thiell
