Details

Type: Question/Request
Resolution: Unresolved
Priority: Critical
Affects Version: Lustre 2.12.0
Environment: CentOS 7.6
Description
Hello,
We have recently been having an issue with the LNet routers for Fir. I'm still in the process of understanding what's wrong, so I'm opening this as a question rather than a bug.
Context: our Fir Lustre storage system (EDR, o2ib7) is connected to the two IB fabrics of Sherlock, the old FDR fabric (o2ib6) and the new EDR fabric (o2ib4), through 4 LNet routers per fabric (8 routers total).
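For context, the routing setup is the usual dual-homed LNet router arrangement. A minimal sketch of that kind of configuration is below; the interface names are assumptions and only one gateway NID per route line is shown, so treat it as illustrative rather than our exact files:
# Router side (sketch; ib0/ib1 interface names are assumptions): both fabrics
# are local networks and forwarding is enabled so the node routes between them.
options lnet networks="o2ib7(ib0),o2ib6(ib1)" forwarding="enabled"
# Fir server side (sketch): routes to the FDR fabric go through the routers'
# o2ib7 NIDs; only sh-rtr-fir-1-1's NID (10.0.10.201@o2ib7, seen in the nis
# output further down) is shown, the other three gateways are omitted here.
options lnet routes="o2ib6 10.0.10.201@o2ib7"
# Sherlock client side on o2ib6 (sketch): the reverse route to Fir through the
# routers' o2ib6 NIDs (only 10.8.0.26@o2ib6 is shown).
options lnet routes="o2ib7 10.8.0.26@o2ib6"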
The EDR/EDR routers seem fine.
However, on the FDR/EDR side, we found 3 out of 4 routers in the following state: out of rtr credits to the Fir servers and no longer able to answer lctl ping:
command: egrep 'nid|o2ib7' /sys/kernel/debug/lnet/peers
--------------- sh-rtr-fir-1-4 ---------------
nid                  refs  state  last  max   rtr   min  tx    min  queue
10.0.10.3@o2ib7         1  up      159    8     8     8   8      7      0
10.0.10.105@o2ib7     535  up       47    8  -526  -526   8     -4      0
10.0.10.102@o2ib7     172  up       47    8  -163  -341   8    -10      0
10.0.10.107@o2ib7     481  up       47    8  -472  -472   8    -16      0
10.0.10.52@o2ib7       65  up       46    8   -56  -152   8   -101      0
10.0.10.104@o2ib7     537  up       47    8  -528  -528   8     -4      0
10.0.10.101@o2ib7     473  up       47    8  -464  -464   8     -5      0
10.0.10.106@o2ib7     473  up       47    8  -464  -464   8     -8      0
10.0.10.51@o2ib7      218  up       30    8  -209  -209   8  -1311      0
10.0.10.103@o2ib7     201  up       47    8  -192  -344   8     -6      0
10.0.10.108@o2ib7     537  up       47    8  -528  -528   8     -1      0
---------------
10.0.10.51 and 10.0.10.52 are Fir's MDS and 10.0.10.[101-108] are Fir's OSS.
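To spot a router that has fallen into this state quickly, something like the following can be run on each router (a sketch; it only flags peers whose current rtr credit count has gone negative):
# flag any o2ib7 peer whose current rtr credit count (column 6) has gone
# negative; columns are: nid refs state last max rtr min tx min queue
awk '/o2ib7/ && $6 < 0 {print $1, "rtr=" $6, "rtr_min=" $7, "tx=" $8}' /sys/kernel/debug/lnet/peers
# (the same line can be pushed to all four routers with clush -w sh-rtr-fir-1-[1-4])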
This is currently impacting sh-rtr-fir-1-[2-4], so we shut them down. If I put them back online, the same thing happens again. sh-rtr-fir-1-1 is the only router that seems to be working right now. Because a high number of jobs working with small files are currently running on Sherlock, this router is in the following state:
[root@sh-rtr-fir-1-1 ~]# egrep 'nid|o2ib7' /sys/kernel/debug/lnet/peers
nid                  refs  state  last  max  rtr  min    tx   min  queue
10.0.10.3@o2ib7         1  up      137    8    8    8     8     7      0
10.0.10.105@o2ib7       3  up       68    8    6   -8     8   -58      0
10.0.10.102@o2ib7       4  up       70    8    5   -8     8   -69      0
10.0.10.107@o2ib7       1  up       70    8    8  -16     8   -95      0
10.0.10.52@o2ib7      112  up       58    8    8   -8  -103  -440  60368
10.0.10.104@o2ib7       2  up       70    8    7   -8     8   -34      0
10.0.10.101@o2ib7       2  up       67    8    7  -16     8   -47      0
10.0.10.106@o2ib7       1  up       69    8    8  -16     8   -68      0
10.0.10.51@o2ib7        1  up       70    8    8   -8     8  -869      0
10.0.10.103@o2ib7       6  up       68    8    3   -8     8   -39      0
10.0.10.108@o2ib7       1  up       67    8    8   -8     8   -72      0
It's running just fine, even though it is running out of tx peer credits for MDS 10.0.10.52 (fir-md1-s2), so there is likely some added latency here.
The local NIs seem OK:
[root@sh-rtr-fir-1-1 ~]# cat /sys/kernel/debug/lnet/nis
nid                  status  alive  refs  peer  rtr  max   tx  min
0@lo                 down        0     2     0    0    0    0    0
0@lo                 down        0     0     0    0    0    0    0
10.8.0.26@o2ib6      up          0    63     8    0  128  127   76
10.8.0.26@o2ib6      up          0    67     8    0  128  128   78
10.0.10.201@o2ib7    up          0   134     8    0  128  120   88
10.0.10.201@o2ib7    up          0     4     8    0  128  128  101
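The same counters can be cross-checked with lnetctl and the router buffer file; this is roughly what I look at (output omitted here):
# per-NI credits in YAML form (same counters as the nis file)
lnetctl net show -v
# global LNet counters, including dropped messages
lnetctl stats show
# routing status (and, on routers, the buffer pools)
lnetctl routing show
# router buffer usage (credits and low-water marks)
cat /sys/kernel/debug/lnet/buffers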
The jobs currently running are MATLAB jobs working with small files that all fit within our 128KB DoM stripe:
[root@sh-109-03 ~]# ls /fir/users/jiangjq/cache/Efficiency/RandomDeform_trainset_01/batch95080 -l
total 18334
-rw-r--r-- 1 jiangjq jonfan 28719 Jun 18 10:44 GAN95080.txt
-rw-r--r-- 1 jiangjq jonfan  1032 Jun 18 00:06 GenNode95080.bash
-rw-r--r-- 1 jiangjq jonfan   322 Jun 18 00:28 ret8I3O0.mat
-rw-r--r-- 1 jiangjq jonfan  7029 Jun 18 00:27 ret8I3O100.mat
-rw-r--r-- 1 jiangjq jonfan  6937 Jun 18 00:27 ret8I3O101.mat
-rw-r--r-- 1 jiangjq jonfan  7036 Jun 18 00:27 ret8I3O102.mat
-rw-r--r-- 1 jiangjq jonfan  7061 Jun 18 00:27 ret8I3O103.mat
-rw-r--r-- 1 jiangjq jonfan  7037 Jun 18 00:27 ret8I3O104.mat
-rw-r--r-- 1 jiangjq jonfan  6702 Jun 18 00:27 ret8I3O105.mat
-rw-r--r-- 1 jiangjq jonfan  6843 Jun 18 00:27 ret8I3O106.mat
-rw-r--r-- 1 jiangjq jonfan  7038 Jun 18 00:27 ret8I3O107.mat
-rw-r--r-- 1 jiangjq jonfan  7020 Jun 18 00:27 ret8I3O108.mat
-rw-r--r-- 1 jiangjq jonfan  7022 Jun 18 00:27 ret8I3O109.mat
-rw-r--r-- 1 jiangjq jonfan 60976 Jun 18 00:25 ret8I3O10.mat
...
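As a sanity check that these files really are DoM-resident, the component layout of one of them can be inspected (path taken from the listing above):
# the first component of the composite layout should be the 128KB DoM (mdt)
# component; files smaller than that never touch the OSTs
lfs getstripe /fir/users/jiangjq/cache/Efficiency/RandomDeform_trainset_01/batch95080/ret8I3O100.mat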
strace on the job processes shows very tiny reads (this is awful):
[pid 100163] read(125, ".\0\0\0\1\0\0\0001\366\273\362\2\0\0\0\0*\6@\321\177\0\0\211\235\4\0E\315\3\0"..., 46) = 46
[pid 418362] read(100, "3\0\0\0\1\0\0\0002\366\273\362\2\0\0\0P\251\3\260\321\177\0\0\211\235\4\0E\315\3\0"..., 51) = 51
[pid 108791] read(114, "A\0\0\0\1\0\0\0003\366\273\362\2\0\0\0PG\6L\321\177\0\0\211\235\4\0E\315\3\0"..., 65) = 65
[pid 242206] read(127, "4\0\0\0\1\0\0\0004\366\273\362\2\0\0\0\240\304\305\264\321\177\0\0\211\235\4\0E\315\3\0"..., 52) = 52
[pid 455676] read(90, "5\0\0\0\1\0\0\0005\366\273\362\2\0\0\0\300(\t$\321\177\0\0\211\235\4\0E\315\3\0"..., 53) = 53
[pid 345394] read(28, ".\0\0\0\1\0\0\0006\366\273\362\2\0\0\0\1\0\0\0\0\0\0\0\211\235\4\0E\315\3\0"..., 46) = 46
[pid 205130] read(20, "0\0\0\0\1\0\0\0007\366\273\362\2\0\0\0\260\31\6\254\320\177\0\0\211\235\4\0E\315\3\0"..., 48) = 48
[pid 248245] read(139, ".\0\0\0\1\0\0\0008\366\273\362\2\0\0\0\0*\6@\321\177\0\0\211\235\4\0E\315\3\0"..., 46) = 46
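To put a number on it, the read sizes from a saved strace log can be summarized with something like the following (a sketch; job.strace is just a placeholder filename):
# count read() calls and how many return less than 1 KiB
awk '/ read\(/ { n = $NF + 0; total++; if (n < 1024) small++ }
     END { if (total) printf "reads=%d  under_1KiB=%d (%.1f%%)\n", total, small, 100*small/total }' job.strace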
So this is generating a high load on the MDS with DoM. We've ordered two more MDSes so that we can run 1 MDT per MDS on Fir instead of 2 MDTs per MDS, which I think would help with DoM, but they are not installed yet.
While I understand the RPC load with DoM, I cannot explain why 3 out of 4 routers are running out of rtr credits. I'm still investigating at the IB fabric level. Any suggestions on how we can troubleshoot this would be appreciated. Thanks!
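For the IB-level investigation, these are roughly the standard checks I'm starting with (a sketch; the HCA name and port are assumptions):
# error and discard counters across the fabric
ibqueryerrors
# extended port counters on a router HCA (mlx5_0 port 1 is an assumption)
perfquery -x -C mlx5_0 -P 1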