Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.13.0, Lustre 2.12.2
- version=2.12.53_13_g4191e0c
- 3
- 9223372036854775807
Description
Soak has been running on the master branch, version 2.12.53_13_g4191e0c, for about 2 days. There has been no crash, but many applications failed: 511 fail / 956 pass. From the syslog, this seems to be caused by a network issue. The first 24 hours looked good, with a failure rate similar to 2.12.1, but as the test went on, applications started to fail frequently.
Some of the error messages look similar to those in LU-12065, which has already been fixed in this version.
[root@soak-16 syslog]# grep -r "Async QP"
soak-20.log:May 8 06:43:29 soak-20 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-35.log:May 8 06:42:10 soak-35 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-36.log:May 8 06:41:43 soak-36 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-17.log:May 8 06:42:27 soak-17 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-38.log:May 8 06:42:09 soak-38 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-40.log:May 8 06:41:49 soak-40 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
[root@soak-16 syslog]#
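For a longer soak run it may help to tally these events instead of reading the grep output by eye. The following is only a rough sketch: it assumes the lines have the exact `grep -r` shape shown above (file name prefix, syslog timestamp, host, then the LNetError text), and the helper name `summarize_async_qp` is hypothetical, not part of any Lustre tooling.

```python
import re
from collections import Counter

# Matches the "Async QP event" lines in the grep -r output above.
# Assumption: every line starts with "<file>:<Mon> <day> <hh:mm:ss> <host> kernel:".
ASYNC_QP_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+LNetError:.*kiblnd_qp_event\(\)\)\s+"
    r"(?P<nid>\S+):\s+Async QP event type\s+(?P<etype>\d+)"
)


def summarize_async_qp(lines):
    """Count Async QP events per peer NID and collect the hosts reporting them."""
    events_per_nid = Counter()
    hosts_per_nid = {}
    for line in lines:
        match = ASYNC_QP_RE.match(line.strip())
        if match is None:
            continue  # not an Async QP line (for example, a shell prompt)
        nid = match.group("nid")
        events_per_nid[nid] += 1
        hosts_per_nid.setdefault(nid, set()).add(match.group("host"))
    return events_per_nid, hosts_per_nid
```

Fed the six lines above, this would report a single peer NID, 192.168.1.105@o2ib, seen from six different hosts within about two minutes, which is the pattern that points at one node's interface rather than at the individual clients.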
Many of the following errors appeared in the client syslog:
May 8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:343:ptlrpc_invalidate_import()) soaked-OST0009_UUID: rc = -110 waiting for callback (6 != 0)
May 8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:369:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff94340487a400 x1632824181437936/t0(0) o4->soaked-OST0009-osc-ffff943a9be9a800@192.168.1.105@o2ib:6/4 lens 488/448 e 0 to 0 dl 1557297784 ref 2 fl UnregBULK:ES/0/ffffffff rc -5/-1
May 8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:383:ptlrpc_invalidate_import()) soaked-OST0009_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out.
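The `rc` values in these messages are negated Linux errno codes: -110 is ETIMEDOUT and -5 is EIO. A small sketch for pulling them out of a log line follows. The helper `decode_rc` and the two-entry table are illustrative assumptions (only the codes that appear above are listed, using Linux numbering), and the second value after the slash in `rc -5/-1` is deliberately not decoded.

```python
import re

# Linux errno numbers for the two codes seen in the log lines above.
# Assumption: the logs come from Linux kernels, so Linux numbering applies.
LINUX_ERRNO = {5: "EIO", 110: "ETIMEDOUT"}

# Accepts both "rc = -110" and "rc -5/-1"; only the first number is captured.
RC_RE = re.compile(r"\brc\s*=?\s*(-\d+)")


def decode_rc(line):
    """Return (code, errno_name) pairs for each rc value found in a log line."""
    decoded = []
    for match in RC_RE.finditer(line):
        code = int(match.group(1))
        decoded.append((code, LINUX_ERRNO.get(-code, "unknown")))
    return decoded
```

Read this way, the first message says the import invalidation timed out waiting for six outstanding callbacks, and the second says the bulk write request (opcode o4) to soaked-OST0009 failed with an I/O error while still on the sending list.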