Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: Lustre 2.11.0
Environment: Cray CLE6 system running 2.11 clients with 2.11 servers.
Severity: 3
Description
On a production system we are seeing the following errors, which are causing clients to be evicted:
[85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds
[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8
[123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995
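To make the key fields in these messages easier to read, here is a minimal Python sketch that pulls the peer NID out of the RDMA timeout line and checks the timestamps in the eviction line. The regular expressions and the function names `parse_rdma_timeout` and `parse_eviction` are illustrative assumptions based only on the two lines quoted above, not a general Lustre log parser. The `c`, `oc` and `rc` fields and the number in parentheses are kept under neutral names and are not interpreted.

```python
import re

# The two log lines quoted above, copied verbatim.
LOG_LINES = [
    "[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) "
    "Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8",
    "[123887.254790] Lustre: MGS: haven't heard from client "
    "51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. "
    "I think it's dead, and I am evicting it. exp ffff961d87b9a000, "
    "cur 1547261222 expire 1547261072 last 1547260995",
]

# Hypothetical patterns, written only against the lines above.
RDMA_RE = re.compile(
    r"Timed out RDMA with (?P<nid>\S+) \((?P<paren>\d+)\): "
    r"c: (?P<c>\d+), oc: (?P<oc>\d+), rc: (?P<rc>\d+)"
)
EVICT_RE = re.compile(
    r"haven't heard from client (?P<uuid>\S+) \(at (?P<nid>\S+)\) "
    r"in (?P<secs>\d+) seconds.*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_rdma_timeout(line):
    """Return the peer NID and raw counters from an RDMA timeout line, or None."""
    m = RDMA_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert everything except the NID to integers.
    return {k: (v if k == "nid" else int(v)) for k, v in fields.items()}


def parse_eviction(line):
    """Return client identity and timestamps from an eviction line, or None."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    return {k: (v if k in ("uuid", "nid") else int(v)) for k, v in fields.items()}


if __name__ == "__main__":
    rdma = parse_rdma_timeout(LOG_LINES[0])
    evict = parse_eviction(LOG_LINES[1])
    print("RDMA timeout peer:", rdma["nid"])
    print("Evicted client:", evict["uuid"], "at", evict["nid"])
    # The reported silence should equal cur - last.
    print("cur - last   =", evict["cur"] - evict["last"])
    print("cur - expire =", evict["cur"] - evict["expire"])
```

For the eviction line above, `cur - last` is 1547261222 - 1547260995 = 227, which matches the "in 227 seconds" text, and `cur - expire` is 150, so the client was evicted 150 seconds after its recorded expiry time.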
Our setup has two back-end file systems: F1, which runs a 2.8.2 server stack, and F2, which runs a 2.11 server stack with ZFS (0.7.12). All clients are Cray 2.11 clients. The LNet configuration is:
F1 file system server backend with 2.8.2 stack, ldiskfs:
map_on_demand:0
concurrent_sends:0
peer_credits:8
F2 file system server 2.11 (ZFS 0.7.12):
map_on_demand:1
concurrent_sends:63
peer_credits:8
C3 (Cray 2.11 router):
map_on_demand:0
concurrent_sends:16
peer_credits:8 (o2ib)
peer_credits:16 (gni)
C4 (Cray 2.11 router):
map_on_demand:0
concurrent_sends:63
peer_credits:8 (o2ib)
peer_credits:16 (gni)
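To see at a glance which o2ib settings differ between the four nodes listed above, the values can be tabulated and compared. The following is a rough Python sketch; the dictionary `LNET_SETTINGS` and the helper `differing_parameters` are hypothetical names introduced for illustration, and only the o2ib `peer_credits` value is included for the routers (the gni value of 16 is left out).

```python
# o2ib LNet settings as listed in this ticket (gni peer_credits omitted).
LNET_SETTINGS = {
    "F1": {"map_on_demand": 0, "concurrent_sends": 0, "peer_credits": 8},
    "F2": {"map_on_demand": 1, "concurrent_sends": 63, "peer_credits": 8},
    "C3": {"map_on_demand": 0, "concurrent_sends": 16, "peer_credits": 8},
    "C4": {"map_on_demand": 0, "concurrent_sends": 63, "peer_credits": 8},
}


def differing_parameters(settings):
    """Return {parameter: {value: [nodes]}} for parameters whose value
    is not the same on every node."""
    result = {}
    params = sorted({p for node in settings.values() for p in node})
    for param in params:
        by_value = {}
        for node, values in settings.items():
            by_value.setdefault(values[param], []).append(node)
        if len(by_value) > 1:  # more than one distinct value in use
            result[param] = by_value
    return result


if __name__ == "__main__":
    for param, by_value in differing_parameters(LNET_SETTINGS).items():
        print(param)
        for value, nodes in by_value.items():
            print(f"  {value}: {', '.join(nodes)}")
```

With the values from this ticket, `peer_credits` is 8 on every o2ib interface, while `map_on_demand` and `concurrent_sends` are not uniform: F2 is the only node with `map_on_demand` set to 1, and `concurrent_sends` takes three different values (0, 16 and 63).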
Currently the problems are seen only between 2.11 clients and the 2.11 file system (F2). Because F1 runs 2.8 with peer_credits set to 8, that setting affects the rest of the systems.