
RDMA packets sent from client to MGS are timing out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.11.0
    • Environment: Cray CLE6 system running 2.11 clients with 2.11 servers.
    • Severity: 3

    Description

      We have seen in a production system the following errors, which are causing clients to be evicted.

      [85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds

      [85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8

      [123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995

      For our setup we have two back-end file systems: F1, which is running a 2.8.2 server stack, and F2, which is running a 2.11 server stack with ZFS (0.7.12). The clients are all running 2.11 Cray clients. The LNet configuration is:

      F1 file system server backend with 2.8.2 stack, ldiskfs:

          map_on_demand:0

          concurrent_sends:0

          peer_credits:8

      F2 file system server 2.11 (ZFS 0.7.12)

          map_on_demand:1

          concurrent_sends:63

          peer_credits:8

      C3 (cray 2.11 router)

         map_on_demand:0

         concurrent_sends:16

         peer_credits:8 (o2ib)

         peer_credits:16 (gni).

      C4 (cray 2.11 router)

         map_on_demand:0

         concurrent_sends:63

         peer_credits:8 (o2ib)

         peer_credits:16 (gni)

      Currently the problems are seen only with 2.11 clients against the 2.11 file system (F2). Since F1 is 2.8 and its peer_credits are set to 8, this impacts the rest of the systems.
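      For reference, o2iblnd tunables like the ones listed above are typically set as ko2iblnd module options. A hypothetical modprobe fragment matching the F2 values quoted in this ticket (the file path is an assumption; verify the option names against your Lustre version, since the patches on this ticket changed how concurrent_sends is handled):

```shell
# Hypothetical /etc/modprobe.d/ko2iblnd.conf fragment mirroring the F2
# settings from this ticket. Option names follow the stock o2iblnd
# module parameters; confirm them with `modinfo ko2iblnd` on your build.
options ko2iblnd map_on_demand=1 concurrent_sends=63 peer_credits=8
```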


          Activity

            pjones Peter Jones added a comment -

            ok. Main patch landed for 2.13. 34200 should be tracked under a new Jira ticket


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34646/
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 056fe83188f0a24de9e27248a7574c8fae867163

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34396/
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83e45ead69babfb2909a3157f054fcd8fdf33360

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34646
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f08724454d919717e8a70555cf1194acddc731ad
            pjones Peter Jones added a comment -

            Ah, I see. Then we'll press to get 34396 landed, and 34200 should probably move to being tracked under a new Jira reference. That way it'll be simplest to ensure that we get the desired patch into 2.12.1.


            simmonsja James A Simmons added a comment -

            Actually, there are two patches. I nacked the one patch, but I like the other patch, which restored the concurrent_sends functionality. Also, the other patch is what we ended up running in production.
            pjones Peter Jones added a comment -

            We'll land it to b2_12 as soon as it's landed to master. ATM it's the -1 review from you gating that. Are you willing to reconsider that -1 in light of the success of the patch, or did you actually revise the patch as you had suggested before applying it in production?


            simmonsja James A Simmons added a comment -

            We have released this patch into our production server system, and it has resolved the peer-credit starvation issues on the MGS that were causing client evictions. The workaround before the patch was to remove a batch of client nodes until the evictions stopped. Now, with the patch in production, we have all the clients back in use. Please consider landing this for 2.12 LTS.

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34396
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2dead91d20b77e6c279aa1ca048b53d6f5617b10

            ashehata Amir Shehata (Inactive) added a comment -

            Yes, this assert needs to change. Although, I'm now considering that it might be a good idea to bring back concurrent_sends. Initially I was thinking it's enough to limit the number of txs by the queue depth, but it seems that in order to saturate the link you might want to increase concurrent_sends above the queue depth. This will lead to queued txs, but it might be necessary to make sure that we maximize the bandwidth.
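            The interplay described above can be sketched with a toy model (not Lustre code; `queue_depth` and `concurrent_sends` here are simplified stand-ins for the o2iblnd tunables): when concurrent_sends exceeds the queue depth, the surplus txs sit on the tx queue, ready to be posted as completions free slots.

```python
def dispatch(n_txs, queue_depth, concurrent_sends):
    """Toy model of o2iblnd send dispatch.

    Returns (active, queued) after attempting to send n_txs:
    - the LND accepts at most concurrent_sends txs at once,
    - only queue_depth of those can be posted to the QP,
    - the remainder wait on the tx queue.
    """
    allowed = min(n_txs, concurrent_sends)  # cap on in-flight txs
    active = min(allowed, queue_depth)      # posted work requests
    queued = allowed - active               # waiting on the tx queue
    return active, queued

# With concurrent_sends (63) above the queue depth (8), txs queue up,
# keeping the link saturated as completions free posting slots.
print(dispatch(100, queue_depth=8, concurrent_sends=63))  # (8, 55)
```

            With concurrent_sends equal to the queue depth, nothing queues and the link may go idle between completions; that is the trade-off the comment above is weighing.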

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 13
