[LU-11815] MDT-MDT connection stuck and never restored Created: 19/Dec/18  Updated: 19/Dec/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

2.12.0-RC3


Attachments: Text File messages-mds13.txt, Text File messages-mds14.txt
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

While running mdtest on a DNE2 configuration (two MDSes, one MDT per MDS), the MDT-MDT connection was disconnected several times, and reconnection fails and is aborted.
As far as I can see in the logs, there are some indications of network errors between the MDSes.

Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63
Dec 20 07:02:18 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63  to allow for qp creation
Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID
Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10)
Dec 20 07:02:19 mds13 kernel: Lustre: Skipped 29 previous similar messages
Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257025/real 0]  req@ffff91e9f7715700 x1620318877676672/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 368/4320 e 0 to 1 dl 1545257036 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:17:42 mds13 kernel: LustreError: Skipped 3 previous similar messages
Dec 20 07:17:47 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63  to allow for qp creation
Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 51 previous similar messages
Dec 20 07:20:01 mds13 systemd[1]: Started Session 332 of user root.
Dec 20 07:20:01 mds13 systemd[1]: Starting Session 332 of user root.
Dec 20 07:20:40 mds13 systemd-logind[2938]: Removed session 331.
Dec 20 07:19:40 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:19:40 mds13 kernel: Lustre: Skipped 1 previous similar message
Dec 20 07:20:06 mds13 kernel: LNet: 71965:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.11.226@o2ib10: connected
Dec 20 07:20:12 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:21:52 mds13 systemd-logind[2938]: New session 333 of user root.

Although it's a simple network configuration (just a single switch) and I didn't see any network errors between the MDSes and the clients/OSSes, could there still be a network problem between the two MDSes?
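A possible next step (a sketch only; it assumes the in-kernel o2iblnd driver and the infiniband-diags tools, and was not run on this setup) would be to compare the LNet-level and IB-level views on both MDSes:

lnetctl net show -v    # per-NI status and statistics for o2ib10
lnetctl stats show     # LNet-wide message/drop counters
ibstat                 # HCA port state (should be Active/LinkUp)
perfquery -x           # extended IB port error counters

If the IB port counters are clean while LNet still reports timeouts, that would point at the o2iblnd connection handling rather than the fabric itself.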



 Comments   
Comment by Shuichi Ihara [ 19/Dec/18 ]

OSS-MDS pings are fine, but MDS-MDS pings hang or fail.

[root@mds13 ~]# lctl list_nids
10.0.11.225@o2ib10
[root@mds14 ~]#  lctl list_nids
10.0.11.226@o2ib10

[root@mds13 ~]# lctl ping  10.0.11.226@o2ib10
^C
[root@mds14 ~]# lctl ping  10.0.11.225@o2ib10
failed to ping 10.0.11.225@o2ib10: Input/output error
[root@mds13 ~]# clush -g oss lctl ping 10.0.11.225@o2ib10
es14k3-vm1: 12345-0@lo
es14k3-vm1: 12345-10.0.11.225@o2ib10
es14k3-vm2: 12345-0@lo
es14k3-vm2: 12345-10.0.11.225@o2ib10
es14k3-vm3: 12345-0@lo
es14k3-vm3: 12345-10.0.11.225@o2ib10
es14k3-vm4: 12345-0@lo
es14k3-vm4: 12345-10.0.11.225@o2ib10
[root@mds13 ~]# clush -g oss lctl ping 10.0.11.226@o2ib10
es14k3-vm1: 12345-0@lo
es14k3-vm1: 12345-10.0.11.226@o2ib10
es14k3-vm2: 12345-0@lo
es14k3-vm2: 12345-10.0.11.226@o2ib10
es14k3-vm3: 12345-0@lo
es14k3-vm3: 12345-10.0.11.226@o2ib10
es14k3-vm4: 12345-0@lo
es14k3-vm4: 12345-10.0.11.226@o2ib10
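A further check (a sketch; it assumes lnetctl peer show and the osp import parameter behave in 2.12 as the osc equivalents do, and that ko2iblnd is loaded with its usual module parameters) could be to compare the peer and import state on both MDSes while the connection is stuck:

[root@mds13 ~]# lnetctl peer show --nid 10.0.11.226@o2ib10
[root@mds13 ~]# lctl get_param osp.scratch0-MDT0001-osp-MDT0000.import
[root@mds13 ~]# cat /sys/module/ko2iblnd/parameters/peer_credits
[root@mds13 ~]# cat /sys/module/ko2iblnd/parameters/map_on_demand

The "queue depth reduced from 128 to 63" messages above suggest the QP could not be created at the configured peer_credits, so confirming that both MDSes run the same ko2iblnd settings (and the same OFED stack) seems worthwhile.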