[LU-11815] MDT-MDT connection stuck and never restored Created: 19/Dec/18 Updated: 19/Dec/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: | 2.12.0-RC3 |
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While running mdtest on a DNE2 configuration (two MDSs, one MDT per MDS), the MDT-MDT connection was dropped several times, and reconnection fails and is aborted.

Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63
Dec 20 07:02:18 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation
Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID
Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10)
Dec 20 07:02:19 mds13 kernel: Lustre: Skipped 29 previous similar messages
Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257025/real 0] req@ffff91e9f7715700 x1620318877676672/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 368/4320 e 0 to 1 dl 1545257036 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:17:42 mds13 kernel: LustreError: Skipped 3 previous similar messages
Dec 20 07:17:47 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation
Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 51 previous similar messages
Dec 20 07:20:01 mds13 systemd[1]: Started Session 332 of user root.
Dec 20 07:20:01 mds13 systemd[1]: Starting Session 332 of user root.
Dec 20 07:20:40 mds13 systemd-logind[2938]: Removed session 331.
Dec 20 07:19:40 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:19:40 mds13 kernel: Lustre: Skipped 1 previous similar message
Dec 20 07:20:06 mds13 kernel: LNet: 71965:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.11.226@o2ib10: connected
Dec 20 07:20:12 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:21:52 mds13 systemd-logind[2938]: New session 333 of user root.

Although this is a simple network configuration (just a single switch) and I didn't see any network errors between the MDSs and clients/OSSs, should I still suspect a network problem between the MDSs?
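A minimal first-pass diagnostic sketch for the symptoms above (the repeated "queue depth reduced from 128 to 63 to allow for qp creation" and timed-out RDMA messages) would be to dump LNet's view of the problem peer and the ko2iblnd settings on both MDS nodes. These commands are not taken from the report; they assume lnetctl as shipped with Lustre 2.12 and the standard ko2iblnd module parameters:

# LNet view of the problem peer (run on mds13; swap NIDs when run on mds14)
lnetctl peer show --nid 10.0.11.226@o2ib10 -v
lnetctl net show -v
lnetctl stats show
# ko2iblnd settings that influence the requested QP/queue depth
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/concurrent_sends
cat /sys/module/ko2iblnd/parameters/map_on_demand

The "queue depth reduced ... to allow for qp creation" message indicates the QP could not be created at the requested work-request depth, which is influenced by the parameters above, so comparing their values on the two MDSs (and on the OSSs, which connect fine) would be a cheap first check.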
| Comments |
| Comment by Shuichi Ihara [ 19/Dec/18 ] |
|
OSS-MDS pings are fine, but MDS-MDS pings hang or fail.

[root@mds13 ~]# lctl list_nids
10.0.11.225@o2ib10

[root@mds14 ~]# lctl list_nids
10.0.11.226@o2ib10

[root@mds13 ~]# lctl ping 10.0.11.226@o2ib10
^C

[root@mds14 ~]# lctl ping 10.0.11.225@o2ib10
failed to ping 10.0.11.225@o2ib10: Input/output error

[root@mds13 ~]# clush -g oss lctl ping 10.0.11.225@o2ib10
es14k3-vm1: 12345-0@lo
es14k3-vm1: 12345-10.0.11.225@o2ib10
es14k3-vm2: 12345-0@lo
es14k3-vm2: 12345-10.0.11.225@o2ib10
es14k3-vm3: 12345-0@lo
es14k3-vm3: 12345-10.0.11.225@o2ib10
es14k3-vm4: 12345-0@lo
es14k3-vm4: 12345-10.0.11.225@o2ib10

[root@mds13 ~]# clush -g oss lctl ping 10.0.11.226@o2ib10
es14k3-vm1: 12345-0@lo
es14k3-vm1: 12345-10.0.11.226@o2ib10
es14k3-vm2: 12345-0@lo
es14k3-vm2: 12345-10.0.11.226@o2ib10
es14k3-vm3: 12345-0@lo
es14k3-vm3: 12345-10.0.11.226@o2ib10
es14k3-vm4: 12345-0@lo
es14k3-vm4: 12345-10.0.11.226@o2ib10
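Since lctl ping between the two MDS NIDs fails while OSS-to-MDS pings succeed in both directions, a possible next step (not from this ticket; it assumes infiniband-diags and the perftest package are installed, and that the o2ib NIDs correspond to IPoIB addresses 10.0.11.225/10.0.11.226) would be to test the mds13 <-> mds14 path below LNet:

# IB port health on both MDS nodes (port state should be Active / LinkUp)
ibstat
# IPoIB reachability, from mds13; repeat the reverse direction from mds14
ping -c 3 10.0.11.226
# Raw RDMA test that bypasses LNet entirely
ib_write_bw                 # on mds14 (server side)
ib_write_bw 10.0.11.226     # on mds13 (client side)
# If the raw RDMA test passes, capture LNet's peer state around a failing lctl ping
lnetctl peer show --nid 10.0.11.226@o2ib10 -v

If the raw RDMA test also stalls only between the two MDSs, that would point at the fabric or HCA configuration rather than at Lustre/LNet.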