[LU-1394] lctl ping fails with EIO on freshly restarted node Created: 09/May/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Brian Murrell (Inactive) | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Low Priority | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4035 |
| Description |
|
I have two nodes, "oss1" and "mds1". I can lctl ping from oss1 to mds1 with no problem, which proves the basic infrastructure is correct. If I kill mds1 (i.e. pull the power cord), boot it back up and start LNET, the first lctl ping from oss1 to it fails with EIO. A subsequent lctl ping succeeds. In terms of network traffic when this happens, three TCP sessions from oss1 to mds1 are established serially: the first is opened and then closed, the second is opened and then closed, and the third is opened and left open. The first two exchange a few packets each and close down gracefully. The third is much longer lived; it is the session on which both the failed and then the successful lctl ping happen, and it continues to live on afterwards. Clearly this problem is not an artefact of TCP or IP but entirely an artefact of LNET itself, so it seems that LNET ought to be able to handle this situation more gracefully. |
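For reference, a minimal sketch of the reproduction sequence described above, run from oss1. The NID is taken from the logs later in this ticket; the surrounding scripting is an illustrative assumption, not part of the original report:

```sh
#!/bin/sh
# Assumes mds1 (the MGS/MDS) was just power-cycled, booted, and has started LNet.
MDS_NID=192.168.122.89@tcp

# First ping after the restart: observed to fail with EIO.
lctl ping "$MDS_NID" || echo "first ping failed (EIO observed here)"

# A subsequent ping moments later: observed to succeed.
lctl ping "$MDS_NID" && echo "second ping succeeded"
```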
| Comments |
| Comment by Brian Murrell (Inactive) [ 09/May/12 ] |
|
I should expand a bit on this and outline the real problem we are seeing. In the above scenario, if "mds1" is the MGS, registration of a new OST can fail to reach this freshly rebooted MGS and results in:

# mount -t lustre -o loop /var/tmp/test_ost /mnt/lustre/ost3
mount.lustre: mount /dev/loop1 at /mnt/lustre/ost3 failed: Input/output error
Is the MGS running?

The syslog from this:

May 9 16:40:11 oss1 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
May 9 16:40:11 oss1 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
May 9 16:40:19 oss1 kernel: Lustre: 2119:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1401506780676134 sent from MGC192.168.122.155@tcp to NID 192.168.122.89@tcp has timed out for slow reply: [sent 1336581608] [real_sent 1336581608] [current 1336581619] [deadline 11s] [delay 0s] req@ffff88003da8a800 x1401506780676134/t0(0) o-1->MGS@MGC192.168.122.155@tcp_1:26/25 lens 368/512 e 0 to 1 dl 1336581619 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
May 9 16:40:19 oss1 kernel: Lustre: MGC192.168.122.155@tcp: Reactivating import
May 9 16:40:19 oss1 kernel: Lustre: 2333:0:(import.c:526:import_select_connection()) MGC192.168.122.155@tcp: tried all connections, increasing latency to 11s
May 9 16:40:30 oss1 kernel: LustreError: 2333:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88000d9f6400 x1401506780676136/t0(0) o-1->MGS@MGC192.168.122.155@tcp_0:26/25 lens 4736/4736 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:40:30 oss1 kernel: LustreError: 2333:0:(obd_mount.c:1163:server_start_targets()) Required registration failed for lustre-OSTffff: -5
May 9 16:40:30 oss1 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
May 9 16:40:30 oss1 kernel: LustreError: 2333:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -5
May 9 16:40:30 oss1 kernel: LustreError: 2333:0:(obd_mount.c:1512:server_put_super()) no obd lustre-OSTffff
May 9 16:40:30 oss1 kernel: LustreError: 2333:0:(obd_mount.c:141:server_deregister_mount()) lustre-OSTffff not registered
May 9 16:40:30 oss1 kernel: Lustre: server umount lustre-OSTffff complete
May 9 16:40:30 oss1 kernel: LustreError: 2333:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-5)
May 9 16:40:31 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88000812b000 x1401506780676137/t0(0) o-1->MGS@MGC192.168.122.155@tcp_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:40:37 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88000812b000 x1401506780676138/t0(0) o-1->MGS@MGC192.168.122.155@tcp_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:40:43 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88000812b000 x1401506780676139/t0(0) o-1->MGS@MGC192.168.122.155@tcp_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:40:54 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88003da8a000 x1401506780676140/t0(0) o-1->MGS@MGC192.168.122.155@tcp_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:41:00 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88003da8a000 x1401506780676141/t0(0) o-1->MGS@MGC192.168.122.155@tcp_1:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:41:14 oss1 kernel: Lustre: 2119:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1401506780676142 sent from MGC192.168.122.155@tcp to NID 192.168.122.89@tcp has timed out for slow reply: [sent 1336581658] [real_sent 1336581658] [current 1336581674] [deadline 16s] [delay 0s] req@ffff88003d883800 x1401506780676142/t0(0) o-1->MGS@MGC192.168.122.155@tcp_1:26/25 lens 368/512 e 0 to 1 dl 1336581674 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
May 9 16:41:14 oss1 kernel: Lustre: 2119:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
May 9 16:41:16 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff88000812b000 x1401506780676144/t0(0) o-1->MGS@MGC192.168.122.155@tcp_1:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
May 9 16:41:16 oss1 kernel: LustreError: 2339:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 1 previous similar message
May 9 16:41:23 oss1 kernel: Lustre: 2120:0:(import.c:526:import_select_connection()) MGC192.168.122.155@tcp: tried all connections, increasing latency to 16s
May 9 16:41:23 oss1 kernel: Lustre: MGC192.168.122.155@tcp: Reactivating import

A second attempt at mounting is successful and the syslog from that:

May 9 16:49:37 oss1 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
May 9 16:49:38 oss1 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
May 9 16:49:38 oss1 kernel: Lustre: lustre-OST0002: new disk, initializing
May 9 16:49:38 oss1 kernel: Lustre: 2393:0:(filter.c:1241:filter_prep_groups()) lustre-OST0002: initialize groups [0,0]
May 9 16:49:38 oss1 kernel: Lustre: lustre-OST0002: Now serving lustre-OST0002 on /dev/loop1 with recovery enabled
May 9 16:49:38 oss1 kernel: LustreError: 2383:0:(obd_class.h:1622:obd_notify()) obd lustre-OST0002 has no notify handler
May 9 16:49:43 oss1 kernel: Lustre: 2289:0:(ldlm_lib.c:877:target_handle_connect()) lustre-OST0002: connection from lustre-MDT0000-mdtlov_UUID@192.168.122.89@tcp t0 exp (null) cur 1336582183 last 0
May 9 16:49:43 oss1 kernel: Lustre: 2289:0:(filter.c:2697:filter_connect_internal()) lustre-OST0002: Received MDS connection for group 0
May 9 16:49:43 oss1 kernel: Lustre: 2289:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0002-> netid 20000: select flavor null |
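Since the second mount attempt consistently succeeds, a simple retry loop around the registration mount can serve as a stop-gap while the underlying issue is investigated. This is only a sketch of that workaround, using the paths from the example above; the retry count and sleep interval are illustrative assumptions:

```sh
#!/bin/sh
# Retry the OST mount a few times: the first attempt can fail with -5 (EIO)
# while the MGC import is still reconnecting to the freshly booted MGS.
for attempt in 1 2 3; do
    if mount -t lustre -o loop /var/tmp/test_ost /mnt/lustre/ost3; then
        echo "mount succeeded on attempt $attempt"
        break
    fi
    echo "mount attempt $attempt failed, retrying..."
    sleep 10
done
```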
| Comment by Robert Read (Inactive) [ 09/May/12 ] |
|
It is not necessarily a bug for LNET to return an error, though I actually don't see any LNET errors in your log. It looks like the EIO there is from ptlrpc_import_delay_req because the request exceeded the send time limit while waiting for the import to be connected, so it seems we're not retrying the connection to the MGS soon enough. |
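One way to confirm this interpretation on oss1 is to look at the MGC import state at the moment the mount fails. A sketch, assuming the mgc obd exposes the usual import proc entries on this release (the exact paths and parameter names vary between Lustre versions, so treat this as illustrative):

```sh
# List the MGC device and show its import; the state should be something
# other than FULL (e.g. DISCONN or CONNECTING) while requests are being
# refused with -EIO by ptlrpc_import_delay_req.
lctl dl | grep -i mgc
lctl get_param mgc.*.import 2>/dev/null \
    || cat /proc/fs/lustre/mgc/MGC192.168.122.155@tcp/import
```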
| Comment by Peter Jones [ 10/May/12 ] |
|
Doug, could you please treat this issue as a priority? Thanks, Peter |
| Comment by Doug Oucharek (Inactive) [ 12/May/12 ] |
|
In working to reproduce this with my own VMs, I have found the following:

When the MGS is "reset", the OSS keeps trying to communicate with it over one TCP connection. After about 50 seconds, the OSS LNet layer times out the TCP connection and closes it. After that point, any attempt to communicate with the MGS (i.e. lctl ping) will result in an attempt to re-open the TCP connection. If the MGS LNet layer is up and running, that will succeed and the ping will succeed. If the MGS LNet layer is not up, the attempt to re-open the TCP connection will fail and the ping will fail. The above is expected behaviour.

A window does exist if the MGS boots up and starts running its LNet module within the 50 second timeout. In this window, a ping (or any communication) will try to use the still-open TCP socket but will time out, as the TCP connection has no other endpoint. Even if LNet is running on the MGS, use of the old TCP connection will experience an error because the MGS OS will send back a TCP RST packet to reset the connection (i.e. close it), since it knows nothing of that connection anymore. This means that the ping will fail even though the MGS LNet layer is running. This is what I believe is happening in the failed test. I am unable to reproduce this exact scenario as I cannot get my VM to reboot in less than about 2 minutes; in that time, the TCP connection gets closed due to a timeout.

A potential way to fix this is to attempt a new TCP connection when the current one is taking too long to respond. However, this is dangerous as it may trigger unnecessary TCP connection closures/re-opens on very busy TCP channels. Also, the extra time to re-establish a TCP connection may cause the current request (i.e. the ping) to time out anyway.

A question I have is this: if a full 1 - 1.5 minutes elapses after the reset of the MGS before trying any actions, does the first ping (or mount) still fail? |
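To observe the window described above from the OSS side, one can watch the socklnd TCP connections and look for the RST coming back from the rebooted MGS. A sketch using standard tools; LNET's default TCP acceptor port is 988, and the interface name here is an assumption for the test setup:

```sh
# On oss1: list established socklnd TCP connections to the MGS.
netstat -tn | grep ':988'

# Capture traffic to/from the MGS while it reboots; an RST from the MGS
# against the stale connection should appear here once its OS is back up.
tcpdump -ni eth0 'host 192.168.122.89 and tcp port 988'
```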
| Comment by Kelsey Prantis (Inactive) [ 12/May/12 ] |
|
I've seen this fail 15-20 minutes after reboot, and it always works on the second mount attempt, so I do not think what we are seeing is linked to a TCP timeout. |
| Comment by Doug Oucharek (Inactive) [ 14/May/12 ] |
|
That is very strange: 15-20 minutes and the connection is still not being closed. I must not be doing something that you are, as I cannot reproduce that behaviour. Do you see any TCP traffic from the OSS to the MGS while the MGS is rebooting? I do, and it is that traffic which triggers the TCP connection to close (after a timeout period). |
| Comment by Liang Zhen (Inactive) [ 14/May/12 ] |
|
Hi, a quick question: is "keepalive" of socklnd disabled (i.e. ksocklnd keepalive=0)? Liang |
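For anyone following along, the socklnd keepalive settings can be checked by reading the loaded ksocklnd module parameters. A sketch, assuming the standard parameter names and sysfs layout on this vintage of kernel (locations can differ slightly between releases); the modprobe.d file name is illustrative:

```sh
# Show the current keepalive tunables of the loaded socklnd module
# (keepalive=0 disables the LNet-level keepalive ping).
grep . /sys/module/ksocklnd/parameters/keepalive* 2>/dev/null

# To set it explicitly, add a module option before loading LNet, e.g. in
# /etc/modprobe.d/lustre.conf (file name is an assumption):
#   options ksocklnd keepalive=30
```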
| Comment by Peter Jones [ 31/May/12 ] |
|
I understand from Brian that this issue is being worked around, so I am dropping it in priority for now. |