[LU-1733] Intermittent issue with some clients not being able to access OSTS. LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target) Created: 10/Aug/12 Updated: 15/Mar/14 Resolved: 15/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Peter Piela (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Terascala Lustre appliance. Lustre version: 1.8.4. OS: CentOS 5.4 |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 3985 |
| Description |
|
The issue, as I understand it, is that clients (both 1g and 10g) lose their connection to an OSS. The affected client is then unable to lctl ping the OSS, though I believe it can still ping it. The OSS is reachable from other clients on the same network, via both ping and lctl ping, and the OSS itself appears to be fine from what we've been able to see. Logs are from ts-server-05 and 06 (because they're not 07 or 08). The UUID messages, shown with context and with no intervening lines removed:

Aug 6 19:24:59 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 5df98fca-2127-4997-2987-27812afa3ec8 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 6 19:26:46 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 474cbe0a-7ca0-b06e-994f-3765fd83b501 (at 192.168.255.244@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 6 19:31:48 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target)
Aug 6 19:31:48 ts-server-05 kernel: LustreError: 23182:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8102a3b2a800 x1408051660878194/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344281608 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 19:31:48 ts-server-05 kernel: LustreError: 23182:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Aug 6 19:31:48 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
Aug 6 19:31:48 ts-server-05 kernel: LustreError: 23230:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81041551b000 x1408051660878193/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344281608 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 20:50:15 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client e867e149-84ad-a3a2-b7d3-fba60c7e6f6d (at 10.145.62.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 6 20:50:15 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
Aug 7 17:05:47 ts-server-05 ts-log-date: Timestamp 1344359143
Aug 7 17:06:01 ts-server-05 ts-log-date: Timestamp 1344359158
Aug 7 17:26:38 ts-server-05 kernel: LustreError: 14619:0:(acceptor.c:435:lnet_acceptor()) Refusing connection from 165.140.72.11: insecure port 42301
Aug 7 17:26:38 ts-server-05 rpc.statd[5109]: recv_rply: can't decode RPC message!
Aug 7 17:28:48 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 29c8167a-8f77-18d6-9d6c-40740141bd9f (at 160.62.219.80@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 7 17:28:48 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
Aug 7 17:28:48 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 29c8167a-8f77-18d6-9d6c-40740141bd9f (at 160.62.219.80@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 7 17:42:38 ts-server-05 kernel: Lustre: 23275:0:(ldlm_lib.c:572:target_handle_reconnect()) fs-OST0005: ccc0fafc-ecbd-ea39-5e31-bba746b6693e reconnecting
Aug 7 17:42:38 ts-server-05 kernel: Lustre: 23173:0:(ldlm_lib.c:572:target_handle_reconnect()) fs-OST0004: ccc0fafc-ecbd-ea39-5e31-bba746b6693e reconnecting
Aug 7 19:02:45 ts-server-05 kernel: schedule_timeout: wrong timeout value ffffffffffffff1d from ffffffff886c8354
Aug 7 21:57:59 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 50ec0e03-4fbc-d138-c26d-6554dfa789ed (at 160.62.219.68@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 7 22:00:52 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0007_UUID' is not available for connect (no target)
Aug 7 22:00:52 ts-server-05 kernel: LustreError: 23257:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810126baf800 x1409022347227196/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344376952 ref 1 fl Interpret:/0/0 rc -19/0
Aug 7 22:00:52 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
Aug 7 22:01:33 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0007_UUID' is not available for connect (no target)
Aug 7 22:01:33 ts-server-05 kernel: LustreError: 23293:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810147de7c00 x1409022347230552/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344376993 ref 1 fl Interpret:/0/0 rc -19/0
Aug 7 22:01:33 ts-server-05 kernel: LustreError: 23293:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Aug 7 22:01:33 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
Aug 8 00:49:15 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 8 00:49:15 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
Aug 8 00:49:16 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 8 00:50:52 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target)
Aug 8 00:50:52 ts-server-05 kernel: LustreError: 23239:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810082de2400 x1409580924063913/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344387152 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:50:52 ts-server-05 kernel: LustreError: 23239:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Aug 8 00:50:52 ts-server-05 kernel: LustreError: Skipped 1 previous similar message

It's interesting to note that all the UUID messages pertain to OSTs that are active on the other OSS in this pair. Looking at the other node in this pair for roughly the same time frame gives us the following messages:

Aug 7 23:40:28 ts-server-06 kernel: LustreError: Skipped 23 previous similar messages
Aug 7 23:50:33 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available for connect (no target)
Aug 7 23:50:33 ts-server-06 kernel: LustreError: 24143:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103c899d000 x1408051529009808/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344383533 ref 1 fl Interpret:/0/0 rc -19/0
Aug 7 23:50:33 ts-server-06 kernel: LustreError: 24143:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 27 previous similar messages
Aug 7 23:50:33 ts-server-06 kernel: LustreError: Skipped 27 previous similar messages
Aug 8 00:01:18 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available for connect (no target)
Aug 8 00:01:18 ts-server-06 kernel: LustreError: 24204:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8102df7ee000 x1408051529010179/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344384178 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:01:18 ts-server-06 kernel: LustreError: 24204:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 29 previous similar messages
Aug 8 00:01:18 ts-server-06 kernel: LustreError: Skipped 29 previous similar messages
Aug 8 00:11:34 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available for connect (no target)
Aug 8 00:11:34 ts-server-06 kernel: LustreError: 24198:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8104138fa800 x1408051529010508/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344384794 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:11:34 ts-server-06 kernel: LustreError: 24198:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 19 previous similar messages
Aug 8 00:11:34 ts-server-06 kernel: LustreError: Skipped 19 previous similar messages
Aug 8 00:21:47 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available for connect (no target)
Aug 8 00:21:47 ts-server-06 kernel: LustreError: 24583:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101d8259000 x1408051529010879/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344385407 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:21:47 ts-server-06 kernel: LustreError: 24583:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 35 previous similar messages
Aug 8 00:21:47 ts-server-06 kernel: LustreError: Skipped 35 previous similar messages
Aug 8 00:32:20 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available for connect (no target)
Aug 8 00:32:20 ts-server-06 kernel: LustreError: 24190:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810414436800 x1408051529011224/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344386040 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:32:20 ts-server-06 kernel: LustreError: 24190:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 21 previous similar messages
Aug 8 00:32:20 ts-server-06 kernel: LustreError: Skipped 21 previous similar messages
Aug 8 00:38:29 ts-server-06 kernel: schedule_timeout: wrong timeout value ffffffffffffffe5 from ffffffff886c8354
Aug 8 00:42:29 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available for connect (no target)
Aug 8 00:42:29 ts-server-06 kernel: LustreError: 24668:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81003d97dc00 x1408051529011582/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344386649 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:42:29 ts-server-06 kernel: LustreError: 24668:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 31 previous similar messages
Aug 8 00:42:29 ts-server-06 kernel: LustreError: Skipped 31 previous similar messages
Aug 8 00:48:14 ts-server-06 kernel: schedule_timeout: wrong timeout value ffffffffffffffb2 from ffffffff886c8354
Aug 8 00:49:19 ts-server-06 kernel: Lustre: fs-OST0006: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 8 00:49:19 ts-server-06 kernel: Lustre: Skipped 1 previous similar message
Aug 8 00:49:19 ts-server-06 kernel: Lustre: fs-OST0007: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 8 00:52:05 ts-server-06 kernel: schedule_timeout: wrong timeout value fffffffffffffe82 from ffffffff886c8354
Aug 8 00:53:06 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available for connect (no target)
Aug 8 00:53:06 ts-server-06 kernel: LustreError: 24591:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103661ad400 x1408051529011939/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344387286 ref 1 fl Interpret:/0/0 rc -19/0
Aug 8 00:53:06 ts-server-06 kernel: LustreError: 24591:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 25 previous similar messages
Aug 8 00:53:06 ts-server-06 kernel: LustreError: Skipped 25 previous similar messages
Aug 8 01:02:10 ts-server-06 kernel: schedule_timeout: wrong timeout value fffffffffffffee6 from ffffffff886c8354
Aug 8 01:03:20 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available for connect (no target)
Aug 8 01:03:20 ts-server-06 kernel: LustreError: 24237:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101ce3ba400 x1408051529012283/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344387900 ref 1 fl Interpret:/0/0 rc -19/0

It's interesting to note that the times do not line up. Without direct access to the servers I can't verify that ntp hasn't failed and that time hasn't drifted, but from what I can correlate within the log files, I'm certain the times are within a minute of each other, and very likely closer than that. Client evictions seem to happen mostly in blocks, as you'd expect from a dead client. There is one exception in the current log: a single client evicted only from the 04 server, and from the 08 server, which I'm narrowing down to see if it's an exception due to its network issue (resolved on the 4th?). |
| Comments |
| Comment by Peter Piela (Inactive) [ 10/Aug/12 ] |
|
Lustre client version $ cat /proc/fs/lustre/version |
| Comment by Peter Jones [ 10/Aug/12 ] |
|
Peter, Engineering are reviewing the information on this ticket. In the meantime, can you please clarify: is the site completely down at this point? From the description below it sounds like there is service disruption, but only to a subset of the cluster. Peter |
| Comment by Johann Lombardi (Inactive) [ 10/Aug/12 ] |
|
The messages "UUID 'xxx' is not available for connect (no target)" are normal. When a client fails to access a target (OST or MDT) after two attempts, it tries via the failover partner. For example, fs-OST0004 on ts-server-05 evicted a client:

ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.

but that client is trying to reach this OST via ts-server-06, hence this message:

Aug 8 00:11:34 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available for connect (no target)

The real problems are: (1) why did this client get evicted in the first place, and (2) why hasn't it managed to reconnect to ts-server-05 just after the eviction? From the logs:

Aug 8 00:49:15 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 8 00:49:15 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
Aug 8 00:49:16 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.

we can tell that the client was evicted by the ping evictor because it hadn't sent a ping request to the server for a while. To learn more about what happened, could you please provide us with the logs of 192.168.255.254@tcp around the same time? By the way, is this bug really hurting a production system? If not, it should probably not be a S1. Thanks. |
| Comment by Steve Butler (Inactive) [ 10/Aug/12 ] |
|
Yes I want to emphasize the priority. It is a very high visibility account - top 3 Pharma and the main user is one of the top researchers in genomics. If he endorses Lustre then it will spread into this segment. This stability issue has been tricky. Their jobs run for 20+ hours and then randomly crash. |
| Comment by Johann Lombardi (Inactive) [ 10/Aug/12 ] |
|
While waiting for the client logs, I noticed the following messages:

Aug 7 17:26:38 ts-server-05 kernel: LustreError: 14619:0:(acceptor.c:435:lnet_acceptor()) Refusing connection from 165.140.72.11: insecure port 42301
Aug 7 17:26:38 ts-server-05 rpc.statd[5109]: recv_rply: can't decode RPC message!

Could you please tell us more about your lnet configuration? Did you change the acceptor port? If so, did you configure the client to use the same port? Please advise. |
| Comment by Doug Oucharek (Inactive) [ 10/Aug/12 ] |
|
With regards to the lnet_acceptor log Johann has pointed out: do any of the routers between the client and servers have NAT turned on? NAT can change the source port number and when it exceeds 1023, that can trigger this log and subsequent communication issues. |
| Comment by Andreas Dilger [ 10/Aug/12 ] |
|
Johann, I noticed the same thing. It looks like a TCP port scan is happening, since this hits both the Lustre acceptor and the NFS stat daemon. Possibly this is done by internal security software, or there may be malicious software active on their network scanning for vulnerable services. It is not recommended that Lustre filesystems be accessible to the Internet at large, since there is only host-based authentication and any untrusted host could connect to the filesystem.

The other messages that I noticed were:

Aug 8 00:38:29 ts-server-06 kernel: schedule_timeout: wrong timeout value ffffffffffffffe5 from ffffffff886c8354
Aug 8 00:48:14 ts-server-06 kernel: schedule_timeout: wrong timeout value ffffffffffffffb2 from ffffffff886c8354
Aug 8 00:52:05 ts-server-06 kernel: schedule_timeout: wrong timeout value fffffffffffffe82 from ffffffff886c8354

This indicates some code is trying to sleep for a negative amount of time. This specific case is handled (and ignored) by the kernel, but it may be possible that the calculation sometimes works out to "0", which means "sleep forever" and could hold kernel service threads busy indefinitely until they are woken up for some other reason. It would be worthwhile to determine which module/function the address 0xffffffff886c8354 maps to. This could be done by looking for this address in /proc/kallsyms, sorted by address. |
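A sketch of that lookup (assumes a standard x86_64 /proc/kallsyms where addresses are fixed-width lowercase hex, so plain string comparison orders them correctly; the address is taken from the schedule_timeout messages above):

```shell
#!/bin/sh
# Resolve a kernel text address to the nearest preceding symbol in
# /proc/kallsyms. Sorting first matters because module symbols are
# appended after core-kernel symbols and are not globally ordered.
addr=ffffffff886c8354    # address from the schedule_timeout messages

sort /proc/kallsyms | awk -v a="$addr" '
    $1 <= a { sym = $0 }          # remember the last symbol at or below addr
    $1 >  a { print sym; exit }   # first symbol past addr: previous one wins
'
```

The printed line names the function (and, in brackets, the module) whose code contains the faulting caller.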
| Comment by Johann Lombardi (Inactive) [ 13/Aug/12 ] |
|
Still waiting for feedback from Terascala. |
| Comment by Tom Riemer (Inactive) [ 14/Aug/12 ] |
|
As Steve said, this is a production system, and this bug is causing jobs to fail. I've attached the logs from 192.168.255.254 from the client. I'm not sure this is going to be an indicative client to look at, as the logs I have from the servers show evictions on 11 of 12 OSTs and the MDT, making it look like a complete client failure. I've also requested logs from 160.62.219.92, which shows something that seems more useful to me (it's evicted from only 2 OSTs, both on the same OSS). Relevant logs from the OSS below:

Aug 6 08:00:39 ts-server-04 kernel: Lustre: server-OST0002: haven't heard from client 4300f4e3-a276-d9df-eae3-00a566b11338 (at 160.62.219.82@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 6 09:13:43 ts-server-04 kernel: LustreError: 137-5: UUID 'server-OST0001_UUID' is not available for connect (no target)
Aug 6 09:13:43 ts-server-04 kernel: LustreError: 15362:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8104058cec00 x1409022154248387/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344244523 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 09:13:43 ts-server-04 kernel: LustreError: 15362:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Aug 6 09:13:43 ts-server-04 kernel: LustreError: Skipped 1 previous similar message
Aug 6 09:13:43 ts-server-04 kernel: LustreError: 15443:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810411272c00 x1409022154248388/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344244523 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 09:13:58 ts-server-04 kernel: LustreError: 137-5: UUID 'server-OST0001_UUID' is not available for connect (no target)
Aug 6 09:13:58 ts-server-04 kernel: LustreError: 15474:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81015c6ad800 x1409022154255082/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344244538 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 09:13:58 ts-server-04 kernel: LustreError: Skipped 1 previous similar message
Aug 6 09:14:15 ts-server-04 kernel: LustreError: 137-5: UUID 'server-OST0000_UUID' is not available for connect (no target)
Aug 6 09:14:15 ts-server-04 kernel: LustreError: 14678:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101fc9bf400 x1409022154262862/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344244555 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 09:14:15 ts-server-04 kernel: LustreError: 14678:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Aug 6 09:14:15 ts-server-04 kernel: LustreError: Skipped 1 previous similar message
Aug 6 09:14:34 ts-server-04 kernel: LustreError: 137-5: UUID 'server-OST0001_UUID' is not available for connect (no target)
Aug 6 09:14:34 ts-server-04 kernel: LustreError: 15471:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8102f995dc00 x1409022154271285/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344244574 ref 1 fl Interpret:/0/0 rc -19/0
Aug 6 09:14:34 ts-server-04 kernel: LustreError: 15471:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Aug 6 09:14:34 ts-server-04 kernel: LustreError: Skipped 1 previous similar message
Aug 6 09:32:06 ts-server-04 kernel: Lustre: server-OST0002: haven't heard from client f9ee7471-05e3-470f-c90a-a449aafa6cd3 (at 160.62.219.92@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 6 09:32:06 ts-server-04 kernel: Lustre: Skipped 1 previous similar message
Aug 6 09:32:06 ts-server-04 kernel: Lustre: server-OST0003: haven't heard from client f9ee7471-05e3-470f-c90a-a449aafa6cd3 (at 160.62.219.92@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 6 13:30:39 ts-server-04 kernel: LustreError: 137-5: UUID 'server-OST0001_UUID' is not available for connect (no target)
Aug 6 13:30:39 ts-server-04 kernel: LustreError: 15494:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81041a629000 x1408397551874266/t0 o8-><?>@<?>:0/0 l |
| Comment by Johann Lombardi (Inactive) [ 14/Aug/12 ] |
|
Tom, I can't find the logs for 192.168.255.254 in the tarballs you uploaded. As for 160.62.219.92, the log file only starts at Aug 6 15:25:19. Could you please provide us with the logs prior to that?

Aug 6 15:25:19 compute-0-60 kernel: Memory for crash kernel (0x0 to 0x0) not within permissible range
Aug 6 15:25:19 compute-0-60 kernel: i8042.c: No controller found.
Aug 6 15:25:19 compute-0-60 kernel: LustreError: 2362:0:(socklnd.c:2599:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
Aug 6 15:25:19 compute-0-60 kernel: LustreError: 105-4: Error -100 starting up LNI tcp
Aug 6 15:25:19 compute-0-60 kernel: LustreError: 2362:0:(events.c:725:ptlrpc_init_portals()) network initialisation failed
Aug 6 15:25:19 compute-0-60 kernel: dca service started, version 1.4
Aug 6 15:25:57 compute-0-60 sshd[4211]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Aug 6 15:26:03 compute-0-60 fsmpm[4332]: PortMapper: clock jumped backward 0.4s
Aug 6 15:26:05 compute-0-60 smartd[4591]: Problem creating device name scan list
Aug 7 15:21:33 compute-0-60 kernel: LustreError: 3808:0:(acceptor.c:435:lnet_acceptor()) Refusing connection from 160.62.255.62: insecure port 41537

As asked before, could you please tell us more about the network configuration? |
| Comment by Tom Riemer (Inactive) [ 14/Aug/12 ] |
|
The messages file appears to be out of order; I'm guessing someone catted the messages files together before sending them to us. compute-0-60 is, as far as I am aware, 192.168.255.254@tcp. I didn't have the 160.62.219.92 logs until this morning. It does look like I uploaded the same file twice (the 192. one). I've attached a diagram of their network layout. |
| Comment by Johann Lombardi (Inactive) [ 14/Aug/12 ] |
|
192_168_255_254.LOGS.tgz actually seems to be the logs of 160.62.219.60@tcp:

$ tar zxvf 192_168_255_254.LOGS.tgz
LOGS/
LOGS/lustre.timeout
LOGS/lustre.lnet.nis
LOGS/lustre.lnet.peers
LOGS/lustre.max_read_ahead_whole_mb
LOGS/lustre.ldlm_timeout
LOGS/var_log_messages_all.1344867778
LOGS/lustre.qos_prio_free
LOGS/lustre.version
LOGS/dmesg.1344875769
LOGS/lustre.devices
LOGS/lustreINFO.sh
LOGS/lustre.max_cached_mb
LOGS/lustre.max_rw_chunk
LOGS/lustre_lctl_get_param__osc.all.timeouts
LOGS/lustre.max_read_ahead_mb
LOGS/lustre.max_read_ahead_per_file_mb
LOGS/lustre.qos_maxage
$ grep -r 192.168.255 *
$ grep -r 160.62.219 *
dmesg.1344875769:Lustre: Added LNI 160.62.219.60@tcp [8/256/0/180]
lustre.lnet.nis:160.62.219.60@tcp up -1 9 8 0 256 256 243

AFAICS, we still don't have the logs of 192.168.255.254. As for 160.62.219.92, it got evicted from OST0002 just after some error messages from the network stack:

Aug 6 05:30:14 compute-0-92 kernel: LustreError: 16951:0:(linux-tcpip.c:469:libcfs_sock_create()) Error trying to bind to port 1023: -99
Aug 6 05:30:14 compute-0-92 kernel: LustreError: 11e-e: Unexpected error -99 connecting to 10.145.61.155@tcp at host 10.145.61.155 on port 988
Aug 6 05:30:14 compute-0-92 fsmpm[4303]: Portmapper: ras_try_forward_event: FsmPMConnect failed (101), The File System Services on 10.145.6.58 may be stopped.
Aug 6 05:32:34 compute-0-92 kernel: LustreError: 11e-e: Unexpected error -22 connecting to 10.145.61.153@tcp at host 10.145.61.153 on port 988
Aug 6 05:33:19 compute-0-92 kernel: LustreError: 167-0: This client was evicted by server-OST0002; in progress operations using this service will fail.

It really seems that you have network problems or a misconfiguration and, judging from the logs, Lustre isn't the only service suffering from it. |
| Comment by Doug Oucharek (Inactive) [ 14/Aug/12 ] |
|
In the client logs, I am seeing a lot of network errors from fsmpm. My understanding is that this is the File System Portmapper used by the Xsan file system. Is Xsan active on the clients? I'm wondering if this portmapper is causing problems for Lustre, since we use pre-defined port numbers. If Xsan is active, is it possible to disable it to see if that stabilizes the client? |
| Comment by Peter Piela (Inactive) [ 14/Aug/12 ] |
|
Thanks Doug. I have passed your comments/request to the customer, and am waiting for a response. |
| Comment by Peter Piela (Inactive) [ 16/Aug/12 ] |
|
The fsmpm portmapper daemon is associated with a StorNext deployment. The cluster compute nodes have access to both StorNext and Lustre filesystems. It looks like you can configure the range of ports that StorNext can use with min/max options in the fsports configuration file. The customer is setting up a call with StorNext to review the situation. |
| Comment by Peter Piela (Inactive) [ 23/Aug/12 ] |
|
Based on discussions with the customer and StorNext the consensus is that (1) there is no contention issue with StorNext, and (2) the Lustre connectivity issues cannot be explained by the weekly port-scanning activity. The customer ran their benchmark workloads on Tuesday night (8/21), and saw the same failures with IOException errors being reported by the applications. The customer has observed that the problem occurs when the cluster is significantly utilized. What are your thoughts on next steps? |
| Comment by Liang Zhen (Inactive) [ 27/Aug/12 ] |
|
do you have "options lnet accept=all" in lustre.conf (or modprobe.conf) ? |
| Comment by Doug Oucharek (Inactive) [ 27/Aug/12 ] |
|
Given that the problem occurs when the cluster is significantly utilized, I have to assume there is some nature of networking issue. It is interesting that StorNext is logging connection issues as well. My suspicions, then, turn to the network itself and possibly the servers. As well as checking the modprobe options as Liang has indicated above, can the following be done:

1. On a client which has experienced this issue, post the results of "netstat -s" here. This will allow us to see networking errors such as frequent retransmissions and dropped/reset connections.

2. On the servers, run: "echo +neterror > /proc/sys/lnet/printk". This will provide more networking error messages on the server's console.

3. If there is a network pathway which seems to be experiencing issues more often, would it be possible to run the LNet selftests (via the lst utility)? This was created to exercise and push the boundaries of a network pathway. If there is something unstable with that pathway (e.g. a borderline cable), this test can often trigger problems. |
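For step 3, a session following the stock lst workflow might look like the sketch below. The NIDs are placeholders standing in for this site's client and server addresses; it assumes the lnet_selftest module is loaded on the console node and on every node named in the groups.

```shell
#!/bin/sh
# Sketch of an LNet self-test session exercising one client/server pathway.
export LST_SESSION=$$                      # session id used by all lst commands

lst new_session rw_stress                  # start a named session
lst add_group clients 192.168.255.254@tcp  # placeholder client NID
lst add_group servers 10.145.61.155@tcp    # placeholder server NID
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
lst stat clients servers                   # watch rates/errors; Ctrl-C to stop
lst end_session
```

Sustained errors or collapsing throughput here, while a plain ping stays clean, would point at the pathway (cabling, switch port, NIC) rather than at Lustre itself.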
| Comment by Peter Piela (Inactive) [ 30/Aug/12 ] |
|
This is the only line in the modprobe.conf file that refers to Lustre: "options lnet networks=tcp0(eth3)" |
| Comment by Isaac Huang (Inactive) [ 31/Aug/12 ] |
|
Hi Peter, "options lnet networks=tcp0(eth3)" does NOT guarantee that Lustre traffic will go over eth3. Can you also show 'ifconfig' and 'route -n' outputs on that node? Also, there seem to be several networks (10.145.61.*, 160.62.219.*, and 192.168.255.*); can you please elaborate a bit on the network topology? Thanks! |
| Comment by Peter Piela (Inactive) [ 31/Aug/12 ] |
|
Hi Isaac, |
| Comment by Isaac Huang (Inactive) [ 04/Sep/12 ] |
|
"options lnet networks=tcp0(eth3)" tells the socklnd to use the IP address of the eth3 interface for its @tcp0 NID, and all passive and active TCP connections will be locally bound to this IP address. Consequently, whether the traffic actually leaves via eth3 still depends on the kernel routing table.
You can find out by doing writes from a compute node and watching which interface's "TX bytes" counter is growing by running 'ifconfig' repeatedly. |
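A minimal way to watch that counter without eyeballing repeated ifconfig output (a sketch assuming a Linux client that exposes per-interface statistics under /sys/class/net; eth3 is the interface named in the modprobe line):

```shell
#!/bin/sh
# Sample an interface's TX byte counter twice and print the delta,
# i.e. how much traffic left via that interface during the interval.
iface=${1:-eth3}     # interface to watch (eth3 per the lnet networks option)
interval=${2:-5}     # seconds between samples

t0=$(cat /sys/class/net/"$iface"/statistics/tx_bytes)
sleep "$interval"
t1=$(cat /sys/class/net/"$iface"/statistics/tx_bytes)
echo "$iface: $((t1 - t0)) bytes transmitted in ${interval}s"
```

Run it on a compute node while a large write is in flight; a near-zero delta on eth3 with traffic growing on another interface would confirm the routing-table suspicion.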
| Comment by Peter Piela (Inactive) [ 13/Sep/12 ] |
|
Isaac, I would like to return to one of the basic symptoms of this problem. When the problem occurs (OSTs become unavailable), the compute node has no problem pinging the storage server that contains the unavailable OSTs, but an lctl ping to the same node fails. It seems to me that with additional tracing or diagnostics we should be able to understand why the lctl ping is failing. Are there additional diagnostics we can enable on the client and server to analyze the problem? |
| Comment by Isaac Huang (Inactive) [ 13/Sep/12 ] |
|
Peter, Doug suggested in a previous comment enabling neterror console logging, which I believe would give additional diagnostics to nail down the lctl ping failure. Please do "echo +neterror > /proc/sys/lnet/printk" on the server and on a couple of clients that have had the lctl ping failure. This setting is not persistent and must be redone after each reboot. Also, my hunch is that this might be a duplicate of another ticket; the lnet/ksocklnd modules have to be reloaded for the fix to come into effect. There's no requirement on order: you can do it on some clients first or servers first, or in whatever order is most convenient. |
| Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ] |
|
Peter, |
| Comment by John Fuchs-Chesney (Inactive) [ 15/Mar/14 ] |
|
Looks like we will not make any further progress on this issue. |