Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • lustre 1.8.8 RHEL5
    • 3
    • 5754

    Description

      We have a network problem at a customer site; the clients are still running, but the network is unstable. In that situation the Lustre servers sometimes refuse new connections because they are still waiting for active RPCs to finish.

      e.g.:
      Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs
      Nov 6 10:51:16 oss212 kernel: LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (107) req@ffff8106a3c46400 x1415646605273905/t0 o400><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1352166761 ref 1 fl Interpret:H/0/0 rc -107/0

      In some cases we can find the bad client and reboot it, or evict it from the servers and reconnect, and then the situation recovers.

      However, in most cases it is hard to find the bad client and the error messages keep appearing. If we cannot find the bad client, clients can't reconnect until all clients have been rebooted, which is not a good solution.

      Is there a good way to easily find the bad client when the above messages appear?

      Attachments

        Issue Links

          Activity

            [LU-2429] easy to find bad client

            This is the OSS's sysrq-t output that we just collected.

            ihara Shuichi Ihara (Inactive) added a comment

            full OSS"s messages attached.

            Ihara, threads are stuck waiting for commit. Any chance to collect the output of a sysrq-t (or even better a crash dump)?

            johann Johann Lombardi (Inactive) added a comment
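
            For reference, a minimal sketch of how the requested sysrq-t output can be collected on the OSS. The sysctl and trigger files are standard Linux, not Lustre-specific; the output file name is just an example:

            # enable all SysRq functions, then dump the stack of every task;
            # the traces go to the kernel log (and usually /var/log/messages)
            echo 1 > /proc/sys/kernel/sysrq
            echo t > /proc/sysrq-trigger
            dmesg > sysrq-t.$(hostname).$(date +%Y%m%d-%H%M%S).log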

            full OSS"s messages attached.

            ihara Shuichi Ihara (Inactive) added a comment

            I saw "still busy with x active RPCs" problems a couple of time, and posted on here in general.
            But, just in now, we got same problem at the one of our customers. I think it should be a root cause, but want to find what client is stacking RPCs? can we find bad client from following logs on OSS?

            # grep "still busy" 20121210_t2s007037.log 
            Dec 10 19:15:28 t2s007037 kernel: Lustre: 16504:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 448ded8b-6867-b4e3-b095-24a1194a0311@192.168.20.53@tcp1 to 0xffff81060f828e00; still busy with 4 active RPCs
            Dec 10 19:15:28 t2s007037 kernel: Lustre: 20370:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 98aabcbb-79bf-0dd8-3a0e-f869054aa095@192.168.19.31@tcp1 to 0xffff81028b12ba00; still busy with 4 active RPCs
            Dec 10 19:15:28 t2s007037 kernel: Lustre: 5499:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 1e0c4bbc-b2a9-1268-afaf-811307e85c34@192.168.19.80@tcp1 to 0xffff81006d77b600; still busy with 3 active RPCs
            Dec 10 19:15:31 t2s007037 kernel: Lustre: 5534:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 8e966893-d9e9-3508-a406-c2132095af5f@10.1.10.84@o2ib to 0xffff81018682a200; still busy with 8 active RPCs
            Dec 10 19:15:33 t2s007037 kernel: Lustre: 16481:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from c2b698fa-a4d9-ff0c-6dc5-298134339777@192.168.19.50@tcp1 to 0xffff8100d2fb0400; still busy with 8 active RPCs
            ...
            
            ihara Shuichi Ihara (Inactive) added a comment
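
            One hedged way to get a first answer from the OSS log alone: the client NID follows the UUID in every "refuse reconnection" line, so it can be extracted and tallied. The sed expression below is written against the messages quoted above and may need adjusting for other log formats:

            # count "refuse reconnection" messages per client NID
            grep "refuse reconnection" 20121210_t2s007037.log \
              | sed -n 's/.*refuse reconnection from [^@]*@\([^ ]*\) to .*/\1/p' \
              | sort | uniq -c | sort -rn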

            Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616).

            While I agree that we should consider removing this protection, I think we first need to understand how a service thread can be stuck forever as reported by Ihara.

            Ihara, there should definitely be some watchdogs printed on the console. It would be very helpful if you could provide us with those logs. Otherwise, there is not much we can do, I'm afraid.

            johann Johann Lombardi (Inactive) added a comment

            We need to make progress here and at least try to understand the real conditions under which your problem occurs.
            So can you at least provide us with the syslogs, covering the problem time frame, from one affected client and the associated server/OSS?
            Thanks.

            bfaccini Bruno Faccini (Inactive) added a comment

            As Johann mentions, the client NID is in this message. It is also possible to grep for the client UUID (6279e611-9d6b-3d6a-bab4-e76cf925282f in this case) in /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid.

            Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616). The last time I spoke with him about this he wasn't quite happy with the patch, but maybe someone else could look into fixing and landing it? I think this is a common and long-standing problem and it would be nice to fix it if possible.

            adilger Andreas Dilger added a comment
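
            A short sketch of the lookup Andreas describes, using the UUID and OST name from the message quoted in the description; the export directory whose uuid file matches is named after the client NID:

            # prints the path(s) containing the UUID; the directory name is the NID
            grep -l 6279e611-9d6b-3d6a-bab4-e76cf925282f \
                /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid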

            So, "still busy with 1 active RPCs" means reconnected client's RPC is sitll remained?

            It means that there is still a service thread processing a request from the previous connection which prevents the client from reconnecting.

            Yes, it's aborted normally.

            ok

            But sometimes it doesn't abort, and the client can never reconnect.

            That's not normal. In this case, you should see watchdogs on the server side, and the stack trace would help us understand where the service thread is stuck.

            I wonder if we can force an abort and skip waiting for this processing.

            I'm afraid that we can't.

            johann Johann Lombardi (Inactive) added a comment
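
            As a rough sketch, the server-side watchdog messages mentioned above can be searched for in the syslog; the exact wording differs between Lustre versions, so a few loose patterns are used here:

            # look for stuck-service-thread watchdogs and the accompanying stack dumps
            egrep -i "watchdog|inactive for|dumping the stack" /var/log/messages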

            BTW, are there any messages on the client side (say 560@gni, for example, taken from your server logs) around the same time?

            Also, is there any way to get some debug analysis (a live "crash" tool session, Alt+SysRq, ...) on the client side that may help find out whether some thread is stuck?

            bfaccini Bruno Faccini (Inactive) added a comment

            Bruno, yes, understood. Although in this case the network problem caused the situation, the issue is that we have sometimes seen this problem even when there is no network problem. I want to get rid of the still-active RPC and evict that client manually; otherwise we have to wait a very long time for it to reconnect.

            ihara Shuichi Ihara (Inactive) added a comment

            You can also monitor the logs/messages directly on all clients, along with /proc/fs/lustre/osc/*/state; that will give you the picture from the client side.

            But don't forget that if you suspect network/interconnect problems, you should first troubleshoot them using the appropriate tools.

            bfaccini Bruno Faccini (Inactive) added a comment
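
            A minimal sketch of the client-side check Bruno suggests, assuming each osc state file exposes a current_state line:

            # any import not in FULL state points at a connection worth a closer look
            for f in /proc/fs/lustre/osc/*/state; do
                echo "== $f"
                grep current_state "$f"
            done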

            People

              bfaccini Bruno Faccini (Inactive)
              ihara Shuichi Ihara (Inactive)
              Votes:
              0
              Watchers:
              7

              Dates

                Created:
                Updated:
                Resolved: