[LU-2429] easy to find bad client - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Labels:
None
Environment:
lustre 1.8.8 RHEL5

Severity:
3
Rank (Obsolete):
5754

Description

We have a network problem at the customer site: the clients are still running, but the network is unstable. In that situation, the Lustre servers sometimes refuse new connections because they are still waiting for an active RPC to finish.

e.g.)
Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs
Nov 6 10:51:16 oss212 kernel: LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) req@ffff8106a3c46400 x1415646605273905/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1352166761 ref 1 fl Interpret:H/0/0 rc -107/0

In some cases we can find the bad client and reboot it, or evict it from the servers and let it reconnect, and then the situation recovers.

However, in most cases it is hard to find the bad client, and the error messages keep appearing. If we can't find the bad client, new clients can't reconnect until all clients are rebooted, which is not a good option.

Is there a good way to easily find the bad client when the above logs appear?

Attachments

20121210_t2s007037_sysrq_t.log.tgz
150 kB
10/Dec/12 8:22 AM
20121210_t2s007037.log
241 kB
10/Dec/12 7:04 AM

Issue Links

is related to

LU-793 Reconnections should not be refused when there is a request in progress from this client.

Resolved

Activity

Johann Lombardi (Inactive) added a comment - 05/Dec/12 2:02 PM

> So, "still busy with 1 active RPCs" means the reconnecting client's RPC still remains?

It means that there is still a service thread processing a request from the previous connection, which prevents the client from reconnecting.

> yes, it's aborted normally,

ok

> but sometimes, it doesn't abort and the client can't reconnect forever.

That's not normal. In this case, you should see watchdogs on the server side, and the stack trace would help us understand where the service thread is stuck.

> I wonder if we can do force abort and skip waiting for this processing.

I'm afraid that we can't.

Bruno Faccini (Inactive) added a comment - 05/Dec/12 11:43 AM

BTW, are there any messages on the client side (say 560@gni, taken from your server logs) around the same time?

Also, is there any way to get some debug data (a live "crash" tool session, Alt+SysRq, ...) on the client side that may help find out whether some thread is stuck?

Shuichi Ihara (Inactive) added a comment - 05/Dec/12 9:44 AM

Bruno, yes, understood. Although in this case the network problem causes the situation, we have sometimes seen this problem even when there is no network problem. I want to get rid of this still-active RPC and evict that client manually; otherwise we have to wait a very long time for it to reconnect.

Bruno Faccini (Inactive) added a comment - 05/Dec/12 9:15 AM

You can also monitor the logs/messages directly on all clients, along with /proc/fs/lustre/osc/*/state, which will give you the picture from the client side.

But don't forget that if you suspect network/interconnect problems, you should first troubleshoot them using the appropriate tools.
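
As a rough illustration of the client-side check suggested above, the following Python sketch collects the contents of every file matching /proc/fs/lustre/osc/*/state on a client. The function name `read_osc_states` is hypothetical, and the sketch makes no assumption about the layout of the state file: it simply returns the raw text so it can be compared across clients.

```python
import glob
import os


def read_osc_states(pattern="/proc/fs/lustre/osc/*/state"):
    """Return {osc directory name: raw contents of its state file}.

    The default pattern is the path mentioned in the comment above;
    it can be overridden, e.g. to point at a copy of /proc taken
    from another node.
    """
    states = {}
    for path in sorted(glob.glob(pattern)):
        # The parent directory name identifies the OSC device.
        name = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                states[name] = f.read()
        except OSError as exc:
            # Keep going if one entry cannot be read.
            states[name] = "<unreadable: %s>" % exc
    return states


if __name__ == "__main__":
    for name, text in read_osc_states().items():
        print("== %s ==" % name)
        print(text.rstrip())
```

Run on each client (for example via a parallel shell), this gives a per-target snapshot to set against the server-side "refuse reconnection" messages.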

Bruno Faccini (Inactive) added a comment - 05/Dec/12 9:15 AM You can also monitor the log/msgs directly on all Clients and /proc/fs/lustre/osc/*/state, it will give you the picture from Clients side. But don't forget that if you suspect network/interconnect problems, you better have to 1st troubleshoot it using appropriated tools.

Shuichi Ihara (Inactive) added a comment - 05/Dec/12 9:12 AM

Hi Johann,
So, "still busy with 1 active RPCs" means the reconnecting client's RPC still remains?
Yes, normally it is aborted, but sometimes it isn't, and then the client can never reconnect.
I wonder if we can force an abort and skip waiting for this processing.

Johann Lombardi (Inactive) added a comment - 05/Dec/12 8:38 AM

Ihara, the message actually prints the NID (i.e. 560@gni). Normally, such RPCs should be aborted after some time and the client should then be able to reconnect. Is that the case?
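
Since the NID is already in the message, one way to spot the affected clients is to tally the "refuse reconnection" lines per target and NID. The Python sketch below does that. The regular expression is derived only from the single sample line in the description, so it is an assumption that other Lustre versions format the message the same way, and the names `REFUSE_RE` and `refused_clients` are hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the sample line in the description:
#   "<target>: refuse reconnection from <uuid>@<nid> to <addr>;
#    still busy with <n> active RPCs"
# Assumption: the client UUID itself contains no "@".
REFUSE_RE = re.compile(
    r"(?P<target>\S+): refuse reconnection from "
    r"(?P<uuid>[^@\s]+)@(?P<nid>\S+) to \S+; "
    r"still busy with (?P<active>\d+) active RPCs"
)


def refused_clients(lines):
    """Count refused reconnections per (target, client NID)."""
    counts = Counter()
    for line in lines:
        m = REFUSE_RE.search(line)
        if m:
            counts[(m.group("target"), m.group("nid"))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:"
        "(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: "
        "refuse reconnection from "
        "6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to "
        "0xffff81043d807a00; still busy with 1 active RPCs"
    )
    # Most frequently refused (target, NID) pairs first.
    for (target, nid), n in refused_clients([sample]).most_common():
        print(n, nid, target)
```

Fed the server syslog instead of the sample line, the pairs that keep appearing over a long period are the candidates for a stuck service thread.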
