[LU-2429] easy to find bad client Created: 04/Dec/12  Updated: 23/Feb/13  Resolved: 23/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre 1.8.8, RHEL5


Attachments: Text File 20121210_t2s007037.log     File 20121210_t2s007037_sysrq_t.log.tgz    
Issue Links:
Related
is related to LU-793 Reconnections should not be refused w... Resolved
Severity: 3
Rank (Obsolete): 5754

 Description   

We have a network problem at a customer site: the clients keep running, but the network is unstable. In that situation the Lustre servers sometimes refuse reconnections because they are still waiting for active RPCs to finish.

e.g.)
Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs
Nov 6 10:51:16 oss212 kernel: LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (107) req@ffff8106a3c46400 x1415646605273905/t0 o400><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1352166761 ref 1 fl Interpret:H/0/0 rc -107/0

In some cases we can find the bad client and reboot it, or evict it from the servers and let it reconnect, and the situation recovers.

However, in most cases it is hard to find the bad client and the error messages keep appearing. Unless we can find the bad client, the affected clients cannot reconnect until all clients are rebooted, which is not a good solution.

Is there a good way to easily find the bad client when the above messages appear?



 Comments   
Comment by Johann Lombardi (Inactive) [ 05/Dec/12 ]

Ihara, the message actually prints the NID (i.e. 560@gni). Normally, such RPCs should be aborted after some time and the client should then be able to reconnect. Is that the case?

Comment by Shuichi Ihara (Inactive) [ 05/Dec/12 ]

Hi Johann,
So does "still busy with 1 active RPCs" mean that an RPC from the reconnecting client is still outstanding?
Yes, it is normally aborted, but sometimes it is not aborted and the client can never reconnect.
I wonder if we can force-abort the RPC and skip waiting for it.

Comment by Bruno Faccini (Inactive) [ 05/Dec/12 ]

You can also monitor the logs/messages directly on all clients, as well as /proc/fs/lustre/osc/*/state; this will give you the picture from the client side.
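
For example (a minimal sketch against the 1.8-era /proc layout mentioned above; adjust the paths to your installation), a quick loop over the OSC state files on a client would look like:

# Dump the connection state of every OSC import on this client; a healthy
# import normally shows FULL, while a broken one sits in CONNECTING/DISCONN.
for f in /proc/fs/lustre/osc/*/state; do
    echo "== $f =="
    cat "$f"
done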

But don't forget that if you suspect network/interconnect problems, you should first troubleshoot them with the appropriate tools.

Comment by Shuichi Ihara (Inactive) [ 05/Dec/12 ]

Bruno, yes, understood. In this case the network problem caused the situation, but we have sometimes seen the same problem even when there was no network problem. I would like a way to get around the still-active RPC and evict that client manually; otherwise we have to wait a very long time before it can reconnect.
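
For reference, a manual eviction on the OSS can be done through the obdfilter evict_client entry (a hedged sketch based on the 1.8 /proc layout, reusing the UUID and OST name from the log excerpt above; double-check the exact syntax for your version):

# On the OSS: evict the stale export of the suspect client, identified by the
# UUID printed in the "refuse reconnection from <uuid>@<nid>" message.
echo 6279e611-9d6b-3d6a-bab4-e76cf925282f > \
    /proc/fs/lustre/obdfilter/LARGE01-OST004c/evict_client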

Comment by Bruno Faccini (Inactive) [ 05/Dec/12 ]

BTW, are there any messages on the client side (say 560@gni, taken from your server logs) around the same time??

Also, is there any way to get some debug analysis (a live "crash" tool session, Alt+SysRq, ...) on the client side that might help determine whether some thread is stuck???

Comment by Johann Lombardi (Inactive) [ 05/Dec/12 ]

So, "still busy with 1 active RPCs" means reconnected client's RPC is sitll remained?

It means that there is still a service thread processing a request from the previous connection, which prevents the client from reconnecting.

Yes, it is normally aborted,

ok

but sometimes it is not aborted and the client can never reconnect.

That's not normal. In this case you should see watchdogs on the server side, and the stack traces would help us understand where the service thread is stuck.
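
For example (a hedged sketch; the exact watchdog wording can vary between Lustre versions), those watchdog messages and the stack dumps that follow them can usually be pulled out of the OSS syslog with something like:

# On the OSS: find service-thread watchdog messages and keep the stack
# trace lines dumped right after them.
grep -A 40 "was inactive for" /var/log/messages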

I wonder if we can force-abort the RPC and skip waiting for it.

I'm afraid we can't.

Comment by Andreas Dilger [ 05/Dec/12 ]

As Johann mentions, the client NID is in this message. It is also possible to grep for the client UUID (6279e611-9d6b-3d6a-bab4-e76cf925282f in this case) in /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid.
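
For example (a minimal sketch reusing the UUID and OST name from the message quoted above):

# On the OSS: find which export directory (named after the client NID)
# holds the UUID printed in the "refuse reconnection" message.
grep -l 6279e611-9d6b-3d6a-bab4-e76cf925282f \
    /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid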

Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616). The last time I spoke with him about this he wasn't quite happy with the patch, but maybe someone else could look into fixing and landing it? I think this is a common and long-standing problem and it would be nice to fix it if possible.

Comment by Bruno Faccini (Inactive) [ 06/Dec/12 ]

We need to make progress here and at least try to understand the real conditions under which your problem occurs.
So can you at least provide us with the syslogs, covering the problem time-frame, from one affected client and the associated server/OSS??
Thanks.

Comment by Johann Lombardi (Inactive) [ 07/Dec/12 ]

Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616).

While I agree that we should consider removing this protection, I think we first need to understand how a service thread can be stuck forever, as reported by Ihara.

Ihara, there should definitely be some watchdogs printed on the console. It would be very helpful if you could provide us with those logs. Otherwise there is not much we can do, I'm afraid.

Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ]

I saw "still busy with x active RPCs" problems a couple of time, and posted on here in general.
But, just in now, we got same problem at the one of our customers. I think it should be a root cause, but want to find what client is stacking RPCs? can we find bad client from following logs on OSS?

# grep "still busy" 20121210_t2s007037.log 
Dec 10 19:15:28 t2s007037 kernel: Lustre: 16504:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 448ded8b-6867-b4e3-b095-24a1194a0311@192.168.20.53@tcp1 to 0xffff81060f828e00; still busy with 4 active RPCs
Dec 10 19:15:28 t2s007037 kernel: Lustre: 20370:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 98aabcbb-79bf-0dd8-3a0e-f869054aa095@192.168.19.31@tcp1 to 0xffff81028b12ba00; still busy with 4 active RPCs
Dec 10 19:15:28 t2s007037 kernel: Lustre: 5499:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 1e0c4bbc-b2a9-1268-afaf-811307e85c34@192.168.19.80@tcp1 to 0xffff81006d77b600; still busy with 3 active RPCs
Dec 10 19:15:31 t2s007037 kernel: Lustre: 5534:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 8e966893-d9e9-3508-a406-c2132095af5f@10.1.10.84@o2ib to 0xffff81018682a200; still busy with 8 active RPCs
Dec 10 19:15:33 t2s007037 kernel: Lustre: 16481:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from c2b698fa-a4d9-ff0c-6dc5-298134339777@192.168.19.50@tcp1 to 0xffff8100d2fb0400; still busy with 8 active RPCs
...
Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ]

full OSS"s messages attached.

Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ]

full OSS"s messages attached.

Ihara, threads are stuck waiting for commit. Any chance to collect the output of a sysrq-t (or even better a crash dump)?
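
For reference, a sysrq-t dump can be triggered from a shell on the OSS (standard kernel interface; the output lands in the kernel log):

# On the OSS: enable SysRq, dump the stack of every task to the kernel
# log, then save the result for the ticket.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq_t.log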

Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ]

Attached is the OSS's sysrq-t output that we collected just now.

Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ]
 jbd2/dm-0-8   D ffff8101d86aa860     0 16413    247         16414 16412 (L-TLB)
  ffff8102dc6edb90 0000000000000046 0000000000000282 0000000000000008
  ffff8101b3c483c0 000000000000000a ffff81060d6a1860 ffff8101d86aa860
  0017bf15f0c46f86 0000000000000be8 ffff81060d6a1a48 0000000a0afb26b8
 Call Trace:
  [<ffffffff8006ece7>] do_gettimeofday+0x40/0x90
  [<ffffffff8005a40e>] getnstimeofday+0x10/0x29
  [<ffffffff80028bd3>] sync_page+0x0/0x42
  [<ffffffff800637de>] io_schedule+0x3f/0x67
  [<ffffffff80028c11>] sync_page+0x3e/0x42
  [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
  [<ffffffff8003f9ab>] __lock_page+0x5e/0x64
  [<ffffffff800a34e5>] wake_bit_function+0x0/0x23
  [<ffffffff80047c5b>] pagevec_lookup_tag+0x1a/0x21
  [<ffffffff8001d035>] mpage_writepages+0x14f/0x37d
  [<ffffffff88a87bc0>] :ldiskfs:ldiskfs_writepage+0x0/0x3a0
  [<ffffffff800a34c0>] autoremove_wake_function+0x9/0x2e
  [<ffffffff8008d2a9>] __wake_up_common+0x3e/0x68
  [<ffffffff88a622b4>] :jbd2:jbd2_journal_commit_transaction+0x36c/0x1120
  [<ffffffff8004ad55>] try_to_del_timer_sync+0x7f/0x88
  [<ffffffff88a6623e>] :jbd2:kjournald2+0x9a/0x1ec
  [<ffffffff800a34b7>] autoremove_wake_function+0x0/0x2e
  [<ffffffff88a661a4>] :jbd2:kjournald2+0x0/0x1ec
  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032654>] kthread+0xfe/0x132
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032556>] kthread+0x0/0x132
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Hmm, this reminds me of https://bugzilla.lustre.org/show_bug.cgi?id=21406#c75, which can happen if we somehow leave dirty pages in the OSS page cache (which shouldn't be the case) and the jbd2 thread tries to flush them.

Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ]

Might this be the same problem? http://jira.whamcloud.com/browse/LU-1219
Also, might data=writeback help prevent this kind of problem?

Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ]

Might this be the same problem? http://jira.whamcloud.com/browse/LU-1219

Yes, it looks similar.

Also, might data=writeback help prevent this kind of problem?

Yes, although I really would like to understand how we can end up with dirty pages in the inode mapping ...

Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ]

Johann,
With data=writeback on a standard ext3/4 filesystem there is no ordering guarantee (the journal may sometimes commit before the data is flushed). So is data=writeback safe with Lustre, with no re-ordering even when writeback mode is enabled on the OST/MDT?
Also, given https://bugzilla.lustre.org/show_bug.cgi?id=21406, why isn't data=writeback the default option in Lustre even today?

Comment by Bruno Faccini (Inactive) [ 10/Dec/12 ]

BTW, LU-1219 is still waiting for the Alt+SysRq+T logs you provided there!!

It is strange that the SysRq output only shows 11 running task stacks for your 12-core OSS!! But this may come from the fact (an option?) that the swapper/idle task stacks are not dumped ...

I agree with you, Johann: task/pid 16413 is the one blocking all the others, but don't you think there could be some issue on the disk/storage/back-end side???

Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ]

Ihara, it is safe to use data=writeback since Lustre already pushes data to disk before committing, so you already have the ordering guarantee.
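
For the record, on ldiskfs-based OSTs this is usually made persistent through the mount options stored on the target (a hedged sketch; /dev/sdX is a placeholder for the OST device, and tunefs.lustre --mountfsoptions replaces the whole option string, so carry over whatever --dryrun reports):

# On the OSS, with the OST unmounted: show the currently stored options...
tunefs.lustre --dryrun /dev/sdX
# ...then rewrite them, keeping the existing options and adding data=writeback.
tunefs.lustre --mountfsoptions="errors=remount-ro,data=writeback" /dev/sdX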

Bruno, the stack trace shows that the jbd2 thread in charge of commit is waiting for some dirty pages to be flushed, which should never happen on the OSS. The issue is that we wait for commit with the pages locked, so there is a deadlock between the service threads and the jbd2 thread. Therefore, we should try to understand how we can end up with dirty pages in the page cache.

Comment by Bruno Faccini (Inactive) [ 11/Dec/12 ]

Ihara, do you think you can take an OSS crash-dump?? Because even if "data=writeback" looks like a good work-around candidate and ends up working, we need to understand how we get into a situation where the jbd2 thread finds dirty pages to flush when it should not!!
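
For reference, with kdump/kexec configured on RHEL5, a dump can be forced while the hang is in progress (standard kernel interface; note that this panics and reboots the node):

# On the hung OSS: force a panic so that a vmcore is written (by default
# under /var/crash) for later analysis with the "crash" tool.
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger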

Comment by Shuichi Ihara (Inactive) [ 11/Dec/12 ]

Bruno,
Unfortunately, we couldn't get a crash dump. You need the same jbd2 stack situation, right? If so, I hope we can capture one the next time the same problem happens.
Are there any other ideas we can test before deciding to switch to data=writeback?

Comment by Bruno Faccini (Inactive) [ 12/Dec/12 ]

No, I am afraid that only "data=writeback" can be considered as a work-around for the problem you encounter. But again, it can only be used as a work-around, and we need to understand your problem's root cause, because even when running with it you may later end up in another hung situation ...

Comment by Bruno Faccini (Inactive) [ 28/Dec/12 ]

Hello Ihara,
Any news on this issue ??
Have you been able to apply the work-around and/or get a new crash-dump ??
Bruno.

Comment by Bruno Faccini (Inactive) [ 21/Jan/13 ]

Ihara, any news?? Could you please provide us with a status for this ticket??

Comment by Shuichi Ihara (Inactive) [ 23/Jan/13 ]

Hi Bruno,

Sorry for the delayed update on this. We haven't seen the same problem, nor been able to reproduce it, since the last crash we had...

Comment by Bruno Faccini (Inactive) [ 28/Jan/13 ]

So do you agree to close this ticket ??

Comment by Shuichi Ihara (Inactive) [ 22/Feb/13 ]

If we hit the same problem again at this site, I will reopen it as a new ticket, so please close this ticket for now.

Comment by Peter Jones [ 23/Feb/13 ]

ok thanks Ihara
