[LU-2429] easy to find bad client Created: 04/Dec/12 Updated: 23/Feb/13 Resolved: 23/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: |
lustre 1.8.8 RHEL5 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5754 |
| Description |
|
We have a network problem at a customer site: the clients are still running, but the network is unstable. In that situation, the Lustre servers sometimes refuse new connections because they are still waiting for some active RPCs to finish. In some cases we can find the bad client and reboot it, or evict it on the servers and let it reconnect, and things recover. However, in most cases it is hard to find the bad client, and the error messages keep coming. If we can't find the bad client, new clients can't reconnect until all clients are rebooted, which is not acceptable. Is there a good way to easily find the bad client when the above logs appear? |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 05/Dec/12 ] |
|
Ihara, the message actually prints the NID (i.e. 560@gni). Normally, such RPCs should be aborted after some time and the client should then be able to reconnect. Is that the case? |
| Comment by Shuichi Ihara (Inactive) [ 05/Dec/12 ] |
|
Hi Johann, |
| Comment by Bruno Faccini (Inactive) [ 05/Dec/12 ] |
|
You can also monitor the logs/messages directly on all clients, along with /proc/fs/lustre/osc/*/state; that will give you the picture from the client side. But don't forget that if you suspect network/interconnect problems, you had better troubleshoot them first with the appropriate tools. |
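As a minimal sketch of that client-side check (assuming the 1.8 layout where each osc state file reports a current_state line; anything other than FULL points at a problematic import):
# grep -H current_state /proc/fs/lustre/osc/*/state
# grep -H current_state /proc/fs/lustre/osc/*/state | grep -v FULL    # show only the imports that are not FULL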
| Comment by Shuichi Ihara (Inactive) [ 05/Dec/12 ] |
|
Bruno, yes, understood. Although in this case the network problem caused the situation, the problem is that we have sometimes seen it even when there is no network problem. I want to clear those still-active RPCs and evict that client manually; otherwise we have to wait a very long time before it can reconnect. |
| Comment by Bruno Faccini (Inactive) [ 05/Dec/12 ] |
|
BTW, are there any messages on the client side (say on 560@gni, to take the example from your server logs) around the same time? Also, is there any way to get some debug analysis (a live "crash" tool session, Alt+SysRq, ...) on the client side that may help to find out whether some thread is stuck? |
| Comment by Johann Lombardi (Inactive) [ 05/Dec/12 ] |
It means that there is still a service thread processing a request from the previous connection, which prevents the client from reconnecting.
ok
That's not normal. In this case, you should see watchdogs on the server side, and the stack trace would help us understand where the service thread is stuck.
I'm afraid that we can't |
| Comment by Andreas Dilger [ 05/Dec/12 ] |
|
As Johann mentions, the client NID is in this message. It is also possible to grep for the client UUID (6279e611-9d6b-3d6a-bab4-e76cf925282f in this case) in /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid. Note that there is a bug open for the "still busy" problem ( |
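For example, a minimal sketch of that lookup on the OSS (the UUID and target name are taken from the log message; on 1.8 the exports/ subdirectories are named after the client NIDs, so the matching path identifies the client):
# grep -l 6279e611-9d6b-3d6a-bab4-e76cf925282f /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid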
| Comment by Bruno Faccini (Inactive) [ 06/Dec/12 ] |
|
We need to make progress here, and at least try to understand the real conditions under which your problem occurs. |
| Comment by Johann Lombardi (Inactive) [ 07/Dec/12 ] |
While I agree that we should consider removing this protection, I think we first need to understand how a service thread can be stuck forever as reported by Ihara. Ihara, there should definitely be some watchdogs printed on the console. It would be very helpful if you could provide us with those logs. Otherwise, there is not much we can do, I'm afraid. |
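For reference, a rough sketch of how to spot those watchdogs in the server console logs (the grep patterns are only examples; the exact wording of the watchdog message differs between versions):
# grep -iE "watchdog|was inactive for|dumping the stack" /var/log/messages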
| Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ] |
|
I saw "still busy with x active RPCs" problems a couple of time, and posted on here in general. # grep "still busy" 20121210_t2s007037.log Dec 10 19:15:28 t2s007037 kernel: Lustre: 16504:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 448ded8b-6867-b4e3-b095-24a1194a0311@192.168.20.53@tcp1 to 0xffff81060f828e00; still busy with 4 active RPCs Dec 10 19:15:28 t2s007037 kernel: Lustre: 20370:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 98aabcbb-79bf-0dd8-3a0e-f869054aa095@192.168.19.31@tcp1 to 0xffff81028b12ba00; still busy with 4 active RPCs Dec 10 19:15:28 t2s007037 kernel: Lustre: 5499:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 1e0c4bbc-b2a9-1268-afaf-811307e85c34@192.168.19.80@tcp1 to 0xffff81006d77b600; still busy with 3 active RPCs Dec 10 19:15:31 t2s007037 kernel: Lustre: 5534:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 8e966893-d9e9-3508-a406-c2132095af5f@10.1.10.84@o2ib to 0xffff81018682a200; still busy with 8 active RPCs Dec 10 19:15:33 t2s007037 kernel: Lustre: 16481:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from c2b698fa-a4d9-ff0c-6dc5-298134339777@192.168.19.50@tcp1 to 0xffff8100d2fb0400; still busy with 8 active RPCs ... |
| Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ] |
|
full OSS"s messages attached. |
| Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ] |
Ihara, threads are stuck waiting for commit. Any chance to collect the output of a sysrq-t (or even better a crash dump)? |
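For reference, a minimal sketch of collecting a sysrq-t on the OSS (standard kernel interface; the dmesg ring buffer may be too small to hold every task, so a serial or netconsole capture is safer):
# echo 1 > /proc/sys/kernel/sysrq
# echo t > /proc/sysrq-trigger
# dmesg > /tmp/sysrq-t.$(hostname).txt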
| Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ] |
|
This is the OSS's sysrq-t output that we just collected. |
| Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ] |
jbd2/dm-0-8   D ffff8101d86aa860     0 16413    247  16414 16412 (L-TLB)
 ffff8102dc6edb90 0000000000000046 0000000000000282 0000000000000008
 ffff8101b3c483c0 000000000000000a ffff81060d6a1860 ffff8101d86aa860
 0017bf15f0c46f86 0000000000000be8 ffff81060d6a1a48 0000000a0afb26b8
Call Trace:
 [<ffffffff8006ece7>] do_gettimeofday+0x40/0x90
 [<ffffffff8005a40e>] getnstimeofday+0x10/0x29
 [<ffffffff80028bd3>] sync_page+0x0/0x42
 [<ffffffff800637de>] io_schedule+0x3f/0x67
 [<ffffffff80028c11>] sync_page+0x3e/0x42
 [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
 [<ffffffff8003f9ab>] __lock_page+0x5e/0x64
 [<ffffffff800a34e5>] wake_bit_function+0x0/0x23
 [<ffffffff80047c5b>] pagevec_lookup_tag+0x1a/0x21
 [<ffffffff8001d035>] mpage_writepages+0x14f/0x37d
 [<ffffffff88a87bc0>] :ldiskfs:ldiskfs_writepage+0x0/0x3a0
 [<ffffffff800a34c0>] autoremove_wake_function+0x9/0x2e
 [<ffffffff8008d2a9>] __wake_up_common+0x3e/0x68
 [<ffffffff88a622b4>] :jbd2:jbd2_journal_commit_transaction+0x36c/0x1120
 [<ffffffff8004ad55>] try_to_del_timer_sync+0x7f/0x88
 [<ffffffff88a6623e>] :jbd2:kjournald2+0x9a/0x1ec
 [<ffffffff800a34b7>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88a661a4>] :jbd2:kjournald2+0x0/0x1ec
 [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032654>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032556>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

hm, this reminds me of https://bugzilla.lustre.org/show_bug.cgi?id=21406#c75 which can happen if we somehow leave dirty pages in the OSS page cache (which shouldn't be the case) and the jbd2 thread tries to flush them. |
| Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ] |
|
Might this be the same problem? http://jira.whamcloud.com/browse/LU-1219 |
| Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ] |
Yes, it looks similar.
Yes, although I really would like to understand how we can end up with dirty pages in the inode mapping ... |
| Comment by Shuichi Ihara (Inactive) [ 10/Dec/12 ] |
|
Johann, |
| Comment by Bruno Faccini (Inactive) [ 10/Dec/12 ] |
|
BTW, it is strange that the SysRq output only shows 11 running task stacks for your 12-core OSS! But this may come from the fact (an option?) that the swapper/idle task stacks are not dumped ... I agree with you, Johann, that task/pid 16413 is the one blocking all the others, but don't you think there could be some issue on the disk/storage/back-end side? |
| Comment by Johann Lombardi (Inactive) [ 10/Dec/12 ] |
|
Ihara, it is safe to use data=writeback since Lustre already pushes data to disk before committing, so you already have the ordering guarantee. Bruno, the stack trace shows that the jbd2 thread in charge of the commit is waiting for some dirty pages to be flushed, which should never happen on the OSS. The issue is that we wait for the commit with the pages locked, so there is a deadlock between the service threads and the jbd2 thread. Therefore, we should try to understand how we can end up with dirty pages in the page cache. |
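One possible way to apply that work-around persistently, as a sketch only: tunefs.lustre --mountfsoptions replaces the existing persistent options rather than appending to them, so copy the current ones from a --dryrun first (the device path and option list here are just placeholders):
# tunefs.lustre --dryrun /dev/ost_device                            # note the current "Persistent mount opts"
# tunefs.lustre --mountfsoptions="<current opts>,data=writeback" /dev/ost_device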
| Comment by Bruno Faccini (Inactive) [ 11/Dec/12 ] |
|
Ihara, do you think you can take an OSS crash-dump? Even if "data=writeback" seems to be a good work-around candidate and works in the end, we need to understand how we fall into a situation where the jbd2 thread finds dirty pages to flush when it should not! |
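A minimal sketch of forcing such a dump, assuming kdump is already configured on the OSS (this panics and reboots the node, so only trigger it while the hang is actually present):
# service kdump status                 # make sure kdump is armed
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger         # crashes the node; the vmcore lands under /var/crash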
| Comment by Shuichi Ihara (Inactive) [ 11/Dec/12 ] |
|
Bruno, |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/12 ] |
|
No, I am afraid that only "data=writeback" can be considered a work-around for the problem you are encountering. But again, it can only be used as a work-around, and we need to understand your problem's root cause, because even when running with it you may later end up in another hung situation ... |
| Comment by Bruno Faccini (Inactive) [ 28/Dec/12 ] |
|
Hello Ihara, |
| Comment by Bruno Faccini (Inactive) [ 21/Jan/13 ] |
|
Ihara, any news? Can you please provide us with a status for this ticket? |
| Comment by Shuichi Ihara (Inactive) [ 23/Jan/13 ] |
|
Hi Bruno, sorry for the delayed update on this. We haven't seen the same problem, nor have we been able to reproduce it, since the final crash we took... |
| Comment by Bruno Faccini (Inactive) [ 28/Jan/13 ] |
|
So do you agree to close this ticket? |
| Comment by Shuichi Ihara (Inactive) [ 22/Feb/13 ] |
|
If we hit the same problem again at this site, I will reopen it as a new ticket, so at this moment please close this ticket. |
| Comment by Peter Jones [ 23/Feb/13 ] |
|
ok thanks Ihara |