[LU-2429] easy to find bad client - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Labels:
None
Environment:
lustre 1.8.8 RHEL5

Severity:
3
Rank (Obsolete):
5754

Description

We have a network problem at a customer site: the clients are still running, but the network is unstable. In that situation, Lustre servers sometimes refuse new connections because they are still waiting for some active RPCs to finish.

e.g.)
Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs
Nov 6 10:51:16 oss212 kernel: LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) req@ffff8106a3c46400 x1415646605273905/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1352166761 ref 1 fl Interpret:H/0/0 rc -107/0
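
For readers decoding the second message: the trailing `rc -107/0` is a negated Linux errno, and 107 is ENOTCONN ("Transport endpoint is not connected"). The following is a minimal sketch, not part of the original report, that pulls the return code out of such a line. The helper name `reply_rc` and the small errno table are illustrative assumptions; the table hardcodes Linux values so the result does not depend on the platform running the script.

```python
import re

# A few Linux errno values, hardcoded on purpose (other platforms number
# these differently). Only an illustrative subset is listed.
LINUX_ERRNO = {5: "EIO", 16: "EBUSY", 107: "ENOTCONN", 110: "ETIMEDOUT"}

# Matches the trailing "rc <status>/<transno-ish field>" of a request dump.
RC_RE = re.compile(r"\brc (-?\d+)/(-?\d+)\s*$")


def reply_rc(line):
    """Return (rc, errno_name) for a request log line, or None if no rc field."""
    m = RC_RE.search(line)
    if m is None:
        return None
    rc = int(m.group(1))
    return rc, LINUX_ERRNO.get(-rc, "unknown")


if __name__ == "__main__":
    sample = ("LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) "
              "@@@ processing error (-107) req@ffff8106a3c46400 "
              "x1415646605273905/t0 ref 1 fl Interpret:H/0/0 rc -107/0")
    print(reply_rc(sample))
```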

In some cases we can find the bad client and reboot it, or evict it on the servers and let it reconnect, and then the situation recovers.

However, in most cases it is hard to find the bad client, and the error messages keep appearing. If we can't find the bad client, new clients can't reconnect until all clients are rebooted, which is not a good solution.

Is there a good way to easily find the bad client when the above logs appear?

Attachments

20121210_t2s007037_sysrq_t.log.tgz (150 kB, 10/Dec/12 8:22 AM)
20121210_t2s007037.log (241 kB, 10/Dec/12 7:04 AM)

Issue Links

is related to

LU-793 Reconnections should not be refused when there is a request in progress from this client.

Resolved

Activity

Johann Lombardi (Inactive) added a comment - 10/Dec/12 2:48 PM

Ihara, it is safe to use data=writeback since lustre already pushes data to disk before committing, so you already have the ordering guarantee.

Bruno, the stack trace shows that the jbd2 thread in charge of commit is waiting for some dirty pages to be flushed, which should never happen on the OSS. The issue is that we wait for commit with the pages locked, so there is a deadlock between the service threads and the jbd2 thread. Therefore, we should try to understand how we can end up with dirty pages in the page cache.

Bruno Faccini (Inactive) added a comment - 10/Dec/12 1:41 PM

BTW, LU-1219 is still waiting for the Alt+SysRq+T logs you provided there!

Strangely, the SysRq output only shows 11 running task stacks for your 12-core OSS! But this may be because (by option?) the swapper/idle task stacks are not dumped.

I agree with you Johann, task/pid 16413 is the one blocking all the others, but don't you think there could be some issue on the disk/storage/back-end side?

Shuichi Ihara (Inactive) added a comment - 10/Dec/12 12:53 PM

Johann,
On a standard ext3/4 filesystem, data=writeback gives no ordering guarantee (sometimes the journal may commit before the data is flushed). So is data=writeback safe with Lustre? Is there no re-ordering even when writeback mode is enabled on the OST/MDT?
https://bugzilla.lustre.org/show_bug.cgi?id=21406 ... Why isn't data=writeback mode the default option in Lustre even today?

Johann Lombardi (Inactive) added a comment - 10/Dec/12 10:56 AM

This might be same problem? http://jira.whamcloud.com/browse/LU-1219

Yes, it looks similar.

Also, data=writeback might help to prevent this kind of problem?

Yes, although i really would like to understand how we can end up with dirty pages in the inode mapping ...

Shuichi Ihara (Inactive) added a comment - 10/Dec/12 10:34 AM

This might be same problem? http://jira.whamcloud.com/browse/LU-1219
Also, data=writeback might help to prevent this kind of problem?

Johann Lombardi (Inactive) added a comment - 10/Dec/12 9:02 AM - edited

 jbd2/dm-0-8   D ffff8101d86aa860     0 16413    247         16414 16412 (L-TLB)
  ffff8102dc6edb90 0000000000000046 0000000000000282 0000000000000008
  ffff8101b3c483c0 000000000000000a ffff81060d6a1860 ffff8101d86aa860
  0017bf15f0c46f86 0000000000000be8 ffff81060d6a1a48 0000000a0afb26b8
 Call Trace:
  [<ffffffff8006ece7>] do_gettimeofday+0x40/0x90
  [<ffffffff8005a40e>] getnstimeofday+0x10/0x29
  [<ffffffff80028bd3>] sync_page+0x0/0x42
  [<ffffffff800637de>] io_schedule+0x3f/0x67
  [<ffffffff80028c11>] sync_page+0x3e/0x42
  [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
  [<ffffffff8003f9ab>] __lock_page+0x5e/0x64
  [<ffffffff800a34e5>] wake_bit_function+0x0/0x23
  [<ffffffff80047c5b>] pagevec_lookup_tag+0x1a/0x21
  [<ffffffff8001d035>] mpage_writepages+0x14f/0x37d
  [<ffffffff88a87bc0>] :ldiskfs:ldiskfs_writepage+0x0/0x3a0
  [<ffffffff800a34c0>] autoremove_wake_function+0x9/0x2e
  [<ffffffff8008d2a9>] __wake_up_common+0x3e/0x68
  [<ffffffff88a622b4>] :jbd2:jbd2_journal_commit_transaction+0x36c/0x1120
  [<ffffffff8004ad55>] try_to_del_timer_sync+0x7f/0x88
  [<ffffffff88a6623e>] :jbd2:kjournald2+0x9a/0x1ec
  [<ffffffff800a34b7>] autoremove_wake_function+0x0/0x2e
  [<ffffffff88a661a4>] :jbd2:kjournald2+0x0/0x1ec
  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032654>] kthread+0xfe/0x132
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032556>] kthread+0x0/0x132
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

hm, this reminds me of https://bugzilla.lustre.org/show_bug.cgi?id=21406#c75 which can happen if we somehow leave dirty pages in the OSS page cache (which shouldn't be the case) and the jbd2 thread tries to flush them.
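
For anyone repeating this analysis on a full sysrq-t dump, the sketch below lists every task in uninterruptible sleep (state D) together with its call-trace symbols, which is how a blocked thread such as the jbd2 one above stands out. It is an illustration under stated assumptions, not a tool from this ticket: the function name `blocked_tasks` is hypothetical, and the regular expressions assume the RHEL5-era task header layout seen in the trace above (name, state, 16-digit address, free stack, pid).

```python
import re

# Task header as printed above, e.g.
#   jbd2/dm-0-8   D ffff8101d86aa860     0 16413    247 ...
# search() is used so an optional syslog prefix before the name is tolerated.
TASK_RE = re.compile(
    r"(?P<name>\S+)\s+(?P<state>[RSDTZX])\s+[0-9a-f]{16}\s+\d+\s+(?P<pid>\d+)\s"
)
# Call-trace frame, e.g. "[<ffffffff800637de>] io_schedule+0x3f/0x67".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<sym>\S+)")


def blocked_tasks(lines):
    """Return a list of {"name", "pid", "frames"} dicts for tasks in state D."""
    tasks = []
    current = None  # the D-state task whose frames are being collected
    for line in lines:
        task = TASK_RE.search(line)
        if task:
            current = None
            if task["state"] == "D":
                current = {"name": task["name"],
                           "pid": int(task["pid"]),
                           "frames": []}
                tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append(frame["sym"])
    return tasks


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as fh:
        for t in blocked_tasks(fh):
            print(t["pid"], t["name"], " <- ".join(t["frames"][:5]))
```

Many service threads in state D whose traces all end in a journal wait, plus one jbd2 thread stuck in `__lock_page`, would match the pattern discussed in this comment.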

Shuichi Ihara (Inactive) added a comment - 10/Dec/12 8:22 AM

this is OSS's sysrq-t output that we got right now.

Johann Lombardi (Inactive) added a comment - 10/Dec/12 8:09 AM

full OSS's messages attached.

Ihara, threads are stuck waiting for commit. Any chance to collect the output of a sysrq-t (or even better a crash dump)?

Shuichi Ihara (Inactive) added a comment - 10/Dec/12 7:04 AM

full OSS's messages attached.

Shuichi Ihara (Inactive) added a comment - 10/Dec/12 7:02 AM

I have seen the "still busy with x active RPCs" problem a couple of times and posted it here as a general question.
But just now we hit the same problem at one of our customers. I think there should be a root cause, but I want to find which client is holding the stuck RPCs. Can we find the bad client from the following logs on the OSS?

# grep "still busy" 20121210_t2s007037.log 
Dec 10 19:15:28 t2s007037 kernel: Lustre: 16504:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 448ded8b-6867-b4e3-b095-24a1194a0311@192.168.20.53@tcp1 to 0xffff81060f828e00; still busy with 4 active RPCs
Dec 10 19:15:28 t2s007037 kernel: Lustre: 20370:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 98aabcbb-79bf-0dd8-3a0e-f869054aa095@192.168.19.31@tcp1 to 0xffff81028b12ba00; still busy with 4 active RPCs
Dec 10 19:15:28 t2s007037 kernel: Lustre: 5499:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 1e0c4bbc-b2a9-1268-afaf-811307e85c34@192.168.19.80@tcp1 to 0xffff81006d77b600; still busy with 3 active RPCs
Dec 10 19:15:31 t2s007037 kernel: Lustre: 5534:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 8e966893-d9e9-3508-a406-c2132095af5f@10.1.10.84@o2ib to 0xffff81018682a200; still busy with 8 active RPCs
Dec 10 19:15:33 t2s007037 kernel: Lustre: 16481:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from c2b698fa-a4d9-ff0c-6dc5-298134339777@192.168.19.50@tcp1 to 0xffff8100d2fb0400; still busy with 8 active RPCs
...
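
One way to get an overview from a log like this is to tally the refusals per target and client NID. The sketch below is an illustration only, not something posted in this ticket: the function name `summarize_refusals` is hypothetical, and the regular expression assumes the exact message wording shown in the lines above. Note that such a tally shows which clients are being refused, not which one is at fault; refusals spread across many clients and networks, as here, may instead point at a server-side stall.

```python
import re
from collections import defaultdict

# Matches the "refuse reconnection" message printed by target_handle_connect(),
# using the wording seen in the log lines above.
REFUSE_RE = re.compile(
    r"(?P<target>\S+): refuse reconnection from "
    r"(?P<uuid>[0-9a-f-]+)@(?P<nid>\S+) to \S+; "
    r"still busy with (?P<rpcs>\d+) active RPCs"
)


def summarize_refusals(lines):
    """Return {(target, nid): {"count": n, "max_rpcs": m}} for matching lines."""
    stats = defaultdict(lambda: {"count": 0, "max_rpcs": 0})
    for line in lines:
        m = REFUSE_RE.search(line)
        if m is None:
            continue
        entry = stats[(m["target"], m["nid"])]
        entry["count"] += 1
        entry["max_rpcs"] = max(entry["max_rpcs"], int(m["rpcs"]))
    return dict(stats)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as fh:
        summary = summarize_refusals(fh)
    # Most frequently refused clients first.
    for (target, nid), s in sorted(summary.items(),
                                   key=lambda kv: -kv[1]["count"]):
        print(f"{target} {nid}: refused {s['count']}x, "
              f"up to {s['max_rpcs']} active RPCs")
```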

Johann Lombardi (Inactive) added a comment - 07/Dec/12 2:21 AM

Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616).

While i agree that we should consider removing this protection, i think we first need to understand how a service thread can be stuck forever as reported by Ihara.

Ihara, there should definitely be some watchdogs printed on the console. It would be very helpful if you could provide us with those logs. Otherwise, there is not much we can do, i'm afraid.

People

Assignee: Bruno Faccini (Inactive)

Reporter: Shuichi Ihara (Inactive)

Votes: 0

Watchers: 7

Dates

Created: 04/Dec/12 10:37 PM

Updated: 23/Feb/13 1:21 AM

Resolved: 23/Feb/13 1:21 AM