[LU-2779] LBUG in discard_cb: !(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page))) Created: 07/Feb/13 Updated: 07/Aug/15 Resolved: 25/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | sequoia, topsequoia |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6732 |
| Description |
|
We hit this on Sequoia during shutdown; I don't see an existing bug open for this crash either:
2013-02-01 14:35:53.812430 {R4-llnl} [bgqadmin]{5}.3.1:
2013-02-01 14:35:53.812843 {R4-llnl} [bgqadmin]{5}.3.1: Broadcast message from root@seqio262-ib0
2013-02-01 14:35:53.813165 {R4-llnl} [bgqadmin]{5}.3.1: (unknown) at 14:35 ...
2013-02-01 14:35:53.813752 {R4-llnl} [bgqadmin]{5}.3.1: The system is going down for halt NOW!
2013-02-01 14:35:53.814093 {R4-llnl} [bgqadmin]{5}.2.3: Stopping Common I/O Services: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed:
2013-02-01 14:35:53.814429 {R4-llnl} [bgqadmin]{5}.2.3: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) LBUG
2013-02-01 14:35:53.814746 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
2013-02-01 14:35:53.815076 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b40] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
2013-02-01 14:35:53.815397 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392bf0] [8000000000ab0c88] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
2013-02-01 14:35:53.815717 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab1450] .lbug_with_loc+0x50/0xc0 [libcfs]
2013-02-01 14:35:53.816042 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
2013-02-01 14:35:53.816392 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
2013-02-01 14:35:53.816732 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
2013-02-01 14:35:53.817047 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
2013-02-01 14:35:53.817363 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
2013-02-01 14:35:53.817877 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
2013-02-01 14:35:53.818248 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
2013-02-01 14:35:53.818565 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
2013-02-01 14:35:53.818901 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
2013-02-01 14:35:53.819221 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
2013-02-01 14:35:53.819535 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
2013-02-01 14:35:53.819908 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
2013-02-01 14:35:53.820387 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
2013-02-01 14:35:53.820707 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
2013-02-01 14:35:53.821021 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
2013-02-01 14:35:53.821337 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
2013-02-01 14:35:53.821652 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
2013-02-01 14:35:53.821966 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
2013-02-01 14:35:53.822282 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
2013-02-01 14:35:53.822595 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
2013-02-01 14:35:53.822909 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
2013-02-01 14:35:53.823222 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
2013-02-01 14:35:53.823534 {R4-llnl} [bgqadmin]{5}.2.3: Kernel panic - not syncing: LBUG
2013-02-01 14:35:53.823844 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
2013-02-01 14:35:53.824153 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b60] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
2013-02-01 14:35:53.824466 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392c10] [c000000000431ef4] .panic+0x80/0x1ac
2013-02-01 14:35:53.824776 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab14b0] .lbug_with_loc+0xb0/0xc0 [libcfs]
2013-02-01 14:35:53.825089 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
2013-02-01 14:35:53.825401 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
2013-02-01 14:35:53.825721 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
2013-02-01 14:35:53.826045 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
2013-02-01 14:35:53.826358 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
2013-02-01 14:35:53.826670 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
2013-02-01 14:35:53.826982 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
2013-02-01 14:35:53.827295 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
2013-02-01 14:35:53.827608 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
2013-02-01 14:35:53.827920 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
2013-02-01 14:35:53.828232 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
2013-02-01 14:35:53.828649 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
2013-02-01 14:35:53.829092 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
2013-02-01 14:35:53.829361 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
2013-02-01 14:35:53.829629 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
2013-02-01 14:35:53.829892 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
2013-02-01 14:35:53.830155 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
2013-02-01 14:35:53.830420 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
2013-02-01 14:35:53.830686 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
2013-02-01 14:35:53.830952 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
2013-02-01 14:35:53.831217 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
2013-02-01 14:35:53.831483 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
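For readers new to this assertion: it states that a cacheable client page must not still be under writeback when a lock's pages are discarded. A minimal C sketch of the equivalent check, written for this report from the console message above (the exact macro in cl_lock.c may differ — later comments mention KLASSERT — and "ergo(A, B)" is libcfs shorthand for "!A || B"):

/* discard_cb() sketch: a CPT_CACHEABLE page must have no writeback in
 * flight at the moment it is discarded, otherwise the client would drop
 * a page whose IO RPC is still outstanding. */
LASSERT(ergo(page->cp_type == CPT_CACHEABLE,
             !PageWriteback(cl_page_vmpage(env, page))));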
|
| Comments |
| Comment by Bruno Faccini (Inactive) [ 08/Feb/13 ] |
|
This assert looks familiar to me since I triggered it while working on another issue. Will try to first do some historical research to understand the conditions leading to this problem. |
| Comment by Prakash Surya (Inactive) [ 11/Feb/13 ] |
|
This might end up being related to, or a duplicate of, another open ticket. For example, I see these messages:
2013-02-04 10:36:46.046647 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:329:ptlrpc_invalidate_import()) ls1-OST0132_UUID: rc = -110 waiting for callback (1 != 0)
2013-02-04 10:36:46.047324 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:329:ptlrpc_invalidate_import()) Skipped 91 previous similar messages
2013-02-04 10:36:46.047897 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:355:ptlrpc_invalidate_import()) @@@ still on sending list req@c0000002818f5000 x1425746435123190/t0(0) o101->ls1-OST0132-osc-c0000003e98eb000@172.20.2.106@o2ib500:28/4 lens 328/368 e 0 to 0 dl 1359967526 ref 1 fl Interpret:RE/0/0 rc -5/301
2013-02-04 10:36:46.048227 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:355:ptlrpc_invalidate_import()) Skipped 98 previous similar messages
2013-02-04 10:36:46.048870 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:371:ptlrpc_invalidate_import()) ls1-OST0132_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
2013-02-04 10:36:46.049450 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:371:ptlrpc_invalidate_import()) Skipped 101 previous similar messages
And many other network/recovery related messages in the logs for hours before the LBUG was triggered. |
| Comment by Christopher Morrone [ 12/Feb/13 ] |
|
We've been hitting this more frequently recently, on login nodes in particular. |
| Comment by Prakash Surya (Inactive) [ 12/Feb/13 ] |
|
Chris, do we see this in the absence of evictions? |
| Comment by Prakash Surya (Inactive) [ 12/Feb/13 ] |
|
Jinshan, when you get some time, can you look over this ticket? We've been hitting this constantly in production over the past couple of days. |
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ] |
|
I'm looking at this one. |
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ] |
|
I have figured out the problem; I will work out a patch tomorrow. |
| Comment by Bruno Faccini (Inactive) [ 13/Feb/13 ] |
|
Sorry to be late on this, but it took me some time to retrieve the info from when I hit this bug weeks ago... In fact, I have already seen it when I tried to shut down nodes with hung threads, with the following kind of stack I was able to dump:
==============================================================================================
Nov 27 01:09:17 client-15 kernel: reproducer S 0000000000000001 0 8231 1 0x00000080
Nov 27 01:09:17 client-15 kernel: ffff8802e399d538 0000000000000082 ffff8802e399d538 ffffffff814fea8a
Nov 27 01:09:17 client-15 kernel: ffff8802e399d4b8 0000000000000286 000000000000001c 00000000000000ff
Nov 27 01:09:17 client-15 kernel: ffff88032f0e3af8 ffff8802e399dfd8 000000000000fb88 ffff88032f0e3af8
Nov 27 01:09:17 client-15 kernel: Call Trace:
Nov 27 01:09:17 client-15 kernel: [<ffffffff814fea8a>] ? schedule_timeout+0x19a/0x2e0
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e0d77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05f6d27>] osc_extent_wait+0x5f7/0x6c0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05f714c>] osc_cache_wait_range+0x35c/0x9a0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f85a3e>] ? cl_object_put+0xe/0x10 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05f38b8>] ? osc_io_unplug0+0x748/0x12a0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffff8127ce66>] ? vsnprintf+0x2b6/0x5f0
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05fc731>] osc_cache_writeback_range+0xd61/0x1160 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e0c27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e22d97>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e222d2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f8709d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e25cbb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05e40f6>] osc_lock_flush+0x86/0x200 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05e4357>] osc_lock_cancel+0xe7/0x1c0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f8d0a5>] cl_lock_cancel0+0x75/0x160 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f8ece6>] cl_lock_hold_release+0x1c6/0x2b0 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f90837>] cl_lock_unhold+0x37/0x130 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa067e5f8>] lov_sublock_release+0x1a8/0x2a0 [lov]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa06811fb>] lov_lock_enqueue+0x2cb/0x830 [lov]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f9149c>] cl_enqueue_try+0xfc/0x310 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f9295d>] cl_enqueue_locked+0x6d/0x210 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f9363e>] cl_lock_request+0x7e/0x280 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0be3deb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0be4677>] cl_glimpse_size0+0x187/0x190 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0ba0ee2>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e22134>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0ba0ff9>] ll_getattr_it+0x49/0x170 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0ba1157>] ll_getattr+0x37/0x40 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffff812143d3>] ? security_inode_getattr+0x23/0x30
Nov 27 01:09:17 client-15 kernel: [<ffffffff81180571>] vfs_getattr+0x51/0x80
Nov 27 01:09:17 client-15 kernel: [<ffffffff8118082f>] vfs_fstat+0x3f/0x60
Nov 27 01:09:17 client-15 kernel: [<ffffffff81180874>] sys_newfstat+0x24/0x40
Nov 27 01:09:17 client-15 kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Nov 27 01:09:17 client-15 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
==============================================================================================
before the node crashed during shutdown, within the same task context, with the following stack, which is the same one you got:
LustreError: 8231:0:(cl_lock.c:1960:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed:
LustreError: 8231:0:(cl_lock.c:1960:discard_cb()) LBUG
Pid: 8231, comm: reproducer
Call Trace:
[<ffffffffa0e0c905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0e0cf17>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0f8f66c>] discard_cb+0x1ec/0x1f0 [obdclass]
[<ffffffffa0f8c864>] cl_page_gang_lookup+0x1f4/0x400 [obdclass]
[<ffffffffa0f8f480>] ? discard_cb+0x0/0x1f0 [obdclass]
[<ffffffffa0f8f480>] ? discard_cb+0x0/0x1f0 [obdclass]
[<ffffffffa0f8f33e>] cl_lock_discard_pages+0x11e/0x1f0 [obdclass]
[<ffffffffa05e4180>] osc_lock_flush+0x110/0x200 [osc]
[<ffffffffa05e4357>] osc_lock_cancel+0xe7/0x1c0 [osc]
[<ffffffffa0f8d0a5>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa0f8ece6>] cl_lock_hold_release+0x1c6/0x2b0 [obdclass]
[<ffffffffa0f90837>] cl_lock_unhold+0x37/0x130 [obdclass]
[<ffffffffa067e5f8>] lov_sublock_release+0x1a8/0x2a0 [lov]
[<ffffffffa06811fb>] lov_lock_enqueue+0x2cb/0x830 [lov]
[<ffffffffa0f9149c>] cl_enqueue_try+0xfc/0x310 [obdclass]
[<ffffffffa0f9295d>] cl_enqueue_locked+0x6d/0x210 [obdclass]
[<ffffffffa0f9363e>] cl_lock_request+0x7e/0x280 [obdclass]
[<ffffffffa0be3deb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
[<ffffffffa0be4677>] cl_glimpse_size0+0x187/0x190 [lustre]
[<ffffffffa0ba0ee2>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
[<ffffffffa0e22134>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa0ba0ff9>] ll_getattr_it+0x49/0x170 [lustre]
[<ffffffffa0ba1157>] ll_getattr+0x37/0x40 [lustre]
[<ffffffff812143d3>] ? security_inode_getattr+0x23/0x30
[<ffffffff81180571>] vfs_getattr+0x51/0x80
[<ffffffff8118082f>] vfs_fstat+0x3f/0x60
[<ffffffff81180874>] sys_newfstat+0x24/0x40
[<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
I confirm that there were also recent messages, like you have, complaining about "Unregistering" RPCs and a possibly sluggish network.
Can you confirm whether you have the same kind of stacks for hung threads in the crash-dumps from previous occurrences? If yes, Jinshan, could it be possible that we handle again pages that should already have been discarded but are still stuck (orphaned?) awaiting write-back?
BTW, I got this LBUG while trying to reproduce another issue. Is it possible for you to run my reproducer? And also, it would help to have the "dlmtrace", "cache" and "page" debug flags set in addition on client nodes likely to reproduce. |
| Comment by Bruno Faccini (Inactive) [ 13/Feb/13 ] |
|
Jinshan, just some thoughts after re-reading the source code involved and talking about it with Johann: don't you think it would be much safer to call cl_page_own() BEFORE any ASSERTs in discard_cb()? It seems to me this would avoid the situation where osc_cache_writeback_range() exits with an error (timeout?) that is not checked in osc_lock_flush(); osc_lock_flush() then calls cl_lock_discard_pages()/cl_page_gang_lookup()/discard_cb(), re-walks the same pages, and finds some that trigger the assert. |
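To make the suggestion concrete, here is a rough, hand-written sketch of the reordering Bruno describes. It is not the actual Lustre source or the eventual patch; the callback signature and the io variable are assumptions based on the function names cited above:

/* Hypothetical discard_cb() with ownership taken first.  cl_page_own()
 * waits until this thread may manipulate the page (e.g. writeback has
 * completed) or reports that the page is gone, so the writeback assertion
 * is only evaluated on a page we actually own. */
static int discard_cb_sketch(const struct lu_env *env, struct cl_io *io,
                             struct cl_page *page)
{
        if (cl_page_own(env, io, page) != 0)
                return 0;       /* page freed or owned elsewhere: skip it */

        LASSERT(ergo(page->cp_type == CPT_CACHEABLE,
                     !PageWriteback(cl_page_vmpage(env, page))));

        cl_page_discard(env, io, page);
        cl_page_disown(env, io, page);
        return 0;
}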
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ] |
|
Hi Bruno, you're right about the code. Obviously the process waiting for the IO to finish received a signal from init because the system is shutting down, so yes, this problem can be fixed by taking away this LASSERT - this is what I planned to do last night. However, the process will still have to wait for page writeback to finish at cl_page_own(), and it will hang forever if the OST is already shut down at that moment. Therefore you have to hit the power button to shut it down. This led me to think about how to shut down a client node gracefully. Obviously we should notify ptlrpcd threads of this event, then abort all of the write RPCs with an error, so that waiting processes can be notified and the mount point can be cleaned up. This is out of the scope of this issue. BTW, I pushed a new patch for the other issue. |
| Comment by Prakash Surya (Inactive) [ 13/Feb/13 ] |
|
I also want to note that we hit this on the login nodes without a manual reboot or restart. There are traces of network issues (perhaps related) in the logs.
Bruno, I can probably get some time to run your reproducer, but I'm not sure it will be enlightening in any way.
Jinshan, I don't see an updated patch set at http://review.whamcloud.com/5208 yet (still patch set 2); have you pushed the new patch somewhere else? |
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ] |
|
The only path I can think of right now is that osc_extent_wait() is interrupted so that it can't wait for the IO to finish. Can you try to reproduce it with DLMTRACE and CACHE enabled on the client side? |
| Comment by Prakash Surya (Inactive) [ 13/Feb/13 ] |
|
I can try. Although it's not dumping a Lustre debug log when it asserts, so I need to remind myself whether it is supposed to or not. Also, it's hitting without any messages to the console in the nearly two hours prior to the crash. So maybe this isn't as closely related to evictions as I previously thought. |
| Comment by Prakash Surya (Inactive) [ 13/Feb/13 ] |
|
I do see a couple of messages from osc_extent_wait nearly two hours before one specific crash:
2013-02-13 01:45:43 LustreError: 47189:0:(osc_cache.c:896:osc_extent_wait()) extent c000000c3d8c2990@{[0 -> 0/15], [3|0|+|rpc|wihY|c000000c08582450], [65536|1|+|-|c000000bf0414730|16|c000000f53444090]} lsa-OST0101-osc-c000000f544e2180: wait ext to 0 timedout, recovery in progress?
2013-02-13 01:45:48 LustreError: 3843:0:(osc_cache.c:896:osc_extent_wait()) extent c000000c3d8c6158@{[0 -> 0/15], [3|0|+|rpc|wihY|c000000c08581a30], [65536|1|+|-|c000000bf041fe10|16|c000000f37c8b4e0]} lsa-OST00eb-osc-c000000f544e2180: wait ext to 0 timedout, recovery in progress?
Not sure if that's related though, as I haven't looked at the code to decipher its meaning yet. |
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ] |
|
Please apply this patch: http://review.whamcloud.com/5419 and monitor the console while you're testing, thanks. |
| Comment by Prakash Surya (Inactive) [ 13/Feb/13 ] |
|
OK. I'll pull that into a tag and see if I can get it installed soon. It will be going onto one of our production machines, so I just want to check that it's safe to remove that assertion, right? |
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ] |
|
won't be worse. |
| Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ] |
|
Prakash, right, don't waste time running the reproducer. Jinshan, http://review.whamcloud.com/5208 patch #2 testing is running now; will update you on how it works. |
| Comment by Prakash Surya (Inactive) [ 19/Feb/13 ] |
|
Jinshan, I've attached "console.rzuseqlac2.bz2" which contains console output from a node running with 5419. There are many messages in the logs that are similar to what has been previously reported. |
| Comment by Bruno Faccini (Inactive) [ 25/Feb/13 ] |
|
Jinshan, in the last trace from Prakash, I find only ETIMEDOUT/-110 msgs from osc_extent_wait()!
Prakash, it would also have been interesting to have a crash-dump/Alt-SysRq-T output to see if there were hung tasks in cl_page_own() (or better, in the 2nd call to l_wait_event() in osc_extent_wait()), which can already be highly suspected since you have not been able to shut down gracefully.
But anyway, clearly during error or time-out handling we need to avoid the LBUG and at least wait or retry forever, or decide what to do with these orphan pages. |
| Comment by Jinshan Xiong (Inactive) [ 25/Feb/13 ] |
|
After reading the backtrace, I realize this is the same problem as another issue I have been looking at. Just in case, did you apply both patches 5419 and 5208 in your test? Sorry for the delayed response, BTW. |
| Comment by Bruno Faccini (Inactive) [ 25/Feb/13 ] |
|
Jinshan, ptlrpcd being blocked can explain/cause the time-out; is that what you mean? But anyway, in all cases don't you think we need to wait/retry rather than trigger the LBUG/assert? Also, your latest patch for ... |
| Comment by Jinshan Xiong (Inactive) [ 26/Feb/13 ] |
|
Yes, it is. Anyway, I've understood the root cause of this problem, including why it hit the assertion. I'm going to work out a new patch to address the LASSERT problem. |
| Comment by Prakash Surya (Inactive) [ 26/Feb/13 ] |
|
Jinshan, the attached console was from a machine running the 2.3.58-13chaos tag which does have both 5419 and 5208 applied. |
| Comment by Jinshan Xiong (Inactive) [ 27/Feb/13 ] |
|
I saw some suspicious backtraces in the console messages. It would be really useful if you could collect backtraces of all processes in the system. Also, did you see any suspicious messages on the corresponding OSS?
The OSS ran into problems:
2013-02-16 01:37:23 LustreError: 11-0: lsa-OST0048-osc-c000000f546f0600: Communicating with 172.16.69.25@tcp1, operation obd_ping failed with -107
2013-02-16 01:37:23 Lustre: lsa-OST0048-osc-c000000f546f0600: Connection to lsa-OST0048 (at 172.16.69.25@tcp1) was lost; in progress operations using this service will wait for recovery to complete
2013-02-16 01:37:23 LustreError: 167-0: lsa-OST0048-osc-c000000f546f0600: This client was evicted by lsa-OST0048; in progress operations using this service will fail.
And backtraces from the console:
2013-02-16 01:39:03 INFO: task ldlm_bl_13:28757 blocked for more than 120 seconds.
2013-02-16 01:39:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-02-16 01:39:03 ldlm_bl_13 D 0000000000000000 0 28757 2 0x00008000
2013-02-16 01:39:03 Call Trace:
2013-02-16 01:39:03 [c000000f3d813550] [c000000f3d813600] 0xc000000f3d813600 (unreliable)
2013-02-16 01:39:03 [c000000f3d813720] [c0000000000142d8] .__switch_to+0xf8/0x1d0
2013-02-16 01:39:03 [c000000f3d8137b0] [c0000000005bb0a8] .schedule+0x3f8/0xd30
2013-02-16 01:39:03 [c000000f3d813ab0] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
2013-02-16 01:39:03 [c000000f3d813b90] [c0000000005bd42c] .mutex_lock+0x5c/0x60
2013-02-16 01:39:03 [c000000f3d813c10] [d00000000f292398] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
2013-02-16 01:39:03 [c000000f3d813ca0] [d0000000111ecf98] .osc_ldlm_blocking_ast+0x98/0x4e0 [osc]
2013-02-16 01:39:03 [c000000f3d813d80] [d00000001095b440] .ldlm_handle_bl_callback+0x1b0/0x7d0 [ptlrpc]
2013-02-16 01:39:03 [c000000f3d813e40] [d00000001095bdd0] .ldlm_bl_thread_main+0x370/0x610 [ptlrpc]
2013-02-16 01:39:03 [c000000f3d813f90] [c000000000032fd4] .kernel_thread+0x54/0x70
2013-02-16 01:39:03 INFO: task ldlm_bl_14:28758 blocked for more than 120 seconds.
2013-02-16 01:39:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-02-16 01:39:03 ldlm_bl_14 D 0000000000000000 0 28758 2 0x00008000
2013-02-16 01:39:03 Call Trace:
2013-02-16 01:39:03 [c000000dadbfb550] [c000000dadbfb600] 0xc000000dadbfb600 (unreliable)
2013-02-16 01:39:03 [c000000dadbfb720] [c0000000000142d8] .__switch_to+0xf8/0x1d0
2013-02-16 01:39:03 [c000000dadbfb7b0] [c0000000005bb0a8] .schedule+0x3f8/0xd30
2013-02-16 01:39:03 [c000000dadbfbab0] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
2013-02-16 01:39:03 [c000000dadbfbb90] [c0000000005bd42c] .mutex_lock+0x5c/0x60
2013-02-16 01:39:03 [c000000dadbfbc10] [d00000000f292398] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
2013-02-16 01:39:03 [c000000dadbfbca0] [d0000000111ecf98] .osc_ldlm_blocking_ast+0x98/0x4e0 [osc]
2013-02-16 01:39:03 [c000000dadbfbd80] [d00000001095b440] .ldlm_handle_bl_callback+0x1b0/0x7d0 [ptlrpc]
2013-02-16 01:39:03 [c000000dadbfbe40] [d00000001095bdd0] .ldlm_bl_thread_main+0x370/0x610 [ptlrpc]
2013-02-16 01:39:03 [c000000dadbfbf90] [c000000000032fd4] .kernel_thread+0x54/0x70
2013-02-16 01:39:03 INFO: task atsmercury_back:21298 blocked for more than 120 seconds.
2013-02-16 01:39:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-02-16 01:39:03 atsmercury_ba D 00000080538c8c60 0 21298 21297 0x00008000
2013-02-16 01:39:03 Call Trace:
2013-02-16 01:39:03 [c0000005af6b2ed0] [c0000005af6b2f80] 0xc0000005af6b2f80 (unreliable)
2013-02-16 01:39:03 [c0000005af6b30a0] [c0000000000142d8] .__switch_to+0xf8/0x1d0
2013-02-16 01:39:03 [c0000005af6b3130] [c0000000005bb0a8] .schedule+0x3f8/0xd30
2013-02-16 01:39:03 [c0000005af6b3430] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
2013-02-16 01:39:03 [c0000005af6b3510] [c0000000005bd42c] .mutex_lock+0x5c/0x60
2013-02-16 01:39:03 [c0000005af6b3590] [d00000000f292398] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
2013-02-16 01:39:03 [c0000005af6b3620] [d00000000f295c7c] .cl_lock_hold_mutex+0x11c/0x960 [obdclass]
2013-02-16 01:39:03 [c0000005af6b3750] [d00000000f297d60] .cl_lock_request+0xc0/0x370 [obdclass]
2013-02-16 01:39:03 [c0000005af6b3820] [d00000001369fb94] .cl_glimpse_lock+0x2b4/0x640 [lustre]
2013-02-16 01:39:03 [c0000005af6b3910] [d0000000136a0118] .cl_glimpse_size0+0x1f8/0x270 [lustre]
2013-02-16 01:39:03 [c0000005af6b39e0] [d0000000136315f0] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
2013-02-16 01:39:03 [c0000005af6b3aa0] [d0000000136316d0] .ll_getattr_it+0x40/0x180 [lustre]
2013-02-16 01:39:03 [c0000005af6b3b40] [d000000013631854] .ll_getattr+0x44/0x60 [lustre]
2013-02-16 01:39:03 [c0000005af6b3bf0] [c0000000001c8fe4] .vfs_getattr+0x74/0xf0
2013-02-16 01:39:03 [c0000005af6b3c90] [c0000000001c90e0] .vfs_fstatat+0x80/0xb0
2013-02-16 01:39:03 [c0000005af6b3d30] [c0000000001c9274] .SyS_newlstat+0x24/0x50
2013-02-16 01:39:03 [c0000005af6b3e30] [c000000000008564] syscall_exit+0x0/0x40
2013-02-16 01:39:03 LustreError: 21507:0:(import.c:329:ptlrpc_invalidate_import()) lsa-OST0048_UUID: rc = -110 waiting for callback (1 != 0) |
| Comment by Jinshan Xiong (Inactive) [ 27/Feb/13 ] |
|
From the stack trace above, the blocking_ast was blocked acquiring the cl_lock mutex, so the reply of the write RPC shouldn't be blocked by it, because a different ptlrpc portal should be used. The status on the OST is also very important. |
| Comment by Prakash Surya (Inactive) [ 27/Feb/13 ] |
|
Here are the only messages on the OSS console I see regarding this client on 2013-02-16:
2013-02-16 01:36:56 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 84460s: evicting client at 172.21.1.62@o2ib200 ns: filter-lsa-OST0048_UUID lock: ffff880078fbf800/0x63cba3a8ff7a6537 lrc: 3/0,0 mode: PW/PW res: 19309611/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->65535) flags: 0x10020 remote: 0x4a6511c74b338406 expref: 6 pid: 3504 timeout 19649874112
2013-02-16 01:41:10 Lustre: lsa-OST0048: haven't heard from client a15eb878-e92e-c8f0-0c41-9520d958df72 (at 172.21.1.62@o2ib200) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8801102ff000, cur 1361007670 expire 1361007520 last 1361007443 |
| Comment by Jinshan Xiong (Inactive) [ 03/Mar/13 ] |
|
Hi Prakash, Bruno will try to reproduce this issue locally. Can you please reproduce this issue and collect backtraces of all processes? Also, please apply patch 2638 and patch set 2 of 5419 when doing the test; thanks in advance. |
| Comment by Bruno Faccini (Inactive) [ 06/Mar/13 ] |
|
Jinshan, Prakash: since the end of last week I have been doing intensive testing trying to reproduce this. Prakash, maybe you can help too if you can characterize the behavior of the workload that was running at the time of the crashes? |
| Comment by Adrian Ulrich (Inactive) [ 26/Apr/13 ] |
|
We recently started to hit this issue (3 crashes so far) while running the git-v2_3_61 client. I could upload the dumped `vmcore' from (at least) two crashes if someone would like to have a look at them. |
| Comment by Peter Jones [ 26/Apr/13 ] |
|
Yes please Adrian! |
| Comment by Adrian Ulrich (Inactive) [ 26/Apr/13 ] |
|
Does Whamcloud have a private/write-only FTP server to upload the cores (~16GB)? The crash data is from a login node and will therefore include private user data. If not, just give me a PGP key ID and I'll encrypt + upload them to our own webserver. |
| Comment by Peter Jones [ 26/Apr/13 ] |
|
Yes, we have a private FTP area. I will mail you the details directly. |
| Comment by Adrian Ulrich (Inactive) [ 26/Apr/13 ] |
|
Thanks: I started to upload the first core. I'll upload the 2nd vmcore file on Monday (the first one is still uploading). |
| Comment by Peter Jones [ 26/Apr/13 ] |
|
Thanks Adrian! |
| Comment by Jinshan Xiong (Inactive) [ 01/May/13 ] |
|
Hi Adrian, can you please try this patch: commit 2448de6c51ceccea6a308d73d7960f236e0c0847, and see if it helps? |
| Comment by Adrian Ulrich (Inactive) [ 02/May/13 ] |
|
We already have this patch: our RPM was compiled using the v2_3_63_0 tag (which includes commit 2448de6c51ceccea6a308d73d7960f236e0c0847).
Btw: we just had yet another crash, so I could provide yet another 8GB vmcore. |
| Comment by Jinshan Xiong (Inactive) [ 02/May/13 ] |
|
Can you please upload the vmcore along with the kernel modules from the client side? And what's the kernel version? Just in case, the tip of your git tree is f3ef9ea9, right? |
| Comment by Adrian Ulrich (Inactive) [ 02/May/13 ] |
|
I'll upload the latest crash + the lustre *.ko modules to "uploads/...". We are running 2.6.32-358.0.1.el6.x86_64
Uhm: Just noticed that the crashed node was still running 2.3.62 |
| Comment by Jinshan Xiong (Inactive) [ 02/May/13 ] |
|
Thank you. I will take a look at it. |
| Comment by Jinshan Xiong (Inactive) [ 03/May/13 ] |
|
Hi Adrian,
Please check the patch at http://review.whamcloud.com/6262; with this patch, there should be no assertion any more.
However, there are other problems from what I can see in the console messages. The client was evicted by some OSTs, for example this kind of message:
<4>Lustre: nero-OST0006-osc-ffff881c3a770400: Connection to nero-OST0006 (at 10.201.62.37@o2ib) was lost; in progress operations using this service will wait for recovery to complete
<3>LustreError: 167-0: nero-OST0006-osc-ffff881c3a770400: This client was evicted by nero-OST0006; in progress operations using this service will fail.
I need the log on the OST side to know what happened. Also, there are a couple of OOM messages. Can you please show me the output of:
lctl get_param llite.*.max_cached_mb
and
lctl get_param osc.*.cached_mb
on the client side? Thanks! |
| Comment by Adrian Ulrich (Inactive) [ 06/May/13 ] |
|
Thanks for the patch! I'll give it a try.
I know about the eviction and the OOM errors: this is from a login node where people run really ugly stuff (forkbombing themselves, allocating gigabytes of RAM, etc).
lctl output:
$ lctl get_param llite.nero-ffff8804388ba800.max_cached_mb
llite.nero-ffff8804388ba800.max_cached_mb=
users: 32
max_cached_mb: 96767
used_mb: 49481
unused_mb: 47286
reclaim_count: 0
$ lctl get_param osc.*.osc_cached_mb
osc.nero-OST0000-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1474 busy_cnt: 0
osc.nero-OST0001-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1297 busy_cnt: 2
osc.nero-OST0002-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1387 busy_cnt: 1
osc.nero-OST0003-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1611 busy_cnt: 2
osc.nero-OST0004-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1593 busy_cnt: 3
osc.nero-OST0005-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1369 busy_cnt: 1
osc.nero-OST0006-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1361 busy_cnt: 0
osc.nero-OST0007-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1253 busy_cnt: 0
osc.nero-OST0008-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1044 busy_cnt: 29
osc.nero-OST0009-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1358 busy_cnt: 2
osc.nero-OST000a-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1075 busy_cnt: 0
osc.nero-OST000b-osc-ffff8804388ba800.osc_cached_mb=used_mb: 947 busy_cnt: 1
osc.nero-OST000c-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1029 busy_cnt: 1
osc.nero-OST000d-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1015 busy_cnt: 25
osc.nero-OST000e-osc-ffff8804388ba800.osc_cached_mb=used_mb: 872 busy_cnt: 2
osc.nero-OST000f-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1324 busy_cnt: 0
osc.nero-OST0010-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1001 busy_cnt: 1
osc.nero-OST0011-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1651 busy_cnt: 1
osc.nero-OST0012-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1379 busy_cnt: 0
osc.nero-OST0013-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1558 busy_cnt: 3
osc.nero-OST0014-osc-ffff8804388ba800.osc_cached_mb=used_mb: 2142 busy_cnt: 1
osc.nero-OST0015-osc-ffff8804388ba800.osc_cached_mb=used_mb: 2085 busy_cnt: 2
osc.nero-OST0016-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1758 busy_cnt: 0
osc.nero-OST0017-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1167 busy_cnt: 1
osc.nero-OST0018-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1551 busy_cnt: 1
osc.nero-OST0019-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1236 busy_cnt: 1
osc.nero-OST001a-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1200 busy_cnt: 2
osc.nero-OST001b-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1463 busy_cnt: 1
osc.nero-OST001c-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1130 busy_cnt: 1
osc.nero-OST001d-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1304 busy_cnt: 2
osc.nero-OST001e-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1516 busy_cnt: 1
osc.nero-OST001f-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1704 busy_cnt: 0
The relevant logs on the OSS during the eviction:
2013-05-02T17:25:40+02:00 n-oss02 LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 12291s: evicting client at 10.201.32.34@o2ib ns: filter-nero-OST0019_UUID lock: ffff8801b1d256c0/0x35306a2f3a794cd3 lrc: 3/0,0 mode: PW/PW res: 76295681/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x63d4bbb0c510ffb7 expref: 3351 pid: 4333 timeout 27550303246
2013-05-02T17:25:45+02:00 n-oss02 LustreError: 4489:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-107) req@ffff8805f1b0e000 x1431201755470460/t0(0) o400-><?>@<?>:0/0 lens 224/0 e 0 to 0 dl 1367508351 ref 1 fl Interpret:H/0/ffffffff rc -107/-1
2013-05-02T17:38:47+02:00 n-oss02 LustreError: 2752:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
2013-05-02T17:38:47+02:00 n-oss02 LustreError: 2752:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.201.32.34@o2ib (68): c: 0, oc: 0, rc: 8
2013-05-02T17:39:32+02:00 n-oss02 LustreError: 138-a: nero-OST0001: A client on nid 10.201.32.34@o2ib was evicted due to a lock glimpse callback time out: rc -4
2013-05-02T18:47:14+02:00 n-oss02 LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 7026s: evicting client at 10.201.43.27@o2ib ns: filter-nero-OST0001_UUID lock: ffff88047bd866c0/0x35306a2f3c0c65ec lrc: 3/0,0 mode: PW/PW res: 76322164/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80000020 remote: 0xf6d8f9457b07fb6a expref: 176 pid: 3961 timeout 27555197173
2013-05-02T18:47:32+02:00 n-oss02 LustreError: 4288:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-107) req@ffff8805d7e86400 x1427319253357050/t0(0) o400-><?>@<?>:0/0 lens 224/0 e 0 to 0 dl 1367513258 ref 1 fl Interpret:H/0/ffffffff rc -107/-1
2013-05-02T20:34:41+02:00 n-oss02 LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 20590s: evicting client at 10.201.35.45@o2ib ns: filter-nero-OST0009_UUID lock: ffff88082d127900/0x35306a2f3b4ac151 lrc: 3/0,0 mode: PW/PW res: 76311076/0 rrc: 2 type: EXT [0->18446744073709551615] (req 8388608->18446744073709551615) flags: 0x20 remote: 0x60a148bfb337040c expref: 208 pid: 4312 timeout 27561644539
2013-05-02T20:35:27+02:00 n-oss02 LustreError: 4288:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-107) req@ffff8805ff210400 x1427382795207052/t0(0) o400-><?>@<?>:0/0 lens 224/0 e 0 to 0 dl 1367519733 ref 1 fl Interpret:H/0/ffffffff rc -107/-1 |
| Comment by Jinshan Xiong (Inactive) [ 06/May/13 ] |
|
What's the IP address of the node where the crash happened? |
| Comment by Adrian Ulrich (Inactive) [ 07/May/13 ] |
|
The NID of the evicted and crashed client is 10.201.32.34@o2ib.
I'm not very surprised about the eviction: the client is known to have a pretty bad InfiniBand interface; we have already observed random IB stalls on this hardware type. |
| Comment by Patrick Farrell (Inactive) [ 14/Aug/13 ] |
|
Jinshan, Cray has been hitting what appears to be this bug recently. We'll be running with http://review.whamcloud.com/#/c/5419/ on our test systems and will report back with what we see. |
| Comment by Cheng Shao (Inactive) [ 21/Aug/13 ] |
|
I understand that we revamped the osc_lock_flush code path to replace the page-based approach, as shown in cl_lock_page_out, with an extent-based one. In the old code path we would end up waiting in cl_sync_io_wait, and if the first wait timed out we would enter the second, infinite, uninterruptible wait anyway. That is equivalent to the effect of applying Jinshan's patch above. In other words, the simple fix doesn't make things worse. Therefore, should we move forward and get it landed? |
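The two-stage wait Cheng describes follows a common Lustre pattern: a bounded first l_wait_event(), then an unbounded wait if it times out. A hand-written sketch under that assumption (waitq, ext, io_finished() and timeout are placeholders, not the real cl_sync_io_wait()/osc_extent_wait() code):

/* First try a timed wait; if the IO has still not completed, log it and
 * fall back to waiting without a timeout, because the pages cannot be
 * safely discarded while their IO RPC is still in flight. */
struct l_wait_info lwi = LWI_TIMEOUT_INTR(cfs_time_seconds(timeout),
                                          NULL, NULL, NULL);
int rc = l_wait_event(waitq, io_finished(ext), &lwi);
if (rc == -ETIMEDOUT) {
        CERROR("wait for extent IO timed out, recovery in progress?\n");
        lwi = LWI_INTR(NULL, NULL);     /* second wait: no timeout */
        rc = l_wait_event(waitq, io_finished(ext), &lwi);
}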
| Comment by Cory Spitz [ 22/Aug/13 ] |
|
This bug has two patches, #5419 and #6262. One should be abandoned. |
| Comment by Patrick Farrell (Inactive) [ 11/Sep/13 ] |
|
Forgot to update this with our results. We haven't had this issue since landing http://review.whamcloud.com/#/c/5419/. In addition, we haven't noticed any of the possible issues with unkillable threads. |
| Comment by Christopher Morrone [ 11/Sep/13 ] |
|
LLNL has been carrying the 5419 patch in our tree and running with it in production. On one of our smaller BG/Q systems I counted the number of hits of the "wait ext to %d timedout, recovery in progress?" message in the console logs and found it hit 1258 times. There is a fair bit of clustering that I didn't spend the time collapsing, so that may be more like 20-100 times since May. |
| Comment by Peter Jones [ 24/Sep/13 ] |
|
So, a patch just landed to master for this issue. Is that enough to warrant marking the issue as resolved or is something further required? |
| Comment by Patrick Farrell (Inactive) [ 24/Sep/13 ] |
|
From the Cray perspective, I don't see anything further needed. LLNL might feel differently. |
| Comment by Christopher Morrone [ 24/Sep/13 ] |
|
Oh, hmm... actually we are running an earlier version of the patch. I have no idea what to make of the one that landed. |
| Comment by Marek Magrys [ 24/Sep/13 ] |
|
We also hit this bug on 2.4.0 clients and 2.4.1RC2 servers. I think that, if possible, the patch should be added to the 2.4.2 release, as it is severe. |
| Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ] |
|
Hi Chris, What's the version you're running right now? Jinshan |
| Comment by Christopher Morrone [ 25/Sep/13 ] |
|
Patch set 2. |
| Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ] |
|
Patch set 2 has more debug information. The real fix is the same. |
| Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ] |
|
The reason this patch was made is that OSC has to wait for the IO RPC to finish anyway, no matter how much time it needs; otherwise it will hit the assertion in discard_cb(). If the OST is in recovery, it may take a really long time for that OST to finish the RPC. |
| Comment by Christopher Morrone [ 01/Oct/13 ] |
|
But the fix looks different. Patch set 4 changes the LWI_INTR interrupt handler from LWI_ON_SIGNAL_NOOP to NULL, and does nothing else. Patch set 2 did not do that. Instead the only significant change in patch set 2 was the removal of the KLASSERT. I'm missing how those two things are equivalent. |
| Comment by Jinshan Xiong (Inactive) [ 02/Oct/13 ] |
|
Indeed. Patch set 2 allows the interrupt by setting the interruption callback to LWI_ON_SIGNAL_NOOP. In that case, it makes sense to upgrade to patch set 4. |
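To make the difference concrete, here is a hypothetical before/after of the wait setup being discussed, written from the comments above rather than quoted from patch 5419 (the surrounding osc_cache.c context is assumed; in the real code this is one and the same variable, shown twice only to contrast the two patch sets):

/* Patch set 2, as described above: a signal callback is installed, so the
 * wait can be interrupted by a signal and the thread may reach discard_cb()
 * while a page is still under writeback. */
struct l_wait_info lwi_ps2 = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);

/* Patch set 4, as described above: no signal callback, so l_wait_event()
 * keeps waiting until the IO RPC actually completes -- matching Jinshan's
 * point that OSC has to wait for the IO to finish no matter how long it
 * takes. */
struct l_wait_info lwi_ps4 = LWI_INTR(NULL, NULL);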