[LU-2779] LBUG in discard_cb: !(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page))) Created: 07/Feb/13  Updated: 07/Aug/15  Resolved: 25/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Critical
Reporter: Prakash Surya (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 2
Labels: sequoia, topsequoia

Attachments: File console.rzuseqlac2.bz2    
Issue Links:
Duplicate
duplicates LU-2683 Client deadlock in cl_lock_mutex_get Resolved
is duplicated by LU-4581 ASSERTION( (!(page->cp_type == CPT_CA... Resolved
Related
Severity: 3
Rank (Obsolete): 6732

 Description   

We hit this on Sequoia at shutdown time; I don't see an existing bug open for this crash either:

2013-02-01 14:35:53.812430 {R4-llnl} [bgqadmin]{5}.3.1: 
2013-02-01 14:35:53.812843 {R4-llnl} [bgqadmin]{5}.3.1: Broadcast message from root@seqio262-ib0
2013-02-01 14:35:53.813165 {R4-llnl} [bgqadmin]{5}.3.1: 	(unknown) at 14:35 ...
2013-02-01 14:35:53.813752 {R4-llnl} [bgqadmin]{5}.3.1: The system is going down for halt NOW!
2013-02-01 14:35:53.814093 {R4-llnl} [bgqadmin]{5}.2.3: Stopping Common I/O Services: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed: 
2013-02-01 14:35:53.814429 {R4-llnl} [bgqadmin]{5}.2.3: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) LBUG
2013-02-01 14:35:53.814746 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
2013-02-01 14:35:53.815076 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b40] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
2013-02-01 14:35:53.815397 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392bf0] [8000000000ab0c88] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
2013-02-01 14:35:53.815717 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab1450] .lbug_with_loc+0x50/0xc0 [libcfs]
2013-02-01 14:35:53.816042 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
2013-02-01 14:35:53.816392 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
2013-02-01 14:35:53.816732 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
2013-02-01 14:35:53.817047 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
2013-02-01 14:35:53.817363 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
2013-02-01 14:35:53.817877 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
2013-02-01 14:35:53.818248 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
2013-02-01 14:35:53.818565 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
2013-02-01 14:35:53.818901 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
2013-02-01 14:35:53.819221 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
2013-02-01 14:35:53.819535 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
2013-02-01 14:35:53.819908 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
2013-02-01 14:35:53.820387 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
2013-02-01 14:35:53.820707 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
2013-02-01 14:35:53.821021 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
2013-02-01 14:35:53.821337 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
2013-02-01 14:35:53.821652 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
2013-02-01 14:35:53.821966 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
2013-02-01 14:35:53.822282 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
2013-02-01 14:35:53.822595 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
2013-02-01 14:35:53.822909 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
2013-02-01 14:35:53.823222 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
2013-02-01 14:35:53.823534 {R4-llnl} [bgqadmin]{5}.2.3: Kernel panic - not syncing: LBUG
2013-02-01 14:35:53.823844 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
2013-02-01 14:35:53.824153 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b60] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
2013-02-01 14:35:53.824466 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392c10] [c000000000431ef4] .panic+0x80/0x1ac
2013-02-01 14:35:53.824776 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab14b0] .lbug_with_loc+0xb0/0xc0 [libcfs]
2013-02-01 14:35:53.825089 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
2013-02-01 14:35:53.825401 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
2013-02-01 14:35:53.825721 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
2013-02-01 14:35:53.826045 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
2013-02-01 14:35:53.826358 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
2013-02-01 14:35:53.826670 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
2013-02-01 14:35:53.826982 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
2013-02-01 14:35:53.827295 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
2013-02-01 14:35:53.827608 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
2013-02-01 14:35:53.827920 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
2013-02-01 14:35:53.828232 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
2013-02-01 14:35:53.828649 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
2013-02-01 14:35:53.829092 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
2013-02-01 14:35:53.829361 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
2013-02-01 14:35:53.829629 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
2013-02-01 14:35:53.829892 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
2013-02-01 14:35:53.830155 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
2013-02-01 14:35:53.830420 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
2013-02-01 14:35:53.830686 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
2013-02-01 14:35:53.830952 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
2013-02-01 14:35:53.831217 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
2013-02-01 14:35:53.831483 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c


 Comments   
Comment by Bruno Faccini (Inactive) [ 08/Feb/13 ]

This assert looks familiar to me, since I triggered it during LU-874 tests and reproduction attempts, running with a Lustre version that was built with the LU-1680 patches ...

I will first do some historical research to understand the conditions leading to this problem.

Comment by Prakash Surya (Inactive) [ 11/Feb/13 ]

This might end up being related to, or a duplicate of, LU-2683. I'm looking at a console log for a node which hit this a few days ago, and it triggered when a reboot was attempted after what appears to be the LU-2683 hang.

For example, I see these messages:

2013-02-04 10:36:46.046647 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:329:ptlrpc_invalidate_import()) ls1-OST0132_UUID: rc = -110 waiting for callback (1 != 0)
2013-02-04 10:36:46.047324 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:329:ptlrpc_invalidate_import()) Skipped 91 previous similar messages
2013-02-04 10:36:46.047897 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:355:ptlrpc_invalidate_import()) @@@ still on sending list  req@c0000002818f5000 x1425746435123190/t0(0) o101->ls1-OST0132-osc-c0000003e98eb000@172.20.2.106@o2ib500:28/4 lens 328/368 e 0 to 0 dl 1359967526 ref 1 fl Interpret:RE/0/0 rc -5/301
2013-02-04 10:36:46.048227 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:355:ptlrpc_invalidate_import()) Skipped 98 previous similar messages
2013-02-04 10:36:46.048870 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:371:ptlrpc_invalidate_import()) ls1-OST0132_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
2013-02-04 10:36:46.049450 {DefaultControlEventListener} [mmcs]{739}.14.1: LustreError: 35299:0:(import.c:371:ptlrpc_invalidate_import()) Skipped 101 previous similar messages

And many other network/recovery related messages in the logs for hours before the LBUG was triggered.

Comment by Christopher Morrone [ 12/Feb/13 ]

We've been hitting this more frequently recently, on login nodes in particular.

Comment by Prakash Surya (Inactive) [ 12/Feb/13 ]

Chris, do we see this in the absence of LU-2683 on the LAC nodes? All of the occurrences I've seen in the ION logs are at reboot time after LU-2683 likely occurred.

Comment by Prakash Surya (Inactive) [ 12/Feb/13 ]

Jinshan, when you get some time, can you look over this ticket? We've been hitting this constantly in production over the past couple of days.

Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ]

I'm looking at this one.

Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ]

I have identified the problem; I will work out a patch tomorrow.

Comment by Bruno Faccini (Inactive) [ 13/Feb/13 ]

Sorry to be late on this, but it took me some time to retrieve the information from when I hit this bug weeks ago ...

In fact I have already seen it when I tried to shut down nodes with hung threads, with the following kind of stacks that I was able to dump:

==============================================================================================
Nov 27 01:09:17 client-15 kernel: reproducer S 0000000000000001 0 8231 1 0x00000080
Nov 27 01:09:17 client-15 kernel: ffff8802e399d538 0000000000000082 ffff8802e399d538 ffffffff814fea8a
Nov 27 01:09:17 client-15 kernel: ffff8802e399d4b8 0000000000000286 000000000000001c 00000000000000ff
Nov 27 01:09:17 client-15 kernel: ffff88032f0e3af8 ffff8802e399dfd8 000000000000fb88 ffff88032f0e3af8
Nov 27 01:09:17 client-15 kernel: Call Trace:
Nov 27 01:09:17 client-15 kernel: [<ffffffff814fea8a>] ? schedule_timeout+0x19a/0x2e0
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e0d77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05f6d27>] osc_extent_wait+0x5f7/0x6c0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05f714c>] osc_cache_wait_range+0x35c/0x9a0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f85a3e>] ? cl_object_put+0xe/0x10 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05f38b8>] ? osc_io_unplug0+0x748/0x12a0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffff8127ce66>] ? vsnprintf+0x2b6/0x5f0
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05fc731>] osc_cache_writeback_range+0xd61/0x1160 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e0c27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e22d97>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e222d2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f8709d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e25cbb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05e40f6>] osc_lock_flush+0x86/0x200 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa05e4357>] osc_lock_cancel+0xe7/0x1c0 [osc]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f8d0a5>] cl_lock_cancel0+0x75/0x160 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f8ece6>] cl_lock_hold_release+0x1c6/0x2b0 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f90837>] cl_lock_unhold+0x37/0x130 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa067e5f8>] lov_sublock_release+0x1a8/0x2a0 [lov]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa06811fb>] lov_lock_enqueue+0x2cb/0x830 [lov]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f9149c>] cl_enqueue_try+0xfc/0x310 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f9295d>] cl_enqueue_locked+0x6d/0x210 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0f9363e>] cl_lock_request+0x7e/0x280 [obdclass]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0be3deb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0be4677>] cl_glimpse_size0+0x187/0x190 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0ba0ee2>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0e22134>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0ba0ff9>] ll_getattr_it+0x49/0x170 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffffa0ba1157>] ll_getattr+0x37/0x40 [lustre]
Nov 27 01:09:17 client-15 kernel: [<ffffffff812143d3>] ? security_inode_getattr+0x23/0x30
Nov 27 01:09:17 client-15 kernel: [<ffffffff81180571>] vfs_getattr+0x51/0x80
Nov 27 01:09:17 client-15 kernel: [<ffffffff8118082f>] vfs_fstat+0x3f/0x60
Nov 27 01:09:17 client-15 kernel: [<ffffffff81180874>] sys_newfstat+0x24/0x40
Nov 27 01:09:17 client-15 kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Nov 27 01:09:17 client-15 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
==============================================================================================

before the node crashed during shutdown, within the same task context and with the following stack, which is the same as the one you got:

LustreError: 8231:0:(cl_lock.c:1960:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed: 
LustreError: 8231:0:(cl_lock.c:1960:discard_cb()) LBUG
[ OK Pid: 8231, comm: reproducer
Call Trace:
[<ffffffffa0e0c905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0e0cf17>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0f8f66c>] discard_cb+0x1ec/0x1f0 [obdclass]
[<ffffffffa0f8c864>] cl_page_gang_lookup+0x1f4/0x400 [obdclass]
[<ffffffffa0f8f480>] ? discard_cb+0x0/0x1f0 [obdclass]
[<ffffffffa0f8f480>] ? discard_cb+0x0/0x1f0 [obdclass]
[<ffffffffa0f8f33e>] cl_lock_discard_pages+0x11e/0x1f0 [obdclass]
[<ffffffffa05e4180>] osc_lock_flush+0x110/0x200 [osc]
[<ffffffffa05e4357>] osc_lock_cancel+0xe7/0x1c0 [osc]
[<ffffffffa0f8d0a5>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa0f8ece6>] cl_lock_hold_release+0x1c6/0x2b0 [obdclass]
[<ffffffffa0f90837>] cl_lock_unhold+0x37/0x130 [obdclass]
[<ffffffffa067e5f8>] lov_sublock_release+0x1a8/0x2a0 [lov]
[<ffffffffa06811fb>] lov_lock_enqueue+0x2cb/0x830 [lov]
[<ffffffffa0f9149c>] cl_enqueue_try+0xfc/0x310 [obdclass]
[<ffffffffa0f9295d>] cl_enqueue_locked+0x6d/0x210 [obdclass]
[<ffffffffa0f9363e>] cl_lock_request+0x7e/0x280 [obdclass]
[<ffffffffa0be3deb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
[<ffffffffa0be4677>] cl_glimpse_size0+0x187/0x190 [lustre]
[<ffffffffa0ba0ee2>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
[<ffffffffa0e22134>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa0ba0ff9>] ll_getattr_it+0x49/0x170 [lustre]
[<ffffffffa0ba1157>] ll_getattr+0x37/0x40 [lustre]
[<ffffffff812143d3>] ? security_inode_getattr+0x23/0x30
[<ffffffff81180571>] vfs_getattr+0x51/0x80
[<ffffffff8118082f>] vfs_fstat+0x3f/0x60
[<ffffffff81180874>] sys_newfstat+0x24/0x40
[<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

and I confirm that there were also recent messages, like the ones you have, about RPCs stuck in the "Unregistering" phase and a possibly sluggish network.

Can you confirm whether you have the same kind of stacks for hung threads in the crash dumps from previous occurrences?

If so, Jinshan, could it be possible that we are again handling pages that should already have been discarded but are still stuck (orphaned?) awaiting writeback?

BTW, I got this LBUG when trying to reproduce the LU-874 problem with its known reproducer, and LU-874 has also been linked to LU-2683. Since that time I have also installed http://review.whamcloud.com/5208 from LU-2683 and have not been able to reproduce it anymore.

Is it possible for you to run the LU-874 reproducer on the system where the failure occurs? I can help refresh the details if required, but the code is very simple and so is the way to run it, as described in LU-874.

It would also help to have the "dlmtrace", "cache" and "page" debug flags set on the client nodes likely to reproduce the problem.

Comment by Bruno Faccini (Inactive) [ 13/Feb/13 ]

Jinshan, just some thoughts after re-reading the source code involved and talking it over with Johann: don't you think it would be much safer to call cl_page_own() BEFORE any ASSERTs in discard_cb()?

It seems to me that this would avoid the situation where we exit osc_cache_writeback_range() with an error (timeout?) that is never checked in osc_lock_flush(); osc_lock_flush() then calls cl_lock_discard_pages()/cl_page_gang_lookup()/discard_cb(), re-walks the same pages, and finds some that trigger the assert.
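
For illustration, a minimal sketch of the ordering suggested above, assuming the cl_page_gang_lookup() callback shape seen in the traces in this ticket; the prototype, return code and exact helper calls are reconstructed from memory, not taken from any actual patch:

/* Illustrative sketch only -- not the real cl_lock.c code.
 * Idea: take ownership of the page first (cl_page_own() waits until the
 * page is no longer busy, e.g. until PG_writeback is cleared), and only
 * then assert that the writeback bit is clear, so a page left behind by
 * a timed-out osc_cache_writeback_range() no longer trips the LASSERT.
 */
static int discard_cb_sketch(const struct lu_env *env, struct cl_io *io,
                             struct cl_page *page, void *cbdata)
{
        if (cl_page_own(env, io, page) == 0) {
                /* Only now is it safe to insist the page is not under
                 * writeback. */
                LASSERT(page->cp_type != CPT_CACHEABLE ||
                        !PageWriteback(cl_page_vmpage(env, page)));

                cl_page_discard(env, io, page);
                cl_page_disown(env, io, page);
        }
        return CLP_GANG_OKAY;   /* keep walking the page gang */
}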

Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ]

Hi Bruno, you're right about the code. Obviously the process waiting for the IO to finish received a signal from init because the system was shutting down, so yes, this problem can be fixed by taking away this LASSERT, which is what I planned to do last night. However, the process will still have to wait for page writeback to finish in cl_page_own(), and it will hang forever if the OST is already shut down at that moment. Therefore you have to hit the power button to shut the node down.

This led me to think about how to shut down a client node gracefully. Obviously we should notify the ptlrpcd threads of this event and then abort all of the outstanding write RPCs with an error, so that the waiting processes can be notified and the mount point can be cleaned up. This is out of the scope of this issue.

BTW, I pushed a new patch for LU-2683/LU-874, will you please verify it again? Thanks.

Comment by Prakash Surya (Inactive) [ 13/Feb/13 ]

I also want to note that we hit this on the login nodes without a manual reboot or restart. There are traces of network issues (perhaps LU-2683; the logs are consistent with it) and evictions, followed by the ASSERT triggering on its own. Without in-depth knowledge of the client code, I'm inclined to think that the client was trying to clean up after the evictions and happened to try to discard a page which still had its writeback bit set.

Bruno, I can probably get some time to run your reproducer, but I'm not sure it will be enlightening in any way. Jinshan, I don't see an updated patch set on http://review.whamcloud.com/5208 yet (still patch set 2); have you pushed the new patch somewhere else?

Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ]

The only path I can think of right now is that osc_extent_wait() is interrupted so that it can't wait for the IO to finish. Can you try to reproduce it with DLMTRACE and CACHE enabled on the client side?

Comment by Prakash Surya (Inactive) [ 13/Feb/13 ]

I can try, although it's not dumping a Lustre debug log when it asserts, so I need to remind myself whether it is supposed to or not. Also, it's hitting the assertion without any messages to the console in the nearly two hours prior to the crash, so maybe this isn't as closely related to evictions as I previously thought.

Comment by Prakash Surya (Inactive) [ 13/Feb/13 ]

I do see a couple of messages from osc_extent_wait nearly two hours before one specific crash:

2013-02-13 01:45:43 LustreError: 47189:0:(osc_cache.c:896:osc_extent_wait()) extent c000000c3d8c2990@{[0 -> 0/15], [3|0|+|rpc|wihY|c000000c08582450], [65536|1|+|-|c000000bf0414730|16|c000000f53444090]} lsa-OST0101-osc-c000000f544e2180: wait ext to 0 timedout, recovery in progress?
2013-02-13 01:45:48 LustreError: 3843:0:(osc_cache.c:896:osc_extent_wait()) extent c000000c3d8c6158@{[0 -> 0/15], [3|0|+|rpc|wihY|c000000c08581a30], [65536|1|+|-|c000000bf041fe10|16|c000000f37c8b4e0]} lsa-OST00eb-osc-c000000f544e2180: wait ext to 0 timedout, recovery in progress?

Not sure if that's related though, as I haven't looked at the code to decipher its meaning yet.

Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ]

Please apply this patch: http://review.whamcloud.com/5419 and monitor the console while you're testing. Thanks.

Comment by Prakash Surya (Inactive) [ 13/Feb/13 ]

OK. I'll pull that into a tag and see if I can get it installed soon. It will be going onto one of our production machines, so I just want to check that it's safe to remove that assertion, right?

Comment by Jinshan Xiong (Inactive) [ 13/Feb/13 ]

won't be worse.

Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ]

Prakash, right, don't waste time running the LU-874 reproducer. Re-reading all our recent comments and the source code, I am now sure that it was just one more occurrence of the same scenario we are tracking here, one that you can trigger easily with your own workload.
So now we only need to wait for the debug traces, including the ones Jinshan added if you applied his patch from http://review.whamcloud.com/5419, to learn why we returned from osc_extent_wait().
But then, could this simply be, once again, because of a signal (SIGINT/SIGHUP from an interactive session, Slurm for batch jobs, ...)?

Jinshan, testing of http://review.whamcloud.com/5208 patch set #2 is running now; I will update you on how it goes.

Comment by Prakash Surya (Inactive) [ 19/Feb/13 ]

Jinshan, I've attached "console.rzuseqlac2.bz2" which contains console output from a node running with 5419. There are many messages in the logs that are similar to what has been previously reported.

Comment by Bruno Faccini (Inactive) [ 25/Feb/13 ]

Jinshan, in the last trace from Prakash, I find only ETIMEDOUT/-110 messages from osc_extent_wait()!

Prakash, it would also have been interesting to have a crash dump or Alt-SysRq-T output to see whether there were hung tasks in cl_page_own() (or better, in the second call to l_wait_event() in osc_extent_wait()), which can already be strongly suspected since you were not able to shut down gracefully.

But anyway, clearly when handling errors or timeouts we need to avoid the LBUG and at least wait or retry forever, or else decide what to do with these orphan pages.

Comment by Jinshan Xiong (Inactive) [ 25/Feb/13 ]

After reading the backtrace, I realize this is the same problem as LU-2683. The IO has already finished, but the reply can't be handled because the ptlrpcd thread was blocked in cl_lock_mutex_get(). I'm going to find a solution for both problems.

Just in case, did you apply both patches 5419 and 5208 in your test?

Sorry for the delayed response, BTW.

Comment by Bruno Faccini (Inactive) [ 25/Feb/13 ]

Jinshan, ptlrpcd being blocked can explain/cause the timeout; is that what you mean? But anyway, in all cases, don't you think we need to wait/retry rather than trigger the LBUG/assert?

Also, your latest patch for LU-2683 was successful, wasn't it?

Comment by Jinshan Xiong (Inactive) [ 26/Feb/13 ]

Yes, it was. Anyway, I've understood the root cause of this problem, including why it hit the assertion. The patch for LU-2683 caused another problem, which I will take a look at.

I'm going to work out a new patch to address the LASSERT problem.

Comment by Prakash Surya (Inactive) [ 26/Feb/13 ]

Jinshan, the attached console was from a machine running the 2.3.58-13chaos tag which does have both 5419 and 5208 applied.

Comment by Jinshan Xiong (Inactive) [ 27/Feb/13 ]

I saw some suspicious backtraces in the console messages. It would be really useful if you happened to collect backtraces of all processes in the system. Also, did you see any suspicious messages on the corresponding OSS?

The OSS ran into problems:

2013-02-16 01:37:23 LustreError: 11-0: lsa-OST0048-osc-c000000f546f0600: Communicating with 172.16.69.25@tcp1, operation obd_ping failed with -107
2013-02-16 01:37:23 Lustre: lsa-OST0048-osc-c000000f546f0600: Connection to lsa-OST0048 (at 172.16.69.25@tcp1) was lost; in progress operations using this service will wait for recovery to complete
2013-02-16 01:37:23 LustreError: 167-0: lsa-OST0048-osc-c000000f546f0600: This client was evicted by lsa-OST0048; in progress operations using this service will fail.

And a backtrace from the console:

2013-02-16 01:39:03 INFO: task ldlm_bl_13:28757 blocked for more than 120 seconds.
2013-02-16 01:39:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-02-16 01:39:03 ldlm_bl_13    D 0000000000000000     0 28757      2 0x00008000
2013-02-16 01:39:03 Call Trace:
2013-02-16 01:39:03 [c000000f3d813550] [c000000f3d813600] 0xc000000f3d813600 (unreliable)
2013-02-16 01:39:03 [c000000f3d813720] [c0000000000142d8] .__switch_to+0xf8/0x1d0
2013-02-16 01:39:03 [c000000f3d8137b0] [c0000000005bb0a8] .schedule+0x3f8/0xd30
2013-02-16 01:39:03 [c000000f3d813ab0] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
2013-02-16 01:39:03 [c000000f3d813b90] [c0000000005bd42c] .mutex_lock+0x5c/0x60
2013-02-16 01:39:03 [c000000f3d813c10] [d00000000f292398] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
2013-02-16 01:39:03 [c000000f3d813ca0] [d0000000111ecf98] .osc_ldlm_blocking_ast+0x98/0x4e0 [osc]
2013-02-16 01:39:03 [c000000f3d813d80] [d00000001095b440] .ldlm_handle_bl_callback+0x1b0/0x7d0 [ptlrpc]
2013-02-16 01:39:03 [c000000f3d813e40] [d00000001095bdd0] .ldlm_bl_thread_main+0x370/0x610 [ptlrpc]
2013-02-16 01:39:03 [c000000f3d813f90] [c000000000032fd4] .kernel_thread+0x54/0x70
2013-02-16 01:39:03 INFO: task ldlm_bl_14:28758 blocked for more than 120 seconds.
2013-02-16 01:39:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-02-16 01:39:03 ldlm_bl_14    D 0000000000000000     0 28758      2 0x00008000
2013-02-16 01:39:03 Call Trace:
2013-02-16 01:39:03 [c000000dadbfb550] [c000000dadbfb600] 0xc000000dadbfb600 (unreliable)
2013-02-16 01:39:03 [c000000dadbfb720] [c0000000000142d8] .__switch_to+0xf8/0x1d0
2013-02-16 01:39:03 [c000000dadbfb7b0] [c0000000005bb0a8] .schedule+0x3f8/0xd30
2013-02-16 01:39:03 [c000000dadbfbab0] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
2013-02-16 01:39:03 [c000000dadbfbb90] [c0000000005bd42c] .mutex_lock+0x5c/0x60
2013-02-16 01:39:03 [c000000dadbfbc10] [d00000000f292398] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
2013-02-16 01:39:03 [c000000dadbfbca0] [d0000000111ecf98] .osc_ldlm_blocking_ast+0x98/0x4e0 [osc]
2013-02-16 01:39:03 [c000000dadbfbd80] [d00000001095b440] .ldlm_handle_bl_callback+0x1b0/0x7d0 [ptlrpc]
2013-02-16 01:39:03 [c000000dadbfbe40] [d00000001095bdd0] .ldlm_bl_thread_main+0x370/0x610 [ptlrpc]
2013-02-16 01:39:03 [c000000dadbfbf90] [c000000000032fd4] .kernel_thread+0x54/0x70
2013-02-16 01:39:03 INFO: task atsmercury_back:21298 blocked for more than 120 seconds.
2013-02-16 01:39:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-02-16 01:39:03 atsmercury_ba D 00000080538c8c60     0 21298  21297 0x00008000
2013-02-16 01:39:03 Call Trace:
2013-02-16 01:39:03 [c0000005af6b2ed0] [c0000005af6b2f80] 0xc0000005af6b2f80 (unreliable)
2013-02-16 01:39:03 [c0000005af6b30a0] [c0000000000142d8] .__switch_to+0xf8/0x1d0
2013-02-16 01:39:03 [c0000005af6b3130] [c0000000005bb0a8] .schedule+0x3f8/0xd30
2013-02-16 01:39:03 [c0000005af6b3430] [c0000000005bcb4c] .__mutex_lock_slowpath+0x1bc/0x2d0
2013-02-16 01:39:03 [c0000005af6b3510] [c0000000005bd42c] .mutex_lock+0x5c/0x60
2013-02-16 01:39:03 [c0000005af6b3590] [d00000000f292398] .cl_lock_mutex_get+0xc8/0x110 [obdclass]
2013-02-16 01:39:03 [c0000005af6b3620] [d00000000f295c7c] .cl_lock_hold_mutex+0x11c/0x960 [obdclass]
2013-02-16 01:39:03 [c0000005af6b3750] [d00000000f297d60] .cl_lock_request+0xc0/0x370 [obdclass]
2013-02-16 01:39:03 [c0000005af6b3820] [d00000001369fb94] .cl_glimpse_lock+0x2b4/0x640 [lustre]
2013-02-16 01:39:03 [c0000005af6b3910] [d0000000136a0118] .cl_glimpse_size0+0x1f8/0x270 [lustre]
2013-02-16 01:39:03 [c0000005af6b39e0] [d0000000136315f0] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
2013-02-16 01:39:03 [c0000005af6b3aa0] [d0000000136316d0] .ll_getattr_it+0x40/0x180 [lustre]
2013-02-16 01:39:03 [c0000005af6b3b40] [d000000013631854] .ll_getattr+0x44/0x60 [lustre]
2013-02-16 01:39:03 [c0000005af6b3bf0] [c0000000001c8fe4] .vfs_getattr+0x74/0xf0
2013-02-16 01:39:03 [c0000005af6b3c90] [c0000000001c90e0] .vfs_fstatat+0x80/0xb0
2013-02-16 01:39:03 [c0000005af6b3d30] [c0000000001c9274] .SyS_newlstat+0x24/0x50
2013-02-16 01:39:03 [c0000005af6b3e30] [c000000000008564] syscall_exit+0x0/0x40
2013-02-16 01:39:03 LustreError: 21507:0:(import.c:329:ptlrpc_invalidate_import()) lsa-OST0048_UUID: rc = -110 waiting for callback (1 != 0)

Comment by Jinshan Xiong (Inactive) [ 27/Feb/13 ]

From the stack trace above, the blocking AST was blocked acquiring the cl_lock mutex, so the reply of the write RPC shouldn't be blocked by it, because a different ptlrpc portal should be used. The status on the OST side is also very important.

Comment by Prakash Surya (Inactive) [ 27/Feb/13 ]

Here are the only messages on the OSS console that I see regarding this client on 2013-02-16:

2013-02-16 01:36:56 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 84460s: evicting client at 172.21.1.62@o2ib200  ns: filter-lsa-OST0048_UUID lock: ffff880078fbf800/0x63cba3a8ff7a6537 lrc: 3/0,0 mode: PW/PW res: 19309611/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->65535) flags: 0x10020 remote: 0x4a6511c74b338406 expref: 6 pid: 3504 timeout 19649874112
2013-02-16 01:41:10 Lustre: lsa-OST0048: haven't heard from client a15eb878-e92e-c8f0-0c41-9520d958df72 (at 172.21.1.62@o2ib200) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8801102ff000, cur 1361007670 expire 1361007520 last 1361007443

Comment by Jinshan Xiong (Inactive) [ 03/Mar/13 ]

Hi Prakash,

Bruno will try to reproduce this issue locally. Can you please reproduce it and collect backtraces of all processes? Also, please apply patch 2638 and patch set 2 of 5419 when doing the test. Thanks in advance.

Comment by Bruno Faccini (Inactive) [ 06/Mar/13 ]

Jinshan, Prakash,

Since the end of last week, I have been doing intensive testing with the LU-874/LU-2683 patch (aka change 5208) and I have never been able to reproduce this problem, even when simulating/injecting networking issues to cause the thread hangs where the bad things start... Maybe you have a better idea of how to get into the same situation?

Prakash, maybe you can help too if you can characterize the behavior of the workload that was running at the time of the crashes?

Comment by Adrian Ulrich (Inactive) [ 26/Apr/13 ]

We recently started to hit this issue (3 crashes so far) while running the git-v2_3_61 client.

I could upload the dumped `vmcore' from (at least) two crashes if someone would like to have a look at them.

Comment by Peter Jones [ 26/Apr/13 ]

Yes please Adrian!

Comment by Adrian Ulrich (Inactive) [ 26/Apr/13 ]

Does Whamcloud have a private/write-only FTP server for uploading the cores (~16GB)? The crash data is from a login node and will therefore include private user data.

If not, just give me a PGP key ID and I'll encrypt the cores and upload them to our own web server.

Comment by Peter Jones [ 26/Apr/13 ]

Yes, we have a private FTP area. I will mail you the details directly.

Comment by Adrian Ulrich (Inactive) [ 26/Apr/13 ]

Thanks: I started to upload the first core to LU-2779/127.0.0.1-2013-04-17-10:22:30/

I'll upload the 2nd vmcore file on Monday (the first one is still uploading).

Comment by Peter Jones [ 26/Apr/13 ]

Thanks Adrian!

Comment by Jinshan Xiong (Inactive) [ 01/May/13 ]

Hi Adrian, can you please try patch:

commit 2448de6c51ceccea6a308d73d7960f236e0c0847
Author: Jinshan Xiong <jinshan.xiong@intel.com>
Date: Tue Jan 29 16:35:49 2013 -0800

LU-2683 lov: release all locks in closure to release sublock

and see if it helps.

Comment by Adrian Ulrich (Inactive) [ 02/May/13 ]

We already have this patch: Our RPM was compiled using the v2_3_63_0 tag (which includes commit 2448de6c51ceccea6a308d73d7960f236e0c0847)

Btw: We just had yet another crash, so I could provide yet another 8GB vmcore.

Comment by Jinshan Xiong (Inactive) [ 02/May/13 ]

Can you please upload the vmcore along with the kernel modules from the client side? And what's the kernel version? Just in case, the tip of your git tree is f3ef9ea9, right?

Comment by Adrian Ulrich (Inactive) [ 02/May/13 ]

I'll upload the latest crash + the lustre *.ko-modules to "uploads/LU-2779-2" in a few minutes. (The upload will probably take a few hours to complete :/)

We are running 2.6.32-358.0.1.el6.x86_64

Our git HEAD is slightly newer:

$ git rev-parse HEAD
b859a51e5fa580797dd833bb8e5ec7d6e41411af

Uhm: I just noticed that the crashed node was still running 2.3.62, so the HEAD used was 87ee788bd137d0d82ca107a7615f18f420a3699a. Sorry for mixing this up.

Comment by Jinshan Xiong (Inactive) [ 02/May/13 ]

Thank you. I will take a look at it.

Comment by Jinshan Xiong (Inactive) [ 03/May/13 ]

Hi Adrian,

Please check the patch at http://review.whamcloud.com/6262; with this patch, there should be no more assertion failures.

However, there are other problems from what I can see in the console messages. The client was evicted by some OSTs; for example, this kind of message:

<4>Lustre: nero-OST0006-osc-ffff881c3a770400: Connection to nero-OST0006 (at 10.201.62.37@o2ib) was lost; in progress operations using this service will wait for recovery to complete
<3>LustreError: 167-0: nero-OST0006-osc-ffff881c3a770400: This client was evicted by nero-OST0006; in progress operations using this service will fail.

I need the log on the OST side to know what happened.

Also, there are a couple of OOM messages. Can you please show me the output of "lctl get_param llite.*.max_cached_mb" and "lctl get_param osc.*.cached_mb" on the client side? Thanks!

Comment by Adrian Ulrich (Inactive) [ 06/May/13 ]

Thanks for the patch! I'll give it a try.

I know about the eviction and the OOM errors: this is from a login node where people run really ugly stuff (forkbombing themselves, allocating gigabytes of RAM, etc.).

lctl output:

$ lctl get_param llite.nero-ffff8804388ba800.max_cached_mb
 llite.nero-ffff8804388ba800.max_cached_mb=
 users: 32
 max_cached_mb: 96767
 used_mb: 49481
 unused_mb: 47286
 reclaim_count: 0
$ lctl get_param osc.*.osc_cached_mb
osc.nero-OST0000-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1474
busy_cnt: 0
osc.nero-OST0001-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1297
busy_cnt: 2
osc.nero-OST0002-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1387
busy_cnt: 1
osc.nero-OST0003-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1611
busy_cnt: 2
osc.nero-OST0004-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1593
busy_cnt: 3
osc.nero-OST0005-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1369
busy_cnt: 1
osc.nero-OST0006-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1361
busy_cnt: 0
osc.nero-OST0007-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1253
busy_cnt: 0
osc.nero-OST0008-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1044
busy_cnt: 29
osc.nero-OST0009-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1358
busy_cnt: 2
osc.nero-OST000a-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1075
busy_cnt: 0
osc.nero-OST000b-osc-ffff8804388ba800.osc_cached_mb=used_mb: 947
busy_cnt: 1
osc.nero-OST000c-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1029
busy_cnt: 1
osc.nero-OST000d-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1015
busy_cnt: 25
osc.nero-OST000e-osc-ffff8804388ba800.osc_cached_mb=used_mb: 872
busy_cnt: 2
osc.nero-OST000f-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1324
busy_cnt: 0
osc.nero-OST0010-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1001
busy_cnt: 1
osc.nero-OST0011-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1651
busy_cnt: 1
osc.nero-OST0012-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1379
busy_cnt: 0
osc.nero-OST0013-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1558
busy_cnt: 3
osc.nero-OST0014-osc-ffff8804388ba800.osc_cached_mb=used_mb: 2142
busy_cnt: 1
osc.nero-OST0015-osc-ffff8804388ba800.osc_cached_mb=used_mb: 2085
busy_cnt: 2
osc.nero-OST0016-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1758
busy_cnt: 0
osc.nero-OST0017-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1167
busy_cnt: 1
osc.nero-OST0018-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1551
busy_cnt: 1
osc.nero-OST0019-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1236
busy_cnt: 1
osc.nero-OST001a-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1200
busy_cnt: 2
osc.nero-OST001b-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1463
busy_cnt: 1
osc.nero-OST001c-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1130
busy_cnt: 1
osc.nero-OST001d-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1304
busy_cnt: 2
osc.nero-OST001e-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1516
busy_cnt: 1
osc.nero-OST001f-osc-ffff8804388ba800.osc_cached_mb=used_mb: 1704
busy_cnt: 0

The relevant logs on the OSS during the eviction:

2013-05-02T17:25:40+02:00 n-oss02 LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 12291s: evicting client at 10.201.32.34@o2ib  ns: filter-nero-OST0019_UUID lock: ffff8801b1d256c0/0x35306a2f3a794cd
3 lrc: 3/0,0 mode: PW/PW res: 76295681/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x63d4bbb0c510ffb7 expref: 3351 pid: 4333 timeout 27550303246
2013-05-02T17:25:45+02:00 n-oss02 LustreError: 4489:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8805f1b0e000 x1431201755470460/t0(0) o400-><?>@<?>:0/0 lens 224/0 e 0 to 0 dl 1367508351 ref 1 fl Interpret:H/0/fff
fffff rc -107/-1
2013-05-02T17:38:47+02:00 n-oss02 LustreError: 2752:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
2013-05-02T17:38:47+02:00 n-oss02 LustreError: 2752:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 10.201.32.34@o2ib (68): c: 0, oc: 0, rc: 8
2013-05-02T17:39:32+02:00 n-oss02 LustreError: 138-a: nero-OST0001: A client on nid 10.201.32.34@o2ib was evicted due to a lock glimpse callback time out: rc -4
2013-05-02T18:47:14+02:00 n-oss02 LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 7026s: evicting client at 10.201.43.27@o2ib  ns: filter-nero-OST0001_UUID lock: ffff88047bd866c0/0x35306a2f3c0c65ec
 lrc: 3/0,0 mode: PW/PW res: 76322164/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80000020 remote: 0xf6d8f9457b07fb6a expref: 176 pid: 3961 timeout 27555197173
2013-05-02T18:47:32+02:00 n-oss02 LustreError: 4288:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8805d7e86400 x1427319253357050/t0(0) o400-><?>@<?>:0/0 lens 224/0 e 0 to 0 dl 1367513258 ref 1 fl Interpret:H/0/fff
fffff rc -107/-1
2013-05-02T20:34:41+02:00 n-oss02 LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 20590s: evicting client at 10.201.35.45@o2ib  ns: filter-nero-OST0009_UUID lock: ffff88082d127900/0x35306a2f3b4ac15
1 lrc: 3/0,0 mode: PW/PW res: 76311076/0 rrc: 2 type: EXT [0->18446744073709551615] (req 8388608->18446744073709551615) flags: 0x20 remote: 0x60a148bfb337040c expref: 208 pid: 4312 timeout 27561644539
2013-05-02T20:35:27+02:00 n-oss02 LustreError: 4288:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8805ff210400 x1427382795207052/t0(0) o400-><?>@<?>:0/0 lens 224/0 e 0 to 0 dl 1367519733 ref 1 fl Interpret:H/0/fff
fffff rc -107/-1

Comment by Jinshan Xiong (Inactive) [ 06/May/13 ]

What's the IP address of the node where the crash happened?

Comment by Adrian Ulrich (Inactive) [ 07/May/13 ]

The NID of the evicted and crashed client is: 10.201.32.34@o2ib

I'm not very surprised about the eviction: the client is known to have a pretty bad InfiniBand interface, and we have already observed random IB stalls on this hardware type.

Comment by Patrick Farrell (Inactive) [ 14/Aug/13 ]

Jinshan,

Cray has been hitting what appears to be this bug recently. We'll be running with http://review.whamcloud.com/#/c/5419/ on our test systems and will report back with what we see.

Comment by Cheng Shao (Inactive) [ 21/Aug/13 ]

I understand that we revamped the osc_lock_flush code path, replacing the page-based approach (as seen in cl_lock_page_out) with an extent-based one. In the old code path, we would end up waiting in cl_sync_io_wait, and if the first wait timed out we would enter a second, infinite, uninterruptible wait anyway. That is equivalent to the effect of applying Jinshan's patch above. In other words, the simple fix doesn't make things worse. Therefore, should we move forward and get it landed?
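
For readers not familiar with that code, here is a minimal sketch of the "timed wait, then wait again" pattern being described, written against Lustre's l_wait_event()/LWI_* wrappers of that era. The flush_wait_sketch structure, its fields and the 600-second timeout are hypothetical stand-ins, and the macro arguments are reconstructed from memory rather than copied from either patch:

/* Hypothetical stand-in for the extent (or page list) being flushed. */
struct flush_wait_sketch {
        wait_queue_head_t fw_waitq;    /* woken when the flush IO completes */
        int               fw_io_done;  /* set by the RPC completion path    */
};

static int wait_for_flush_sketch(struct flush_wait_sketch *fw)
{
        /* First pass: bounded and interruptible, so the caller can at
         * least log a "wait ext ... timedout, recovery in progress?"
         * style warning when the OST is slow or in recovery. */
        struct l_wait_info lwi = LWI_TIMEOUT_INTR(cfs_time_seconds(600), NULL,
                                                  LWI_ON_SIGNAL_NOOP, NULL);
        int rc;

        rc = l_wait_event(fw->fw_waitq, fw->fw_io_done, &lwi);
        if (rc == -ETIMEDOUT) {
                /* Second pass: no timeout.  Whether this wait may still be
                 * broken by a signal depends on the LWI_INTR callback; that
                 * choice is exactly what the later patch-set discussion on
                 * http://review.whamcloud.com/5419 is about. */
                struct l_wait_info lwi_forever = LWI_INTR(NULL, NULL);

                rc = l_wait_event(fw->fw_waitq, fw->fw_io_done, &lwi_forever);
        }
        return rc;
}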

Comment by Cory Spitz [ 22/Aug/13 ]

This bug has two patches, #5419 and #6262. One should be abandoned.

Comment by Patrick Farrell (Inactive) [ 11/Sep/13 ]

Forgot to update this with our results. We haven't had this issue since landing http://review.whamcloud.com/#/c/5419/.

In addition, we haven't noticed any of the possible issues with unkillable threads.

Comment by Christopher Morrone [ 11/Sep/13 ]

LLNL has been carrying the 5419 patch in our tree and running with it in production.

On one of our smaller BG/Q systems I counted the number of hits of the "wait ext to %d timedout, recovery in progress?" message in the console logs and found it hit 1258 times. There is a fair bit of clustering that I didn't spend the time collapsing, so that may be more like 20-100 occurrences since May.

Comment by Peter Jones [ 24/Sep/13 ]

So, a patch just landed to master for this issue. Is that enough to warrant marking the issue as resolved or is something further required?

Comment by Patrick Farrell (Inactive) [ 24/Sep/13 ]

From the Cray perspective, I don't see anything further needed. LLNL might feel differently.

Comment by Christopher Morrone [ 24/Sep/13 ]

Oh, hmm... actually we are running an earlier version of the patch. I have no idea what to make of the one that landed.

Comment by Marek Magrys [ 24/Sep/13 ]

We also hit this bug with 2.4.0 clients and 2.4.1 RC2 servers. I think that, if possible, the patch should be added to the 2.4.2 release, as this is a severe issue.

Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ]

Hi Chris,

What's the version you're running right now?

Jinshan

Comment by Christopher Morrone [ 25/Sep/13 ]

Patch set 2.

Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ]

Patch set 2 has more debug information. The real fix is the same.

Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ]

The reason this patch was made is that the OSC has to wait for the IO RPC to finish anyway, no matter how much time that takes; otherwise it will hit the assertion in discard_cb(). If the OST is in recovery, it may take a really long time for that OST to finish the RPC.

Comment by Christopher Morrone [ 01/Oct/13 ]

But the fix looks different. Patch set 4 changes the LWI_INTR interrupt handler from LWI_ON_SIGNAL_NOOP to NULL, and does nothing else. Patch set 2 did not do that; instead, the only significant change in patch set 2 was the removal of the LASSERT.

I'm missing how those two things are equivalent.

Comment by Jinshan Xiong (Inactive) [ 02/Oct/13 ]

Indeed. Patch set 2 still allows the wait to be interrupted, because it sets the interruption callback to LWI_ON_SIGNAL_NOOP. In that case, it makes sense to upgrade to patch set 4.
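
As an illustration of the difference being debated, here is a hedged reconstruction (not the literal gerrit diff) of how the second, post-timeout wait differs between the two patch sets; the wait queue and the io_done() condition are hypothetical stand-ins:

static int second_wait_sketch(wait_queue_head_t *waitq,
                              int (*io_done)(void *arg), void *arg)
{
        /*
         * Patch set 2 kept this wait interruptible:
         *     struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
         * (LWI_ON_SIGNAL_NOOP means "allow signals, but do no extra work
         * when one arrives"), so a fatal signal could end the wait early
         * and leave pages under writeback behind for discard_cb().
         *
         * Patch set 4, the landed form, passes no signal callback at all,
         * so l_wait_event() does not allow signals and the caller really
         * does wait for the IO RPC to complete:
         */
        struct l_wait_info lwi = LWI_INTR(NULL, NULL);

        return l_wait_event(*waitq, io_done(arg), &lwi);
}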
