
[LU-2779] LBUG in discard_cb: !(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))


    Description

      We hit this on Sequoia at shutdown time; I don't see an existing bug open for this crash either:

      2013-02-01 14:35:53.812430 {R4-llnl} [bgqadmin]{5}.3.1: 
      2013-02-01 14:35:53.812843 {R4-llnl} [bgqadmin]{5}.3.1: Broadcast message from root@seqio262-ib0
      2013-02-01 14:35:53.813165 {R4-llnl} [bgqadmin]{5}.3.1: 	(unknown) at 14:35 ...
      2013-02-01 14:35:53.813752 {R4-llnl} [bgqadmin]{5}.3.1: The system is going down for halt NOW!
      2013-02-01 14:35:53.814093 {R4-llnl} [bgqadmin]{5}.2.3: Stopping Common I/O Services: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed: 
      2013-02-01 14:35:53.814429 {R4-llnl} [bgqadmin]{5}.2.3: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) LBUG
      2013-02-01 14:35:53.814746 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
      2013-02-01 14:35:53.815076 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b40] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
      2013-02-01 14:35:53.815397 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392bf0] [8000000000ab0c88] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
      2013-02-01 14:35:53.815717 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab1450] .lbug_with_loc+0x50/0xc0 [libcfs]
      2013-02-01 14:35:53.816042 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
      2013-02-01 14:35:53.816392 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
      2013-02-01 14:35:53.816732 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
      2013-02-01 14:35:53.817047 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
      2013-02-01 14:35:53.817363 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
      2013-02-01 14:35:53.817877 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
      2013-02-01 14:35:53.818248 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
      2013-02-01 14:35:53.818565 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
      2013-02-01 14:35:53.818901 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
      2013-02-01 14:35:53.819221 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
      2013-02-01 14:35:53.819535 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
      2013-02-01 14:35:53.819908 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
      2013-02-01 14:35:53.820387 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
      2013-02-01 14:35:53.820707 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
      2013-02-01 14:35:53.821021 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
      2013-02-01 14:35:53.821337 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
      2013-02-01 14:35:53.821652 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
      2013-02-01 14:35:53.821966 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
      2013-02-01 14:35:53.822282 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
      2013-02-01 14:35:53.822595 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
      2013-02-01 14:35:53.822909 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
      2013-02-01 14:35:53.823222 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
      2013-02-01 14:35:53.823534 {R4-llnl} [bgqadmin]{5}.2.3: Kernel panic - not syncing: LBUG
      2013-02-01 14:35:53.823844 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
      2013-02-01 14:35:53.824153 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b60] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
      2013-02-01 14:35:53.824466 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392c10] [c000000000431ef4] .panic+0x80/0x1ac
      2013-02-01 14:35:53.824776 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab14b0] .lbug_with_loc+0xb0/0xc0 [libcfs]
      2013-02-01 14:35:53.825089 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
      2013-02-01 14:35:53.825401 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
      2013-02-01 14:35:53.825721 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
      2013-02-01 14:35:53.826045 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
      2013-02-01 14:35:53.826358 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
      2013-02-01 14:35:53.826670 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
      2013-02-01 14:35:53.826982 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
      2013-02-01 14:35:53.827295 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
      2013-02-01 14:35:53.827608 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
      2013-02-01 14:35:53.827920 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
      2013-02-01 14:35:53.828232 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
      2013-02-01 14:35:53.828649 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
      2013-02-01 14:35:53.829092 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
      2013-02-01 14:35:53.829361 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
      2013-02-01 14:35:53.829629 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
      2013-02-01 14:35:53.829892 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
      2013-02-01 14:35:53.830155 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
      2013-02-01 14:35:53.830420 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
      2013-02-01 14:35:53.830686 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
      2013-02-01 14:35:53.830952 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
      2013-02-01 14:35:53.831217 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
      2013-02-01 14:35:53.831483 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
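
      For reference, the condition that fails is Lustre's LASSERT/ergo idiom. Below is a minimal reconstruction pieced together from the assertion text above (not the verbatim cl_lock.c source); in Lustre, ergo(a, b) expands to (!(a) || (b)), i.e. "a implies b", which matches the failed condition exactly:

          /* Lustre's implication macro: ergo(a, b) reads "a implies b". */
          #define ergo(a, b) (!(a) || (b))

          /* Reconstructed from the assertion message above: a CPT_CACHEABLE
           * page must not still be under writeback when discard_cb() throws
           * it away.  Hitting the LBUG means lock cancellation raced with
           * writeback that was still in flight for this page. */
          LASSERT(ergo(page->cp_type == CPT_CACHEABLE,
                       !PageWriteback(cl_page_vmpage(env, page))));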
      

Activity


jay Jinshan Xiong (Inactive) added a comment -
Patch set 2 has more debug information. The real fix is the same.

morrone Christopher Morrone (Inactive) added a comment -
Patch set 2.

jay Jinshan Xiong (Inactive) added a comment -
Hi Chris,

What's the version you're running right now?

Jinshan

m.magrys Marek Magrys added a comment -
We also hit this bug on 2.4.0 clients and 2.4.1RC2 servers. I think the patch should be added to the 2.4.2 release if possible, as the bug is severe.

morrone Christopher Morrone (Inactive) added a comment -
Oh, hmm... actually we are running an earlier version of the patch. I have no idea what to make of the one that landed.

paf Patrick Farrell (Inactive) added a comment -
From the Cray perspective, I don't see anything further needed. LLNL might feel differently.

pjones Peter Jones added a comment -
So, a patch just landed to master for this issue. Is that enough to warrant marking the issue as resolved, or is something further required?

morrone Christopher Morrone (Inactive) added a comment -
LLNL has been carrying the 5419 patch in our tree and running with it in production.

On one of our smaller BG/Q systems I counted the occurrences of the "wait ext to %d timedout, recovery in progress?" message in the console logs and found 1258 hits. There is a fair bit of clustering that I didn't spend the time collapsing, so that may be more like 20-100 distinct incidents since May.

paf Patrick Farrell (Inactive) added a comment - edited
Forgot to update this with our results. We haven't had this issue since landing http://review.whamcloud.com/#/c/5419/.

In addition, we haven't noticed any of the possible issues with unkillable threads.

spitzcor Cory Spitz added a comment -
This bug has two patches, #5419 and #6262. One should be abandoned.

cheng_shao Cheng Shao (Inactive) added a comment -
I understand that we have since revamped the osc_lock_flush code path, replacing the page-based approach shown in cl_lock_page_out with an extent-based one. In the old code path we would end up waiting in cl_sync_io_wait, and if the first wait timed out we would enter the second, infinite, uninterruptible wait anyway. That is equivalent to the effect of applying Jinshan's patch above. In other words, the simple fix doesn't make things worse. Therefore, should we move forward and get it landed?
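
To make the comparison concrete, below is a simplified sketch of the two-stage wait pattern Cheng describes; the struct, helper, and field names (sync_anchor, sync_io_wait_sketch, csi_waitq, csi_sync_nr) are stand-ins for illustration, not the actual cl_sync_io_wait() source. The shape is a bounded first wait that only warns on timeout, followed by an unbounded, uninterruptible wait, since the caller cannot safely proceed until every page completes:

    #include <linux/wait.h>
    #include <linux/atomic.h>
    #include <linux/printk.h>

    /* Hypothetical stand-in for the sync-I/O anchor; field names assumed. */
    struct sync_anchor {
            wait_queue_head_t csi_waitq;   /* woken as each page completes */
            atomic_t          csi_sync_nr; /* pages still in flight        */
    };

    static void sync_io_wait_sketch(struct sync_anchor *anchor, long timeout)
    {
            long rc;

            /* Stage 1: bounded wait; rc == 0 means the timeout expired. */
            rc = wait_event_timeout(anchor->csi_waitq,
                                    atomic_read(&anchor->csi_sync_nr) == 0,
                                    timeout);
            if (rc == 0)
                    pr_warn("sync i/o wait timed out, recovery in progress?\n");

            /* Stage 2: fall through to an infinite, uninterruptible wait
             * anyway.  This is why the extra wait added by the patch does
             * not make the worst case any worse than the old behavior. */
            wait_event(anchor->csi_waitq,
                       atomic_read(&anchor->csi_sync_nr) == 0);
    }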

People

  jay Jinshan Xiong (Inactive)
  prakash Prakash Surya (Inactive)
  Votes: 2
  Watchers: 15
