[LU-1723] osc_page_delete()) ASSERTION(0) failed Created: 08/Aug/12  Updated: 17/Jul/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Wojciech Turek (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 2
Labels: None
Environment:

Client RHEL-6.2 2.6.32-220.23.1.el6.x86_64


Attachments: File dk.log.xz     File lustre-log.1344438003.64082.tar.bz2     File sand-4-52_lustre_debug.log    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 4210

 Description   

LBUG on production cluster while node was running a user job.

Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b5c0[1 ffff8806726e8f48:1263 ^(null)_ffff88106917b500 4 0 1 (null) (null) 0x1]
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b500[1 ffff880770b2c4c8:1263 ^ffff88106917b5c0_(null) 4 0 1 (null) (null) 0x0]
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) vvp-page@ffff8809d88c6be0(1:0:0) vm@ffffea0034f6ce20 c0000000000001 3:0 0 1263 lru
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) lov-page@ffff8809f22e93a8
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) end page@ffff88106917b5c0
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) Trying to teardown failed: -16
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) ASSERTION(0) failed
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) LBUG
Aug 8 16:00:02 sand-4-52 kernel: Pid: 64082, comm: calculate_propa
Aug 8 16:00:02 sand-4-52 kernel:
Aug 8 16:00:02 sand-4-52 kernel: Call Trace:
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446e95>] lbug_with_loc+0x75/0xe0 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0451d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa07f8df6>] osc_page_delete+0x236/0x240 [osc]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0553d1e>] cl_page_delete0+0xce/0x400 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055128e>] ? cl_env_get+0x19e/0x350 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0550bf6>] ? cl_env_peek+0x36/0x110 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055408d>] cl_page_delete+0x3d/0xf0 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0560bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa08e791b>] ll_releasepage+0x10b/0x150 [lustre]
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81168ff0>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811100b0>] try_to_release_page+0x30/0x60
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112c374>] zone_reclaim+0x354/0x410
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81122874>] get_page_from_freelist+0x694/0x820
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8116a728>] ? __mem_cgroup_try_charge+0x78/0x420
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811586ca>] alloc_pages_vma+0x9a/0x150
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81172015>] do_huge_pmd_anonymous_page+0x145/0x370
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8113c79a>] handle_mm_fault+0x25a/0x2b0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81042c29>] __do_page_fault+0x139/0x480
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814ed250>] ? thread_return+0x4e/0x76e
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f2c8e>] do_page_fault+0x3e/0xa0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f0045>] page_fault+0x25/0x30
Aug 8 16:00:03 sand-4-52 kernel:
Aug 8 16:00:03 sand-4-52 kernel: LustreError: dumping log to /tmp/lustre-log.1344438003.64082



 Comments   
Comment by Jinshan Xiong (Inactive) [ 08/Aug/12 ]

Can you please tell me what exactly version you installed on the client node?

Also, this patch may help: "LU-1320 llite: fix a race between readpage and releasepage".

Comment by Wojciech Turek (Inactive) [ 08/Aug/12 ]

The clients are running lustre-2.1.2

lustre-client-2.1.2-2.6.32_220.23.1.el6.x86_64.x86_64
lustre-client-modules-2.1.2-2.6.32_220.23.1.el6.x86_64.x86_64

Lustre-2.1.2 changelog suggests that it already has a fix for LU-1320

Comment by Jinshan Xiong (Inactive) [ 16/Aug/12 ]

Hi Wojciech Turek,

can you please post osc_page_print() function from your source code here? Also which pages have been applied. Thanks in advance.

Comment by Wojciech Turek (Inactive) [ 17/Aug/12 ]
static int osc_page_print(const struct lu_env *env,
                          const struct cl_page_slice *slice,
                          void *cookie, lu_printer_t printer)
{
        struct osc_page       *opg = cl2osc_page(slice);
        struct osc_async_page *oap = &opg->ops_oap;
        struct osc_object     *obj = cl2osc(slice->cpl_obj);
        struct client_obd     *cli = &osc_export(obj)->exp_obd->u.cli;
        struct lov_oinfo      *loi = obj->oo_oinfo;

        return (*printer)(env, cookie, LUSTRE_OSC_NAME"-page@%p: "
                          "1< %#x %d %u %s %s %s > "
                          "2< "LPU64" %u %u %#x %#x | %p %p %p %p %p > "
                          "3< %s %p %d %lu %d > "
                          "4< %d %d %d %lu %s | %s %s %s %s > "
                          "5< %s %s %s %s | %d %s %s | %d %s %s>\n",
                          opg,
                          /* 1 */
                          oap->oap_magic, oap->oap_cmd,
                          oap->oap_interrupted,
                          osc_list(&oap->oap_pending_item),
                          osc_list(&oap->oap_urgent_item),
                          osc_list(&oap->oap_rpc_item),
                          /* 2 */
                          oap->oap_obj_off, oap->oap_page_off, oap->oap_count,
                          oap->oap_async_flags, oap->oap_brw_flags,
                          oap->oap_request,
                          oap->oap_cli, oap->oap_loi, oap->oap_caller_ops,
                          oap->oap_caller_data,
                          /* 3 */
                          osc_list(&opg->ops_inflight),
                          opg->ops_submitter, opg->ops_transfer_pinned,
                          osc_submit_duration(opg), opg->ops_srvlock,
                          /* 4 */
                          cli->cl_r_in_flight, cli->cl_w_in_flight,
                          cli->cl_max_rpcs_in_flight,
                          cli->cl_avail_grant,
                          osc_list(&cli->cl_cache_waiters),
                          osc_list(&cli->cl_loi_ready_list),
                          osc_list(&cli->cl_loi_hp_ready_list),
                          osc_list(&cli->cl_loi_write_list),
                          osc_list(&cli->cl_loi_read_list),
                          /* 5 */
                          osc_list(&loi->loi_ready_item),
                          osc_list(&loi->loi_hp_ready_item),
                          osc_list(&loi->loi_write_item),
                          osc_list(&loi->loi_read_item),
                          loi->loi_read_lop.lop_num_pending,
                          osc_list(&loi->loi_read_lop.lop_pending),
                          osc_list(&loi->loi_read_lop.lop_urgent),
                          loi->loi_write_lop.lop_num_pending,
                          osc_list(&loi->loi_write_lop.lop_pending),
                          osc_list(&loi->loi_write_lop.lop_urgent));
}
Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ]

Hi Wojciech Turek, Thanks. It looks like the code mismatches the output from your first comment.

BTW, you can use:

{code} /* C code */ {code}

to quote source code on jira.

Comment by Wojciech Turek (Inactive) [ 17/Aug/12 ]

Hi Jinshan,

I don't understand why you say it mismatches, can you elaborate please?

Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ]

Maybe the output was truncated, from the output you posted:

Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>

only paragraph 1 <...> and 2 <...> were printed, but I expect to see all 5 paragraphs.

How easy you can reproduce this problem, if possible can you please trigger it and post the output on jira again?

Comment by Oleg Drokin [ 13/Oct/12 ]

I just hit this running racer on a small local fs (That gets ENOSPC easily):

[12126.733752] LustreError: 11222:0:(vvp_io.c:1038:vvp_io_commit_write()) Write page 0 of inode ffff88004431db20 failed -28
[12126.734303] LustreError: 11222:0:(vvp_io.c:1038:vvp_io_commit_write()) Skipped 708 previous similar messages
[12198.367486] LustreError: 26187:0:(osc_cache.c:2343:osc_teardown_async_page()) extent ffff88002223ef38@{[256 -> 402/511], [2|0|-|cache|wi|ffff880088af7eb0], [602112|147|+|-|ffff880032770ed8|256|(null)]} trunc at 256.
[12198.368328] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) page@ffff880011a6def0[2 ffff8800169d7f88:256 ^(null)_ffff8800868dfef0 4 0 1 (null) (null) 0x0]
[12198.369065] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) page@ffff8800868dfef0[1 ffff88004b855f98:256 ^ffff880011a6def0_(null) 4 0 1 (null) (null) 0x0]
[12198.369829] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff88009f775af8(0:0:0) vm@ffffea00011fc5f8 20000000000035 2:0 0 256 lru
[12198.370386] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff880010cfbae8
[12198.370856] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff88008e13ef08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88006f5c8cb0 ffff880088af7eb0 > 3< + ffff880051112540 0 0 0 > 4< 0 0 8 35164160 - | - - + - > 5< - - + - | 0 - | 147 - ->
[12198.371911] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) end page@ffff880011a6def0
[12198.372377] LustreError: 26187:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
[12198.372850] LustreError: 26187:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed: 
[12198.373329] LustreError: 26187:0:(osc_page.c:412:osc_page_delete()) LBUG
[12198.373602] Pid: 26187, comm: ldlm_bl_21
[12198.373826] 
[12198.373827] Call Trace:
[12198.374850]  [<ffffffffa0caa915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[12198.375155]  [<ffffffffa0caaf27>] lbug_with_loc+0x47/0xb0 [libcfs]
[12198.375436]  [<ffffffffa04cec1d>] osc_page_delete+0x29d/0x2a0 [osc]
[12198.375752]  [<ffffffffa11d8e65>] cl_page_delete0+0xd5/0x500 [obdclass]
[12198.376052]  [<ffffffffa11d92d2>] cl_page_delete+0x42/0x120 [obdclass]
[12198.376357]  [<ffffffffa0ac37cd>] ll_invalidatepage+0x8d/0x180 [lustre]
[12198.376692]  [<ffffffffa0ad592f>] vvp_page_discard+0x9f/0x140 [lustre]
[12198.376989]  [<ffffffffa11d6358>] cl_page_invoid+0x78/0x170 [obdclass]
[12198.377307]  [<ffffffffa11d68a8>] ? cl_page_invoke+0x158/0x1c0 [obdclass]
[12198.377607]  [<ffffffffa11d6463>] cl_page_discard+0x13/0x20 [obdclass]
[12198.377903]  [<ffffffffa11e08ae>] check_and_discard_cb+0x13e/0x160 [obdclass]
[12198.378220]  [<ffffffffa11db54c>] cl_page_gang_lookup+0x1fc/0x410 [obdclass]
[12198.378527]  [<ffffffffa11e0770>] ? check_and_discard_cb+0x0/0x160 [obdclass]
[12198.378834]  [<ffffffffa11e0770>] ? check_and_discard_cb+0x0/0x160 [obdclass]
[12198.379144]  [<ffffffffa11de03e>] cl_lock_discard_pages+0x11e/0x1f0 [obdclass]
[12198.379607]  [<ffffffffa11d33d5>] ? cl_env_info+0x15/0x20 [obdclass]
[12198.379888]  [<ffffffffa04cfb80>] osc_lock_flush+0x110/0x200 [osc]
[12198.380167]  [<ffffffffa04cfcc9>] osc_lock_cancel+0x59/0x1a0 [osc]
[12198.380454]  [<ffffffffa11dbda5>] cl_lock_cancel0+0x75/0x160 [obdclass]
[12198.380752]  [<ffffffffa11dca0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[12198.381045]  [<ffffffffa04d19ba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[12198.381368]  [<ffffffffa12dbaf3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[12198.381831]  [<ffffffffa12dbf31>] ldlm_bl_thread_main+0x281/0x3e0 [ptlrpc]
[12198.382119]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[12198.382408]  [<ffffffffa12dbcb0>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[12198.382692]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[12198.382958]  [<ffffffffa12dbcb0>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[12198.383826]  [<ffffffffa12dbcb0>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[12198.384126]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[12198.384378] 
[12198.455840] Kernel panic - not syncing: LBUG
[12198.456115] Pid: 26187, comm: ldlm_bl_21 Not tainted 2.6.32-debug #6
[12198.456423] Call Trace:
[12198.456632]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[12198.456887]  [<ffffffffa0caaf7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[12198.457201]  [<ffffffffa04cec1d>] ? osc_page_delete+0x29d/0x2a0 [osc]
[12198.457516]  [<ffffffffa11d8e65>] ? cl_page_delete0+0xd5/0x500 [obdclass]
[12198.457816]  [<ffffffffa11d92d2>] ? cl_page_delete+0x42/0x120 [obdclass]
[12198.458127]  [<ffffffffa0ac37cd>] ? ll_invalidatepage+0x8d/0x180 [lustre]
[12198.458420]  [<ffffffffa0ad592f>] ? vvp_page_discard+0x9f/0x140 [lustre]
[12198.458718]  [<ffffffffa11d6358>] ? cl_page_invoid+0x78/0x170 [obdclass]
[12198.459014]  [<ffffffffa11d68a8>] ? cl_page_invoke+0x158/0x1c0 [obdclass]
[12198.459328]  [<ffffffffa11d6463>] ? cl_page_discard+0x13/0x20 [obdclass]
[12198.459627]  [<ffffffffa11e08ae>] ? check_and_discard_cb+0x13e/0x160 [obdclass]
[12198.460126]  [<ffffffffa11db54c>] ? cl_page_gang_lookup+0x1fc/0x410 [obdclass]
[12198.460617]  [<ffffffffa11e0770>] ? check_and_discard_cb+0x0/0x160 [obdclass]
[12198.460970]  [<ffffffffa11e0770>] ? check_and_discard_cb+0x0/0x160 [obdclass]
[12198.462162]  [<ffffffffa11de03e>] ? cl_lock_discard_pages+0x11e/0x1f0 [obdclass]
[12198.462646]  [<ffffffffa11d33d5>] ? cl_env_info+0x15/0x20 [obdclass]
[12198.462932]  [<ffffffffa04cfb80>] ? osc_lock_flush+0x110/0x200 [osc]
[12198.463225]  [<ffffffffa04cfcc9>] ? osc_lock_cancel+0x59/0x1a0 [osc]
[12198.463533]  [<ffffffffa11dbda5>] ? cl_lock_cancel0+0x75/0x160 [obdclass]
[12198.463829]  [<ffffffffa11dca0b>] ? cl_lock_cancel+0x13b/0x140 [obdclass]
[12198.464120]  [<ffffffffa04d19ba>] ? osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[12198.464436]  [<ffffffffa12dbaf3>] ? ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[12198.464898]  [<ffffffffa12dbf31>] ? ldlm_bl_thread_main+0x281/0x3e0 [ptlrpc]
[12198.465240]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[12198.465533]  [<ffffffffa12dbcb0>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[12198.465815]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[12198.466084]  [<ffffffffa12dbcb0>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[12198.466393]  [<ffffffffa12dbcb0>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[12198.466699]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

I have vmcore for this occurence

Comment by Jinshan Xiong (Inactive) [ 20/Oct/12 ]

duplicate of LU-2170

Comment by Bruce Korb (Inactive) [ 14/Nov/12 ]

I am a little unclear as to how this was diagnosed as a duplicate of LU-2170.
The assertions are different. This "assert(0)" was initiated because
osc_page_delete() was called while the opg->ops_oap field referenced a
osc_async_page structure with a non-empty oap_rpc_item list. (That caused
osc_teardown_async_page to return EBUSY (16) and osc_page_delete handles
that by asserting that zero is not zero.)

So, I don't see the obvious reason the osc_extent_merge assertion would be
seen as a duplicate of this issue.

Comment by Jinshan Xiong (Inactive) [ 14/Nov/12 ]

In LU-2170, I discovered that a write lock can be matched and enqueued by a glimpse process so that the write lock will be marked with INTENT(this write lock was wrongly converted to glimpse lock) in the enqueue RPC, then the lock was denied by OST, then the client calls cl_lock_delete() to delete this lock and grants cl_lock to make it go through with glimpse. After glimpse process finishes to use this lock, it will be deleted w/o canceling.

Depending on the time when this lock is found by other process, difference symptoms will show up. For example, in osc_extent_merge(), I discovered that one of the extent's oe_lock pointed to that "glimpse write" lock, and another extent had the normal one. For the osc_page_delete() case, that was because one dirty page was added into the "glimpse write" lock, since there is no cancel for this lock, the page still remained in the cache after the lock was deleted. Afterwards, when a read lock was enqueued and canceled, it discovered this page and then hit the assert.

Comment by Bruce Korb (Inactive) [ 14/Nov/12 ]

OK, well we (Cory at Cray) see the same symptom, even after the LU-2170 patch.
So there must be another cause for this.

> Denis Kondratenko added a comment - 06/Nov/12 4:31 AM
> Wally,
> was this fixed by http://jira.whamcloud.com/browse/LU-2170 ?
> [...]
> Wally Wang added a comment - 06/Nov/12 3:19 PM
> No, attached is the SP2 version of the cdump and it is with the LU-2170 applied.

The SP2 version of the cdump was, unfortunately, corrupt, so I've been futzing
with the pre-SP2 version. That earlier dump may well have been caused by the
issue fixed in LU-2170, but the later dump not. (Presuming Wally is correct about
the patch having been applied.)

I can attach the SP2 console log, if you like. But not the "lctl dk" dump,
pending another crash dump.

Comment by Jinshan Xiong (Inactive) [ 14/Nov/12 ]

this is a different issue from what have described by Oleg

Comment by Bruce Korb (Inactive) [ 14/Nov/12 ]

OK. Thank you. It just looks a lot alike.
I will keep reading code.

2012-11-06T14:57:10.872333-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060eedfe58@

{[23 -> 23/255], [2|0|-|cache|wi|ffff88020e18f8c8], [4096|1|+|-|ffff8801f9ed9c18|256| (null)]}

trunc at 23.
2012-11-06T14:57:10.872389-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8bc0[2 ffff88020cefad08:23 ^ (null)_ffff88080e8c8b00 4 0 1 (null) (null) 0x0]
2012-11-06T14:57:10.902445-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8b00[1 ffff88020d8c1f58:23 ^ffff88080e8c8bc0_ (null) 4 0 1 (null) (null) 0x0]
2012-11-06T14:57:10.902494-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff88080e8cf5a0(0:0:0) vm@ffffea001c6a99b8 e00000000000063 7:0 0 23 lru
2012-11-06T14:57:10.902508-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff8808050a7888
2012-11-06T14:57:10.958005-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff88080d78ed20: 1< 0x845fed 258 0 + - > 2< 94208 0 4096 0x0 0x520 | (null) ffff880804350700 ffff88020e18f8c8 > 3< + ffff8801f8f577b0 0 0 0 > 4< 0 0 8 33824768 - | - - + - > 5< - - + - | 0 - | 1 - ->
2012-11-06T14:57:10.958049-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) end page@ffff88080e8c8bc0
2012-11-06T14:57:10.983504-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
2012-11-06T14:57:10.983536-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
2012-11-06T14:57:10.983549-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) LBUG
2012-11-06T14:57:10.983570-06:00 c0-0c1s6n0 Pid: 5270, comm: fsx-linux
2012-11-06T14:57:10.983582-06:00 c0-0c1s6n0 Call Trace:
2012-11-06T14:57:10.983605-06:00 c0-0c1s6n0 [<ffffffff810063b1>] try_stack_unwind+0x161/0x1a0
2012-11-06T14:57:11.009114-06:00 c0-0c1s6n0 [<ffffffff81004bf9>] dump_trace+0x89/0x440
2012-11-06T14:57:11.009138-06:00 c0-0c1s6n0 [<ffffffffa014e887>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2012-11-06T14:57:11.009161-06:00 c0-0c1s6n0 [<ffffffffa014ede7>] lbug_with_loc+0x47/0xc0 [libcfs]
2012-11-06T14:57:11.009173-06:00 c0-0c1s6n0 [<ffffffffa0671d41>] osc_page_delete+0x2d1/0x2e0 [osc]
2012-11-06T14:57:11.034700-06:00 c0-0c1s6n0 [<ffffffffa02b0095>] cl_page_delete0+0xd5/0x4e0 [obdclass]
2012-11-06T14:57:11.034737-06:00 c0-0c1s6n0 [<ffffffffa02b04e2>] cl_page_delete+0x42/0x120 [obdclass]
2012-11-06T14:57:11.034762-06:00 c0-0c1s6n0 [<ffffffffa07f2e2d>] ll_invalidatepage+0x8d/0x170 [lustre]
2012-11-06T14:57:11.034774-06:00 c0-0c1s6n0 [<ffffffffa07ea290>] ll_page_mkwrite+0x7c0/0x840 [lustre]
2012-11-06T14:57:11.034820-06:00 c0-0c1s6n0 [<ffffffff81107cb7>] __do_fault+0xe7/0x570
2012-11-06T14:57:11.034833-06:00 c0-0c1s6n0 [<ffffffff811081e4>] handle_pte_fault+0xa4/0xcd0
2012-11-06T14:57:11.060561-06:00 c0-0c1s6n0 [<ffffffff81108fbe>] handle_mm_fault+0x1ae/0x240
2012-11-06T14:57:11.060588-06:00 c0-0c1s6n0 [<ffffffff81025471>] do_page_fault+0x191/0x410
2012-11-06T14:57:11.060600-06:00 c0-0c1s6n0 [<ffffffff81301b5f>] page_fault+0x1f/0x30
2012-11-06T14:57:11.060647-06:00 c0-0c1s6n0 [<00000000200422b3>] 0x200422b3
2012-11-06T14:57:11.060660-06:00 c0-0c1s6n0 Kernel panic - not syncing: LBUG

Comment by Bruce Korb (Inactive) [ 15/Nov/12 ]

I don't see an obvious way to add attachments, so I'll x-clip and paste some of the log info.
NOTE: I used a script to edit the time into a format compatible with the console log.
This is the "lctl dk" output.

2012-11-14T16-33-50.279302:3591:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1352932423/real 0] req@ffff88041df49c00 x1418650257981646/t0(0) o400->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 224/224 e 0 to 1 dl 1352932430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
2012-11-14T16-33-50.279320:3591:(import.c:167:ptlrpc_set_import_discon()) lustre1-OST0003-osc-ffff88061d741400: Connection to lustre1-OST0003 (at 53@gni) was lost; in progress operations using this service will wait for recovery to complete
2012-11-14T16-33-50.279328:3591:(import.c:177:ptlrpc_set_import_discon()) ffff880815628000 lustre1-OST0003_UUID: changing import state from FULL to DISCONN
2012-11-14T16-33-50.279331:3591:(import.c:415:ptlrpc_fail_import()) lustre1-OST0003_UUID: waking up pinger

2012-11-14T16-34-06.389737:3567:(import.c:1137:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to DISCONN
2012-11-14T16-34-06.389739:3567:(import.c:1183:ptlrpc_connect_interpret()) recovery of lustre1-OST0003_UUID on 53@gni failed (-110 (ETIMEDOUT))
2012-11-14T16-34-47.095297:3564:(gnilnd_conn.c:2152:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 42@gni timed out @ 63s dgram 0xffff880814862288 state GNILND_DGRAM_POSTED conn 0xffff880815013400

2012-11-14T16-37-35.377578:3567:(import.c:815:ptlrpc_connect_interpret()) lustre1-MDT0000-mdc-ffff88061d741400: connect to target with instance 4
2012-11-14T16-37-35.377583:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-MDT0000-mdc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88041dd58000 x1418650257981665/t0(0) o38->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 400/264 e 0 to 0 dl 1352932681 ref 1 fl Interpret:RN/0/0 rc 0/0
2012-11-14T16-37-35.377589:3567:(import.c:968:ptlrpc_connect_interpret()) ffff8808153cb000 lustre1-MDT0000_UUID: changing import state from CONNECTING to EVICTED

2012-11-14T16-37-42.798033:3567:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 120/128
2012-11-14T16-37-42.798035:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-OST0003-osc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88081df80c00 x1418650257981674/t0(0) o8->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 400/264 e 0 to 0 dl 1352932683 ref 1 fl Interpret:RN/0/0 rc 0/0
2012-11-14T16-37-42.798040:3567:(import.c:968:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to EVICTED
2012-11-14T16-37-42.798043:3567:(import.c:1324:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre1-OST0003; in progress operations using this service will fail.
2012-11-14T16-37-42.798044:3567:(import.c:1328:ptlrpc_import_recovery_state_machine()) evicted from lustre1-OST0003_UUID@53@gni; invalidating

The connection recovers, according to logs:

2012-11-14T16-37-42.798209:3867:(import.c:1394:ptlrpc_import_recovery_state_machine()) ffff8808155b6000 lustre1-OST0001_UUID: changing import state from RECOVER to FULL
2012-11-14T16-37-42.798211:3867:(lov_obd.c:431:lov_set_osc_active()) Marking OSC lustre1-OST0001_UUID active
2012-11-14T16-37-42.798213:3867:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 144/128
2012-11-14T16-37-42.798216:3867:(import.c:1402:ptlrpc_import_recovery_state_machine()) lustre1-OST0001-osc-ffff88061d741400: Connection restored to lustre1-OST0001 (at 53@gni)

Then about 400,000 copies of this:

2012-11-14T16-41-29.357283:3941:(mdc_request.c:836:mdc_close()) @@@ matched open req@ffff88081542e800 x1418650258369942/t17185315789(17185315789) o101->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 560/600 e 0 to 0 dl 1352932906 ref 2 fl Complete:RP/4/0 rc 301/301

followed 0.2 seconds with the LBUG:

2012-11-14T17-22-00.230715:2496:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060ba76480@

Unknown macro: {[13 -> 13/255], [2|0|-|cache|wi|ffff8802054a9648], [4096|1|+|-|ffff8802172f9318|256|(null)]}

trunc at 13.
2012-11-14T17-22-00.247868:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcbcc0[2 ffff8802164aa948:13 ^(null)_ffff880403fcb840 4 0 1 (null) (null) 0x1]
2012-11-14T17-22-00.260922:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcb840[1 ffff880207a317d8:13 ^ffff880403fcbcc0_(null) 4 0 1 (null) (null) 0x0]
2012-11-14T17-22-00.273976:2496:(osc_page.c:411:osc_page_delete()) vvp-page@ffff880401be3320(1:1:0) vm@ffffea000e052f28 6000000000000eb 9:0 0 13 lru
2012-11-14T17-22-00.285984:2496:(osc_page.c:411:osc_page_delete()) lov-page@ffff8803ea750798
2012-11-14T17-22-00.293128:2496:(osc_page.c:411:osc_page_delete()) osc-page@ffff880415ce6660: 1< 0x845fed 258 0 + - > 2< 53248 0 4096 0x0 0x520 | (null) ffff8808155c65f8 ffff8802054a9648 > 3< + ffff8801f01cb040 0 0 0 > 4< 0 0 8 34373632 - | - - + - > 5< - - + - | 0 - | 16 - ->
2012-11-14T17-22-00.316352:2496:(osc_page.c:411:osc_page_delete()) end page@ffff880403fcbcc0
2012-11-14T17-22-00.323489:2496:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
2012-11-14T17-22-00.331062:2496:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
2012-11-14T17-22-00.338026:2496:(osc_page.c:412:osc_page_delete()) LBUG
2012-11-14T17-22-00.412653:2496:(linux-debug.c:143:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/usr/src/packages/BUILD/cray-lustre-2.3/lustre/osc/osc_page.c,osc_page_delete,412

Comment by Bruce Korb (Inactive) [ 30/Nov/12 ]

Here is an xz compressed "dk" log from a crash dump that is an example of the issue. The two processes 617 and 619 are at issue. There is a timing hole of some sort. I will append Shadow's and my analysis shortly.

Comment by Bruce Korb (Inactive) [ 30/Nov/12 ]

With a crash dump with debug tracing turned on, we get a lot more useful information.

Alexey Lyashkov added a comment - 30/Nov/12 12:27 AM

In fact that is bug in new osc extent code landed into 2.3 :/
both code path
1) ldlm lock cancel -> page discard ....
and
2) mkwrite /race/ -> invalidate page

forget to find a extent and set correct state for it.  looks osc
extent author forget about forced page flush from a page cache with
mkwrite race and client eviction.

for first case that is more or less clear - cl_lock_discard_pages
foget to find an extent and set truncate state, because use a old API
and directly kill a pages. that is may be fixed via replace
cl_lock_discard_pages [with a] osc_extent_truncate_start or something
similar.

but for second case it's not clear for me. [It] looks [like] we need
[to] completely change code near of mkwrite and page invalidation.

[Otherwise] (may be dirty hack) - we change osc_page_delete to ignore
error in osc_teardown_async_page for some 'force' cases or (better)
correctly abort an IO request for that page if page discard need (page
may be in readahead as was in
790798-osc_page_delete-assert0-sp1-1120-fulldbg.tar.bz2).
Comment by Bruce Korb (Inactive) [ 07/Dec/12 ]

More comments on Xyratex bug. Given the difficulty/complexity of the change, I would happily accept guidance here. (i.e. wait for you-all or suggestions on how to proceed)

Iurii Golovach added a comment - 06/Dec/12 4:34 AM

During implementation of extents on OSC level, two cases fell between cracks:

  • evict;
  • invalidate page.
    Like a result extent flags for these cases weren't set and we have a race.

Since adding these cases is quite complex task we recommend to do next:
1) revert changes from next patches:
http://review.whamcloud.com/#change,2270
http://review.whamcloud.com/#change,2009
http://review.whamcloud.com/#change,2460
2) wait until Jinshan finish his work on LU-1723 and then use changes with correct extents support.

Please let us know if you need patch against 2.3 to reverse these patches.

Cory Spitz added a comment - 06/Dec/12 2:42 PM

Hi, that's a pretty big change. How many bug fixes would we loose? I'm not sure that the tradeoff for this bug is worth it (yet). Can you guys help w/LU-1723 or are you just waiting for him?

Comment by Bruce Korb (Inactive) [ 19/Dec/12 ]

In today's Xyratex/Cray conference call, it was decided to go forward with backing out the LU-1030 patches from our release branch, pending further work on them.

Comment by Cory Spitz [ 19/Dec/12 ]

I think that this should be a 2.4 release blocker.

Comment by Oleg Drokin [ 23/Dec/12 ]

Hm, I think I just hit this issue on my burn-in cluster as well:

[56349.656614] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800805c6f08[3 ffff88000bf11f80:256 ^(null)_ffff880014be0f08 0 0 1 (null) (null) 0x0]
[56349.657354] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880014be0f08[2 ffff88002a8ecf98:256 ^ffff8800805c6f08_(null) 0 0 1 (null) (null) 0x0]
[56349.658145] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65eb60(0:0:0) vm@ffffea00021b5a58 2000000000083c 2:0 ffff8800805c6f08 256 lru
[56349.658890] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510ee00
[56349.659360] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880062170f08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.660491] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800805c6f08
[56349.660996] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.661284] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880014be1f08[3 ffff88000bf11f80:257 ^(null)_ffff88001333ef08 0 0 1 (null) (null) 0x0]
[56349.662042] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88001333ef08[2 ffff88002a8ecf98:257 ^ffff880014be1f08_(null) 0 0 1 (null) (null) 0x0]
[56349.662802] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ebc8(0:0:0) vm@ffffea0001cef2d0 2000000000083c 2:0 ffff880014be1f08 257 lru
[56349.663548] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510e620
[56349.664016] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88002b782f08: 1< 0x845fed 258 0 + - > 2< 1052672 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.665336] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880014be1f08
[56349.665805] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.666679] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880093a26f08[3 ffff88000bf11f80:258 ^(null)_ffff88003c596f08 0 0 1 (null) (null) 0x0]
[56349.667419] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88003c596f08[2 ffff88002a8ecf98:258 ^ffff880093a26f08_(null) 0 0 1 (null) (null) 0x0]
[56349.668166] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ec30(0:0:0) vm@ffffea0002285560 2000000000083c 2:0 ffff880093a26f08 258 lru
[56349.668917] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510ece0
[56349.669389] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88005bf32f08: 1< 0x845fed 258 0 + - > 2< 1056768 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.671402] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880093a26f08
[56349.671883] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.672186] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800304a3f08[3 ffff88000bf11f80:259 ^(null)_ffff880032388f08 0 0 1 (null) (null) 0x0]
[56349.672953] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880032388f08[2 ffff88002a8ecf98:259 ^ffff8800304a3f08_(null) 0 0 1 (null) (null) 0x0]
[56349.673667] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ec98(0:0:0) vm@ffffea0001d40e80 2000000000083c 2:0 ffff8800304a3f08 259 lru
[56349.674403] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff8800740957d0
[56349.674898] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff8800922cff08: 1< 0x845fed 258 0 + - > 2< 1060864 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.676000] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800304a3f08
[56349.676500] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.676872] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800304a2f08[3 ffff88000bf11f80:260 ^(null)_ffff8800639baf08 0 0 1 (null) (null) 0x0]
[56349.677605] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800639baf08[2 ffff88002a8ecf98:260 ^ffff8800304a2f08_(null) 0 0 1 (null) (null) 0x0]
[56349.678374] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ed00(0:0:0) vm@ffffea0002114048 2000000000083c 2:0 ffff8800304a2f08 260 lru
[56349.679110] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095c98
[56349.679560] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880081cf0f08: 1< 0x845fed 258 0 + - > 2< 1064960 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.680665] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800304a2f08
[56349.681134] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.681428] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800639bbf08[3 ffff88000bf11f80:261 ^(null)_ffff8800223a9f08 0 0 1 (null) (null) 0x0]
[56349.682160] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800223a9f08[2 ffff88002a8ecf98:261 ^ffff8800639bbf08_(null) 0 0 1 (null) (null) 0x0]
[56349.682893] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ed68(0:0:0) vm@ffffea000218f5a0 2000000000083c 2:0 ffff8800639bbf08 261 lru
[56349.683625] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095ce0
[56349.684092] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88001309df08: 1< 0x845fed 258 0 + - > 2< 1069056 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.685177] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800639bbf08
[56349.685623] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.685921] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800223a8f08[3 ffff88000bf11f80:262 ^(null)_ffff8800248a3f08 0 0 1 (null) (null) 0x0]
[56349.686682] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800248a3f08[2 ffff88002a8ecf98:262 ^ffff8800223a8f08_(null) 0 0 1 (null) (null) 0x0]
[56349.687427] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0ee0d0(0:0:0) vm@ffffea0001f0b178 2000000000083c 2:0 ffff8800223a8f08 262 lru
[56349.688184] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095ae8
[56349.688665] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88008e200f08: 1< 0x845fed 258 0 + - > 2< 1073152 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.689748] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800223a8f08
[56349.690214] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.690501] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880075c2ef08[3 ffff88000bf11f80:263 ^(null)_ffff880033ee2f08 0 0 1 (null) (null) 0x0]
[56349.691237] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880033ee2f08[2 ffff88002a8ecf98:263 ^ffff880075c2ef08_(null) 0 0 1 (null) (null) 0x0]
[56349.691967] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0ee4e0(0:0:0) vm@ffffea00020fec90 2000000000083c 2:0 ffff880075c2ef08 263 lru
[56349.692706] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095aa0
[56349.693172] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880093ce0f08: 1< 0x845fed 258 0 + - > 2< 1077248 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.694248] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880075c2ef08
[56349.694717] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.695010] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880033ee3f08[3 ffff88000bf11f80:264 ^(null)_ffff880061765f08 0 0 1 (null) (null) 0x0]
[56349.696293] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880061765f08[2 ffff88002a8ecf98:264 ^ffff880033ee3f08_(null) 0 0 1 (null) (null) 0x0]
[56349.697945] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0eea90(0:0:0) vm@ffffea0001f481f8 2000000000083c 2:0 ffff880033ee3f08 264 lru
[56349.698697] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095740
[56349.699176] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff8800794ccf08: 1< 0x845fed 258 0 + - > 2< 1081344 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.700246] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880033ee3f08
[56349.700732] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.701026] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800393f6f08[3 ffff88000bf11f80:265 ^(null)_ffff88008b6e9f08 0 0 1 (null) (null) 0x0]
[56349.701758] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88008b6e9f08[2 ffff88002a8ecf98:265 ^ffff8800393f6f08_(null) 0 0 1 (null) (null) 0x0]
[56349.702492] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0eec30(0:0:0) vm@ffffea000277cc78 2000000000083c 2:0 ffff8800393f6f08 265 lru
[56349.703850] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095110
[56349.704321] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88006b098f08: 1< 0x845fed 258 0 + - > 2< 1085440 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.705396] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800393f6f08
[56349.706003] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp.
[56349.758028] LustreError: 20323:0:(osc_cache.c:2370:osc_teardown_async_page()) extent ffff88006e1f1f38@{[256 -> 506/511], [2|0|-|cache|wi|ffff880078590eb0], [1028096|251|+|-|ffff88002b60ced8|256|(null)]} trunc at 256.
[56349.759172] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) page@ffff8800805c6f08[2 ffff88000bf11f80:256 ^(null)_ffff880014be0f08 4 0 1 (null) (null) 0x0]
[56349.759981] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) page@ffff880014be0f08[2 ffff88002a8ecf98:256 ^ffff8800805c6f08_(null) 4 0 1 (null) (null) 0x0]
[56349.760931] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) vvp-page@ffff88006e65eb60(0:0:0) vm@ffffea00021b5a58 20000000000035 3:0 0 256 lru
[56349.761537] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) lov-page@ffff88002510ee00
[56349.762131] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) osc-page@ffff880062170f08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 0 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - ->
[56349.763222] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) end page@ffff8800805c6f08
[56349.763780] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) Trying to teardown failed: -16
[56349.764587] LustreError: 20323:0:(osc_page.c:432:osc_page_delete()) ASSERTION( 0 ) failed: 
[56349.765113] LustreError: 20323:0:(osc_page.c:432:osc_page_delete()) LBUG
[56349.765406] Pid: 20323, comm: cp
[56349.765648] 
[56349.765648] Call Trace:
[56349.766258]  [<ffffffffa0405915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[56349.766719]  [<ffffffffa0405f27>] lbug_with_loc+0x47/0xb0 [libcfs]
[56349.767070]  [<ffffffffa09bd5dd>] osc_page_delete+0x32d/0x330 [osc]
[56349.767381]  [<ffffffffa05d6b15>] cl_page_delete0+0xc5/0x510 [obdclass]
[56349.767713]  [<ffffffffa05d6fa2>] cl_page_delete+0x42/0x120 [obdclass]
[56349.768192]  [<ffffffffa0e4025d>] ll_invalidatepage+0x8d/0x180 [lustre]
[56349.768623]  [<ffffffff81129ee8>] do_invalidatepage+0x28/0x30
[56349.768950]  [<ffffffff8112a102>] truncate_inode_page+0xa2/0xc0
[56349.769216]  [<ffffffff8112a42a>] truncate_inode_pages_range+0x15a/0x460
[56349.769515]  [<ffffffff8112a745>] truncate_inode_pages+0x15/0x20
[56349.769877]  [<ffffffff8112a797>] truncate_pagecache+0x47/0x70
[56349.770297]  [<ffffffff8112a7d9>] truncate_setsize+0x19/0x20
[56349.770657]  [<ffffffff8112a81e>] vmtruncate+0x3e/0x70
[56349.771010]  [<ffffffffa0e53a53>] vvp_io_setattr_end+0x63/0x80 [lustre]
[56349.771349]  [<ffffffffa05dec60>] cl_io_end+0x60/0x150 [obdclass]
[56349.771769]  [<ffffffffa05e3d82>] cl_io_loop+0xc2/0x1b0 [obdclass]
[56349.772084]  [<ffffffffa0e4c448>] cl_setattr_ost+0x208/0x2d0 [lustre]
[56349.772389]  [<ffffffffa0e1c122>] ll_setattr_raw+0x792/0xfb0 [lustre]
[56349.772706]  [<ffffffffa0e1c99b>] ll_setattr+0x5b/0xf0 [lustre]
[56349.772987]  [<ffffffff81197688>] notify_change+0x168/0x340
[56349.773252]  [<ffffffff81179af4>] do_truncate+0x64/0xa0
[56349.773530]  [<ffffffff8118c309>] do_filp_open+0x829/0xd90
[56349.773854]  [<ffffffff8118827b>] ? getname+0x3b/0x250
[56349.774277]  [<ffffffff81044f4e>] ? kernel_map_pages+0xfe/0x110
[56349.774708]  [<ffffffff8119861b>] ? alloc_fd+0xab/0x160
[56349.775166]  [<ffffffff81178839>] do_sys_open+0x69/0x140
[56349.775569]  [<ffffffff81178950>] sys_open+0x20/0x30
[56349.775968]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[56349.776371] 
[56349.777741] Kernel panic - not syncing: LBUG

vmcore and modules are in /exports/crashdumps/192.168.10.218-2012-12-23-08:15:29

Comment by Jinshan Xiong (Inactive) [ 26/Dec/12 ]

Hi Oleg, please file a new bug for this because it seems like a new issue.

From the stacktrace and crashdump, an extent was not truncated by osc_cache_truncate_start() somehow.

Comment by Bruce Korb (Inactive) [ 02/Jan/13 ]

Since this was filed against 2.1.2 and the problem we are trying to deal with only occurs on the 2.3 source base, we have opened a new bug report: LU-2557. It shows exactly the symptom Oleg displayed above, which is the same as my comment from November:

http://jira.whamcloud.com/browse/LU-1723?focusedCommentId=47889&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-47889

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.

Generated at Sat Feb 10 01:19:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.