
osc_page_delete()) ASSERTION(0) failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.2
    • None
    • Client RHEL-6.2 2.6.32-220.23.1.el6.x86_64
    • 3
    • 4210

    Description

      LBUG on production cluster while node was running a user job.

      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b5c0[1 ffff8806726e8f48:1263 ^(null)_ffff88106917b500 4 0 1 (null) (null) 0x1]
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b500[1 ffff880770b2c4c8:1263 ^ffff88106917b5c0_(null) 4 0 1 (null) (null) 0x0]
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) vvp-page@ffff8809d88c6be0(1:0:0) vm@ffffea0034f6ce20 c0000000000001 3:0 0 1263 lru
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) lov-page@ffff8809f22e93a8
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) end page@ffff88106917b5c0
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) Trying to teardown failed: -16
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) ASSERTION(0) failed
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) LBUG
      Aug 8 16:00:02 sand-4-52 kernel: Pid: 64082, comm: calculate_propa
      Aug 8 16:00:02 sand-4-52 kernel:
      Aug 8 16:00:02 sand-4-52 kernel: Call Trace:
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0451d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa07f8df6>] osc_page_delete+0x236/0x240 [osc]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0553d1e>] cl_page_delete0+0xce/0x400 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055128e>] ? cl_env_get+0x19e/0x350 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0550bf6>] ? cl_env_peek+0x36/0x110 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055408d>] cl_page_delete+0x3d/0xf0 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0560bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa08e791b>] ll_releasepage+0x10b/0x150 [lustre]
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81168ff0>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811100b0>] try_to_release_page+0x30/0x60
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112c374>] zone_reclaim+0x354/0x410
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81122874>] get_page_from_freelist+0x694/0x820
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8116a728>] ? __mem_cgroup_try_charge+0x78/0x420
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811586ca>] alloc_pages_vma+0x9a/0x150
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81172015>] do_huge_pmd_anonymous_page+0x145/0x370
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8113c79a>] handle_mm_fault+0x25a/0x2b0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81042c29>] __do_page_fault+0x139/0x480
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814ed250>] ? thread_return+0x4e/0x76e
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f2c8e>] do_page_fault+0x3e/0xa0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f0045>] page_fault+0x25/0x30
      Aug 8 16:00:03 sand-4-52 kernel:
      Aug 8 16:00:03 sand-4-52 kernel: LustreError: dumping log to /tmp/lustre-log.1344438003.64082
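
The "Trying to teardown failed: -16" line in the log above carries a negative errno-style return code, and on Linux errno 16 is EBUSY. As a reading aid only, here is a minimal sketch that maps such a code to a symbolic name. The helper names `rc_name` and `rc_describe` are hypothetical and are not part of Lustre.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Lustre): map a negative kernel-style
 * return code to a symbolic name for a handful of common errno values.
 */
const char *rc_name(int rc)
{
        switch (-rc) {
        case EBUSY:  return "EBUSY";
        case ENOMEM: return "ENOMEM";
        case EIO:    return "EIO";
        case EINTR:  return "EINTR";
        default:     return "unknown";
        }
}

/*
 * Write a one-line description of rc into buf, using the C library's
 * strerror() for the human-readable text. Returns what snprintf() returns.
 */
int rc_describe(int rc, char *buf, size_t len)
{
        return snprintf(buf, len, "rc=%d (%s): %s",
                        rc, rc_name(rc), strerror(-rc));
}
```

Calling `rc_name(-16)` on a Linux system yields "EBUSY", which is consistent with the page still being busy when the teardown was attempted.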

        Activity

          [LU-1723] osc_page_delete()) ASSERTION(0) failed

          Maybe the output was truncated. From the output you posted:

          Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>

          only paragraphs 1 <...> and 2 <...> were printed, but I expected to see all 5 paragraphs.

          How easily can you reproduce this problem? If possible, can you please trigger it again and post the output on Jira?

          jay Jinshan Xiong (Inactive) added a comment
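
The check described in the comment above (how many of the numbered "N< ... >" paragraphs made it into the osc-page line) can be sketched in a few lines of C. This is only an illustration: the marker format is taken from the posted log line, and the function name `osc_page_groups_seen` is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count how many of the numbered groups "1< ... >" through "5< ... >"
 * appear, in order, in one osc-page debug line. Hypothetical helper for
 * log triage; it only looks for the " N< " markers and does not validate
 * the fields inside each group.
 */
int osc_page_groups_seen(const char *line)
{
        const char *pos = line;
        int n;

        for (n = 1; n <= 5; n++) {
                char marker[8];
                const char *hit;

                snprintf(marker, sizeof(marker), " %d< ", n);
                hit = strstr(pos, marker);
                if (hit == NULL)
                        break;
                /* Continue searching after this marker so order is enforced. */
                pos = hit + strlen(marker);
        }
        return n - 1;
}
```

Applied to the osc-page line quoted above, the function returns 2, matching the observation that only paragraphs 1 and 2 are present.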

          Hi Jinshan,

          I don't understand why you say it mismatches. Can you elaborate, please?

          wjt27 Wojciech Turek added a comment
          jay Jinshan Xiong (Inactive) added a comment - - edited

          Hi Wojciech Turek, thanks. It looks like the code does not match the output from your first comment.

          BTW, you can use:

          {code} /* C code */ {code}

          to quote source code on Jira.

          wjt27 Wojciech Turek added a comment - - edited
          static int osc_page_print(const struct lu_env *env,
                                    const struct cl_page_slice *slice,
                                    void *cookie, lu_printer_t printer)
          {
                  struct osc_page       *opg = cl2osc_page(slice);
                  struct osc_async_page *oap = &opg->ops_oap;
                  struct osc_object     *obj = cl2osc(slice->cpl_obj);
                  struct client_obd     *cli = &osc_export(obj)->exp_obd->u.cli;
                  struct lov_oinfo      *loi = obj->oo_oinfo;
          
                  return (*printer)(env, cookie, LUSTRE_OSC_NAME"-page@%p: "
                                    "1< %#x %d %u %s %s %s > "
                                    "2< "LPU64" %u %u %#x %#x | %p %p %p %p %p > "
                                    "3< %s %p %d %lu %d > "
                                    "4< %d %d %d %lu %s | %s %s %s %s > "
                                    "5< %s %s %s %s | %d %s %s | %d %s %s>\n",
                                    opg,
                                    /* 1 */
                                    oap->oap_magic, oap->oap_cmd,
                                    oap->oap_interrupted,
                                    osc_list(&oap->oap_pending_item),
                                    osc_list(&oap->oap_urgent_item),
                                    osc_list(&oap->oap_rpc_item),
                                    /* 2 */
                                    oap->oap_obj_off, oap->oap_page_off, oap->oap_count,
                                    oap->oap_async_flags, oap->oap_brw_flags,
                                    oap->oap_request,
                                    oap->oap_cli, oap->oap_loi, oap->oap_caller_ops,
                                    oap->oap_caller_data,
                                    /* 3 */
                                    osc_list(&opg->ops_inflight),
                                    opg->ops_submitter, opg->ops_transfer_pinned,
                                    osc_submit_duration(opg), opg->ops_srvlock,
                                    /* 4 */
                                    cli->cl_r_in_flight, cli->cl_w_in_flight,
                                    cli->cl_max_rpcs_in_flight,
                                    cli->cl_avail_grant,
                                    osc_list(&cli->cl_cache_waiters),
                                    osc_list(&cli->cl_loi_ready_list),
                                    osc_list(&cli->cl_loi_hp_ready_list),
                                    osc_list(&cli->cl_loi_write_list),
                                    osc_list(&cli->cl_loi_read_list),
                                    /* 5 */
                                    osc_list(&loi->loi_ready_item),
                                    osc_list(&loi->loi_hp_ready_item),
                                    osc_list(&loi->loi_write_item),
                                    osc_list(&loi->loi_read_item),
                                    loi->loi_read_lop.lop_num_pending,
                                    osc_list(&loi->loi_read_lop.lop_pending),
                                    osc_list(&loi->loi_read_lop.lop_urgent),
                                    loi->loi_write_lop.lop_num_pending,
                                    osc_list(&loi->loi_write_lop.lop_pending),
                                    osc_list(&loi->loi_write_lop.lop_urgent));
          }
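
The format string in osc_page_print() above emits five numbered groups, while the osc-page line in the description stops partway through group 2. One general way a formatted line can be cut short like that is a bounded output buffer. The sketch below shows how snprintf() reports such truncation through its return value. Whether the Lustre debug path in this release imposes a limit of this kind is an assumption that this ticket does not establish, and the function name and placeholder values are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format five numbered groups into a caller-supplied buffer and report
 * whether the result was cut short. Returns 1 if the output was
 * truncated, 0 if it fit, and -1 on an encoding error. The group
 * contents ("a" .. "e") are placeholders, not real osc-page fields.
 */
int format_groups(char *buf, size_t len)
{
        int need = snprintf(buf, len,
                            "1< %s > 2< %s > 3< %s > 4< %s > 5< %s>",
                            "a", "b", "c", "d", "e");

        if (need < 0)
                return -1;
        /* snprintf() returns the length it wanted to write, excluding NUL. */
        return (size_t)need >= len;
}
```

With a 16-byte buffer the call reports truncation and the stored text ends inside group 3; with a 64-byte buffer the whole line fits.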
          
          jay Jinshan Xiong (Inactive) added a comment - - edited

          Hi Wojciech Turek,

          Can you please post the osc_page_print() function from your source code here? Also, which patches have been applied? Thanks in advance.


          The clients are running lustre-2.1.2

          lustre-client-2.1.2-2.6.32_220.23.1.el6.x86_64.x86_64
          lustre-client-modules-2.1.2-2.6.32_220.23.1.el6.x86_64.x86_64

          Lustre-2.1.2 changelog suggests that it already has a fix for LU-1320

          wjt27 Wojciech Turek added a comment

          Can you please tell me exactly which version you installed on the client node?

          Also, this patch may help: "LU-1320 llite: fix a race between readpage and releasepage".

          jay Jinshan Xiong (Inactive) added a comment

          People

            jay Jinshan Xiong (Inactive)
            wjt27 Wojciech Turek
