LU-10407: osc_cache.c:1141:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.11.0
    • Labels: None
    • Environment: RHEL 7 + master 063a83ab1fe518e52dbc7fb5f6e9d092b20f44e9
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      While testing a patch of my own I hit this panic.
      As far as I can see, it is the result of non-atomic accounting of unstable pages.
      osc_io_commit_async() adds pages to the cache and unlocks the osc object, and the size is only updated shortly afterwards. osc_io_unplug(), called from brw_queue_work(), finds pages in the cache and hits the panic while trying to count the number of unstable pages.
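
      To make the window concrete, here is a minimal userspace sketch of that ordering — plain C with pthreads, not Lustre code. commit_async(), flush_range(), obj_size and queued_pages are illustrative names invented for the example; only the last_oap_count > 0 check mirrors the assertion in osc_extent_make_ready().

      /* Sketch only: a "commit" thread queues a dirty page but updates the
       * object size a little later, while a "flush" thread computes the byte
       * count of the last queued page from the still-stale size and trips an
       * assertion when it comes out as zero. */
      #include <assert.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define SIM_PAGE_SIZE 4096

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static long obj_size;          /* bytes the "osc object" knows about */
      static long queued_pages;      /* dirty pages visible to the flusher */

      static void *commit_async(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&lock);
              queued_pages = 1;              /* page is queued and visible ... */
              pthread_mutex_unlock(&lock);

              usleep(1000);                  /* ... but the size update is delayed */

              pthread_mutex_lock(&lock);
              obj_size = 512;                /* the deferred size update */
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      static void *flush_range(void *arg)
      {
              (void)arg;
              for (;;) {
                      pthread_mutex_lock(&lock);
                      if (queued_pages > 0) {
                              /* bytes used in the last queued page */
                              long last_oap_count =
                                      obj_size - (queued_pages - 1) * SIM_PAGE_SIZE;

                              /* mirrors the osc_extent_make_ready() assertion */
                              assert(last_oap_count > 0);
                              pthread_mutex_unlock(&lock);
                              return NULL;
                      }
                      pthread_mutex_unlock(&lock);
              }
      }

      int main(void)
      {
              pthread_t c, f;

              pthread_create(&f, NULL, flush_range, NULL);
              pthread_create(&c, NULL, commit_async, NULL);
              pthread_join(c, NULL);
              pthread_join(f, NULL);
              printf("flusher happened to run after the size update\n");
              return 0;
      }

      Built with "cc -pthread", this aborts on the assertion almost every run, because the flusher sees the queued page before the size update lands — the same window the reproducer patch below widens on purpose.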

      This panic can easily be reproduced by delaying the size update with the following patch:

      diff --git a/lustre/osc/osc_io.c b/lustre/osc/osc_io.c
      index 3d353324f1..5471b96ec1 100644
      --- a/lustre/osc/osc_io.c
      +++ b/lustre/osc/osc_io.c
      @@ -267,7 +267,7 @@ int osc_io_commit_async(const struct lu_env *env,
       	struct osc_object *osc = cl2osc(ios->cis_obj);
       	struct cl_page  *page;
       	struct cl_page  *last_page;
      -	struct osc_page *opg;
      +	struct osc_page *opg = NULL;
       	int result = 0;
       	ENTRY;
       
      @@ -311,9 +311,6 @@ int osc_io_commit_async(const struct lu_env *env,
       				break;
       		}
       
      -		osc_page_touch_at(env, osc2cl(osc), osc_index(opg),
      -				  page == last_page ? to : PAGE_SIZE);
      -
       		cl_page_list_del(env, qin, page);
       
       		(*cb)(env, io, page);
      @@ -321,6 +318,9 @@ int osc_io_commit_async(const struct lu_env *env,
       		 * complete at any time. */
       	}
       
      +	osc_page_touch_at(env, osc2cl(osc), osc_index(opg),
      +			  page == last_page ? to : PAGE_SIZE);
      +
       	/* for sync write, kernel will wait for this page to be flushed before
       	 * osc_io_end() is called, so release it earlier.
       	 * for mkwrite(), it's known there is no further pages. */
      

      The panic is hit consistently with

      ONLY=42 REFORMAT=yes sh sanity.sh

      typically in test 42e.

      Attachments

        Issue Links

          Activity

            [LU-10407] osc_cache.c:1141:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed:
            bzzz Alex Zhuravlev added a comment - edited

            -https://testing.whamcloud.com/test_sessions/0f2a364a-716a-4eec-9786-3fa81b3143c2-

            Moved to LU-14326, closing this one for 2.13 per previous comment.


            simmonsja James A Simmons added a comment -

            One of the patches that landed during 2.13 fixed this issue. Should we close this?
            green Oleg Drokin added a comment -

            I just hit this in master-next running racer

            [35616.314978] LustreError: 11162:0:(osc_cache.c:1141:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: 
            [35616.322765] LustreError: 11162:0:(osc_cache.c:1141:osc_extent_make_ready()) LBUG
            [35616.326801] Pid: 11162, comm: ldlm_bl_05 3.10.0-7.5-debug #1 SMP Sun Jun 3 13:35:38 EDT 2018
            [35616.329285] Call Trace:
            [35616.330443]  [<ffffffffa01eb7dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [35616.331870]  [<ffffffffa01eb88c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [35616.334288]  [<ffffffffa0d3eda6>] osc_extent_make_ready+0x936/0xe70 [osc]
            [35616.335640]  [<ffffffffa0d45ab3>] osc_cache_writeback_range+0x4f3/0x1260 [osc]
            [35616.337641]  [<ffffffffa0e0bac7>] mdc_lock_flush+0x2e7/0x3f0 [mdc]
            [35616.339070]  [<ffffffffa0e0bfb4>] mdc_ldlm_blocking_ast+0x2f4/0x3f0 [mdc]
            [35616.343073]  [<ffffffffa0b01bd4>] ldlm_cancel_callback+0x84/0x320 [ptlrpc]
            [35616.344357]  [<ffffffffa0b189b0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
            [35616.345719]  [<ffffffffa0b1e7e7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
            [35616.347470]  [<ffffffffa0e0be4a>] mdc_ldlm_blocking_ast+0x18a/0x3f0 [mdc]
            [35616.348765]  [<ffffffffa0b2a67f>] ldlm_handle_bl_callback+0xff/0x530 [ptlrpc]
            [35616.351095]  [<ffffffffa0b2afb1>] ldlm_bl_thread_main+0x501/0x680 [ptlrpc]
            [35616.352495]  [<ffffffff810ae864>] kthread+0xe4/0xf0
            [35616.353757]  [<ffffffff81783777>] ret_from_fork_nospec_end+0x0/0x39
            

            hongchao.zhang Hongchao Zhang added a comment -

            Hi Alex,

            Which patch did you apply when the issue was triggered?

            Delaying the size update causes the issue because an extent can become full and be released for writeout (its state changes to OES_CACHE) during the loop in "osc_io_commit_async". This does not happen if the size is updated right after "osc_page_cache_add", because the current active extent is checked to be non-full before the page is added to it (its state remains OES_ACTIVE).
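
            To illustrate that ordering argument, here is a small single-threaded sketch — not Lustre code and not the actual fix (the fix is the patch below, "LU-10407 osc: update size before queue page"). Only the OES_ACTIVE/OES_CACHE names come from the comment above; touch_size(), queue_page(), flush_if_ready() and EXT_MAX_PAGES are made up for the example.

            #include <assert.h>
            #include <stdio.h>

            #define SIM_PAGE_SIZE 4096
            #define EXT_MAX_PAGES 1      /* extent fills after one page, for brevity */

            enum oes_state { OES_ACTIVE, OES_CACHE };

            struct extent {
                    enum oes_state state;
                    int nr_pages;
            };

            static long obj_size;

            static void touch_size(int page_index, int bytes_in_page)
            {
                    obj_size = (long)page_index * SIM_PAGE_SIZE + bytes_in_page;
            }

            /* Queue one page; release the extent to OES_CACHE once it is full. */
            static void queue_page(struct extent *ext)
            {
                    ext->nr_pages++;
                    if (ext->nr_pages == EXT_MAX_PAGES)
                            ext->state = OES_CACHE;        /* now flushable */
            }

            /* A flusher only picks OES_CACHE extents, as noted above. */
            static void flush_if_ready(const struct extent *ext)
            {
                    long last_oap_count;

                    if (ext->state != OES_CACHE)
                            return;
                    last_oap_count = obj_size - (long)(ext->nr_pages - 1) * SIM_PAGE_SIZE;
                    assert(last_oap_count > 0);            /* osc_extent_make_ready() check */
            }

            int main(void)
            {
                    struct extent ext = { OES_ACTIVE, 0 };

                    /* Size updated before the page is queued: the extent can never be
                     * seen in OES_CACHE with a stale size, so the check always holds.
                     * With the delayed ordering (queue_page() first, touch_size() later),
                     * a flusher running in between would compute last_oap_count == 0. */
                    touch_size(0, 512);
                    queue_page(&ext);
                    flush_if_ready(&ext);

                    printf("size-before-queue ordering survives the flush\n");
                    return 0;
            }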

            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/30848
            Subject: LU-10407 osc: update size before queue page
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 13031e6ee6eccedfb0b108115599d3d69e4577c9
            pjones Peter Jones added a comment -

            Hongchao

            Can you please look into this?

            Peter


            jay Jinshan Xiong (Inactive) added a comment -

            the corresponding osc_extent should be in ACTIVE state, so it shouldn't be picked by the RPC engine.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 10