[LU-8273] lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed Created: 14/Jun/16  Updated: 24/Jul/16  Resolved: 24/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-2766 lov_object.c:635:lov_layout_change())... Resolved
Related
is related to LU-2766 lov_object.c:635:lov_layout_change())... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With landing of LU-8157 "swap layout tests", I hit this assertion for the first time in a long while, in sanity test 405:

[183262.685161] Lustre: DEBUG MARKER: centos6-9.localnet: == sanity test 405: Various layout swap lock tests =================================================== 07:46:13 (1465904773)
[183275.828024] LustreError: 8225:0:(lov_io.c:238:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed: 
[183275.829228] LustreError: 8225:0:(lov_io.c:238:lov_sub_get()) LBUG
[183275.829934] Pid: 8225, comm: swap_lock_test
[183275.830665] 
Call Trace:
[183275.831909]  [<ffffffffa01a97b3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[183275.832577]  [<ffffffffa01a9d55>] lbug_with_loc+0x45/0xc0 [libcfs]
[183275.833165]  [<ffffffffa08b1e75>] lov_sub_get+0x4e5/0x650 [lov]
[183275.833738]  [<ffffffffa08b492d>] lov_sublock_env_get.isra.4+0xbd/0x100 [lov]
[183275.835312]  [<ffffffffa08b5392>] lov_lock_sub_init+0x2c2/0x9f0 [lov]
[183275.835905]  [<ffffffffa08b5af7>] lov_lock_init_raid0+0x37/0xf0 [lov]
[183275.836493]  [<ffffffffa08c172f>] lov_lock_init+0x1f/0x60 [lov]
[183275.837086]  [<ffffffffa0349a6f>] cl_lock_init+0x8f/0x190 [obdclass]
[183275.837711]  [<ffffffffa034bcd8>] ? cl_io_init0.isra.15+0x88/0x160 [obdclass]
[183275.838778]  [<ffffffffa0349bb5>] cl_lock_request+0x45/0x1f0 [obdclass]
[183275.839389]  [<ffffffffa0f29f79>] cl_get_grouplock+0x189/0x310 [lustre]
[183275.839977]  [<ffffffffa0ee0a69>] ll_get_grouplock+0x179/0x530 [lustre]
[183275.840599]  [<ffffffffa0eefb8d>] ll_file_ioctl+0x372d/0x38f0 [lustre]
[183275.841183]  [<ffffffff81202775>] do_vfs_ioctl+0x305/0x520
[183275.841748]  [<ffffffff810b0c71>] ? finish_task_switch+0x81/0x180
[183275.842316]  [<ffffffff810b0c34>] ? finish_task_switch+0x44/0x180
[183275.842888]  [<ffffffff81202a31>] SyS_ioctl+0xa1/0xc0
[183275.843525]  [<ffffffff81711809>] system_call_fastpath+0x16/0x1b
[183275.844102] 
[183275.845572] Kernel panic - not syncing: LBUG

Crashdump and modules are in /exports/crash/192.168.10.219-2016-06-14-07:46:34
tag in my tree: master-20160614



 Comments   
Comment by Peter Jones [ 14/Jun/16 ]

Bobijam

This seems like a rare issue to hit, but are you able to see how to address it?

Peter

Comment by Zhenyu Xu [ 15/Jun/16 ]

Hi Oleg,

On which node is /exports/crash/ located?

Comment by Oleg Drokin [ 15/Jun/16 ]

It's my private node.
Email me your ssh public key and I'll send you the instructions for access.

Comment by Zhenyu Xu [ 15/Jun/16 ]

Somehow the io is an lov_empty_io, while the lov_object is a raid0 object.

crash> struct lov_io ffff880021f77e68
struct lov_io {
  lis_cl = {
    cis_io = 0xffff8800aae81eb8, 
    cis_obj = 0xffff880070e71e58, 
    cis_iop = 0xffffffffa08d07a0 <lov_empty_io_ops>, 
    cis_linkage = {
      next = 0xffff8800aae81ed0, 
      prev = 0xffff8800438a6f20
    }
  }, 
  lis_object = 0xffff880070e71e58, 
  lis_io_endpos = 0, 
  lis_pos = 0, 
  lis_endpos = 0, 
  lis_mem_frozen = 0, 
  lis_stripe_count = 0, 
  lis_active_subios = 0, 
  lis_single_subio_index = 0, 
  lis_single_subio = {
    ci_type = CIT_READ, 
    ci_state = CIS_ZERO, 
    ci_obj = 0x0, 
    ci_parent = 0x0, 
    ci_layers = {
      next = 0x0, 
      prev = 0x0
    }, 
    ci_lockset = {
      cls_todo = {
        next = 0x0, 
        prev = 0x0
      }, 
      cls_done = {
        next = 0x0, 
        prev = 0x0
      }
    }, 
    ci_lockreq = CILR_MANDATORY, 
    u = {
      ci_rd = {
        rd = {
          crw_pos = 0, 
          crw_count = 0, 
          crw_nonblock = 0
        }
      }, 
      ci_wr = {
        wr = {
          crw_pos = 0, 
          crw_count = 0, 
          crw_nonblock = 0
        }, 
        wr_append = 0, 
        wr_sync = 0
      }, 
      ci_rw = {
        crw_pos = 0, 
        crw_count = 0, 
        crw_nonblock = 0
      }, 
      ci_setattr = {
        sa_attr = {
          lvb_size = 0, 
          lvb_mtime = 0, 
          lvb_atime = 0, 
          lvb_ctime = 0, 
          lvb_blocks = 0, 
          lvb_mtime_ns = 0, 
          lvb_atime_ns = 0, 
          lvb_ctime_ns = 0, 
          lvb_padding = 0
        }, 
        sa_attr_flags = 0, 
        sa_valid = 0, 
        sa_stripe_index = 0, 
        sa_parent_fid = 0x0
      }, 
      ci_data_version = {
        dv_data_version = 0, 
        dv_flags = 0
      }, 
      ci_fault = {
        ft_index = 0, 
        ft_nob = 0, 
        ft_writable = 0, 
        ft_executable = 0, 
        ft_mkwrite = 0, 
        ft_page = 0x0
      }, 
      ci_fsync = {
        fi_start = 0, 
        fi_end = 0, 
        fi_fid = 0x0, 
        fi_mode = CL_FSYNC_NONE, 
        fi_nr_written = 0
      }, 
      ci_ladvise = {
        li_start = 0, 
        li_end = 0, 
        li_fid = 0x0, 
        li_advice = LU_LADVISE_INVALID, 
        li_flags = 0
      }
    }, 
    ci_queue = {
     ...
    }, 
    ci_nob = 0, 
    ci_result = 0, 
    ci_continue = 0, 
    ci_no_srvlock = 0, 
    ci_need_restart = 0, 
    ci_ignore_layout = 0, 
    ci_verify_layout = 0, 
    ci_restore_needed = 0, 
    ci_noatime = 0, 
    ci_owned_nr = 0
  }, 
  lis_nr_subios = 0, 
  lis_subs = 0x0, 
  lis_active = {
    next = 0x0, 
    prev = 0x0
  }
}
crash> struct lov_object 0xffff880070e71e58
struct lov_object {
  lo_cl = {
    co_lu = {
      lo_header = 0xffff88001e6e3f08, 
      lo_dev = 0xffff880015a28f00, 
      lo_ops = 0xffffffffa08d1320 <lov_lu_obj_ops>, 
      lo_linkage = {
        next = 0xffff88001e6e3f48, 
        prev = 0xffff88001e6e3fb8
      }, 
      lo_dev_ref = {<No data fields>}
    }, 
    co_ops = 0xffffffffa08d1360 <lov_ops>, 
    co_slice_off = 144
  }, 
...
  lo_type = LLT_RAID0, 
  lo_layout_invalid = false, 
  lo_active_ios = {
    counter = 1
  }, 
...
  lo_lsm = 0xffff880089d9b1c0, 
...
}
Comment by Zhenyu Xu [ 15/Jun/16 ]

Hi Jinshan,

cl_get_grouplock() is a layout ignorance IO (io->ci_ignore_layout = 1), and in IO initialization
lov_io_init()->LOV_2DISPATCH_MAYLOCK(..., llo_io_init, !io->ci_ignore_layout, ...), it does not take lov->lo_type_guard semaphore, and the dump shows that at this time, the file object is an empty one; I think at the same time, there is a race thread which is changing the file's layout from LLT_EMPTY to LLT_RAID0, and since the lov_io_init() does not takes the lo_type_guard semaphore, which makes the IO a lov_empty_io while the lov_object is a raid0 object.

Comment by Jinshan Xiong (Inactive) [ 15/Jun/16 ]

I can't think of a reason why group lock requires ci_ignore_layout, but it could be there to avoid a deadlock. Can you please check the git history to see if there is a related commit? If not, just try clearing ci_ignore_layout and see how it goes.

Actually this is a reproduction of LU-2766, and the patch is located at: http://review.whamcloud.com/#/c/6828/11

Comment by Oleg Drokin [ 16/Jun/16 ]

OK, I landed LU-2766 and we'll see if this ever repeats.

Comment by Peter Jones [ 24/Jul/16 ]

As this is rare and suspected to be fixed, I will mark it as a duplicate of LU-2766 until evidence arises that proves otherwise.

Generated at Sat Feb 10 02:16:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.