
lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      With the landing of LU-8157 ("swap layout tests"), I hit this assertion for the first time in a long while, in sanity test 405:

      [183262.685161] Lustre: DEBUG MARKER: centos6-9.localnet: == sanity test 405: Various layout swap lock tests =================================================== 07:46:13 (1465904773)
      [183275.828024] LustreError: 8225:0:(lov_io.c:238:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed: 
      [183275.829228] LustreError: 8225:0:(lov_io.c:238:lov_sub_get()) LBUG
      [183275.829934] Pid: 8225, comm: swap_lock_test
      [183275.830665] 
      Call Trace:
      [183275.831909]  [<ffffffffa01a97b3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [183275.832577]  [<ffffffffa01a9d55>] lbug_with_loc+0x45/0xc0 [libcfs]
      [183275.833165]  [<ffffffffa08b1e75>] lov_sub_get+0x4e5/0x650 [lov]
      [183275.833738]  [<ffffffffa08b492d>] lov_sublock_env_get.isra.4+0xbd/0x100 [lov]
      [183275.835312]  [<ffffffffa08b5392>] lov_lock_sub_init+0x2c2/0x9f0 [lov]
      [183275.835905]  [<ffffffffa08b5af7>] lov_lock_init_raid0+0x37/0xf0 [lov]
      [183275.836493]  [<ffffffffa08c172f>] lov_lock_init+0x1f/0x60 [lov]
      [183275.837086]  [<ffffffffa0349a6f>] cl_lock_init+0x8f/0x190 [obdclass]
      [183275.837711]  [<ffffffffa034bcd8>] ? cl_io_init0.isra.15+0x88/0x160 [obdclass]
      [183275.838778]  [<ffffffffa0349bb5>] cl_lock_request+0x45/0x1f0 [obdclass]
      [183275.839389]  [<ffffffffa0f29f79>] cl_get_grouplock+0x189/0x310 [lustre]
      [183275.839977]  [<ffffffffa0ee0a69>] ll_get_grouplock+0x179/0x530 [lustre]
      [183275.840599]  [<ffffffffa0eefb8d>] ll_file_ioctl+0x372d/0x38f0 [lustre]
      [183275.841183]  [<ffffffff81202775>] do_vfs_ioctl+0x305/0x520
      [183275.841748]  [<ffffffff810b0c71>] ? finish_task_switch+0x81/0x180
      [183275.842316]  [<ffffffff810b0c34>] ? finish_task_switch+0x44/0x180
      [183275.842888]  [<ffffffff81202a31>] SyS_ioctl+0xa1/0xc0
      [183275.843525]  [<ffffffff81711809>] system_call_fastpath+0x16/0x1b
      [183275.844102] 
      [183275.845572] Kernel panic - not syncing: LBUG
      

      Crashdump and modules are in /exports/crash/192.168.10.219-2016-06-14-07:46:34
      tag in my tree: master-20160614

          Activity

            [LU-8273] lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed
            pjones Peter Jones added a comment -

            As this is rare and suspected to be fixed, I will mark it as a duplicate of LU-2766 until evidence proves otherwise.

            green Oleg Drokin added a comment -

            OK, I landed LU-2766, and we'll see if this ever repeats.

            jay Jinshan Xiong (Inactive) added a comment - - edited

            I can't think of a reason why a group lock requires ci_ignore_layout, though it may be there to avoid a deadlock. Can you please check the git history for a related commit and, if there is none, try clearing ci_ignore_layout and see how it goes?

            Actually, this is a reproduction of LU-2766, and the patch is located at: http://review.whamcloud.com/#/c/6828/11

            bobijam Zhenyu Xu added a comment - - edited

            Hi Jinshan,

            cl_get_grouplock() issues a layout-ignoring IO (io->ci_ignore_layout = 1), so during IO initialization,
            lov_io_init()->LOV_2DISPATCH_MAYLOCK(..., llo_io_init, !io->ci_ignore_layout, ...) does not take the lov->lo_type_guard semaphore. The dump shows that the file object was an empty one at that time. I think a racing thread was changing the file's layout from LLT_EMPTY to LLT_RAID0 at the same moment; because lov_io_init() does not take the lo_type_guard semaphore, the IO ended up as a lov_empty_io while the lov_object is a raid0 object.
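The race described above can be replayed in a small userspace model. This is only a sketch under stated assumptions, not the Lustre code: every `model_*` name is hypothetical, the guard is reduced to a reader counter standing in for lo_type_guard, and lock initialization is simplified to a walk over the object's current stripes that applies the same bound check as the failed assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: all names here are hypothetical, not Lustre's. */
enum model_layout { MODEL_EMPTY, MODEL_RAID0 };

struct model_object {
	enum model_layout type;    /* plays the role of lo_type */
	unsigned stripe_count;     /* stripes in the current layout */
	int guard_readers;         /* plays the role of lo_type_guard */
};

struct model_io {
	enum model_layout type_at_init; /* layout seen when the io was set up */
	unsigned stripe_count;          /* plays the role of lis_stripe_count */
	bool holds_guard;
};

/* IO init: the guard is taken only when the IO does not ignore the layout. */
static void model_io_init(struct model_io *io, struct model_object *obj,
			  bool ignore_layout)
{
	io->holds_guard = !ignore_layout;
	if (io->holds_guard)
		obj->guard_readers++;
	io->type_at_init = obj->type;
	io->stripe_count = obj->type == MODEL_RAID0 ? obj->stripe_count : 0;
}

/* IO teardown: drop the guard if this IO took it. */
static void model_io_fini(struct model_io *io, struct model_object *obj)
{
	if (io->holds_guard) {
		assert(obj->guard_readers > 0);
		obj->guard_readers--;
	}
}

/* Layout change: refused while any IO holds the guard. */
static bool model_set_raid0(struct model_object *obj, unsigned stripes)
{
	if (obj->guard_readers > 0)
		return false;
	obj->type = MODEL_RAID0;
	obj->stripe_count = stripes;
	return true;
}

/* Lock init: every stripe of the object must fit the io's stripe bound. */
static bool model_lock_init_ok(const struct model_io *io,
			       const struct model_object *obj)
{
	for (unsigned stripe = 0; stripe < obj->stripe_count; stripe++)
		if (!(stripe < io->stripe_count))
			return false;
	return true;
}

/* Replays the interleaving: io init, layout change attempt, lock init. */
static bool model_replay(bool ignore_layout)
{
	struct model_object obj = { MODEL_EMPTY, 0, 0 };
	struct model_io io;
	bool ok;

	model_io_init(&io, &obj, ignore_layout);
	model_set_raid0(&obj, 2);   /* the racing layout change */
	ok = model_lock_init_ok(&io, &obj);
	model_io_fini(&io, &obj);
	return ok;
}
```

In this model, `model_replay(true)` returns false because the layout changes after the io was set up with a stripe count of 0, while `model_replay(false)` returns true because the guard blocks the layout change until the io is finished.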

            bobijam Zhenyu Xu added a comment -

            Somehow the io is an lov_empty_io, while the lov_object is a raid0 object.

            crash> struct lov_io ffff880021f77e68
            struct lov_io {
              lis_cl = {
                cis_io = 0xffff8800aae81eb8, 
                cis_obj = 0xffff880070e71e58, 
                cis_iop = 0xffffffffa08d07a0 <lov_empty_io_ops>, 
                cis_linkage = {
                  next = 0xffff8800aae81ed0, 
                  prev = 0xffff8800438a6f20
                }
              }, 
              lis_object = 0xffff880070e71e58, 
              lis_io_endpos = 0, 
              lis_pos = 0, 
              lis_endpos = 0, 
              lis_mem_frozen = 0, 
              lis_stripe_count = 0, 
              lis_active_subios = 0, 
              lis_single_subio_index = 0, 
              lis_single_subio = {
                ci_type = CIT_READ, 
                ci_state = CIS_ZERO, 
                ci_obj = 0x0, 
                ci_parent = 0x0, 
                ci_layers = {
                  next = 0x0, 
                  prev = 0x0
                }, 
                ci_lockset = {
                  cls_todo = {
                    next = 0x0, 
                    prev = 0x0
                  }, 
                  cls_done = {
                    next = 0x0, 
                    prev = 0x0
                  }
                }, 
                ci_lockreq = CILR_MANDATORY, 
                u = {
                  ci_rd = {
                    rd = {
                      crw_pos = 0, 
                      crw_count = 0, 
                      crw_nonblock = 0
                    }
                  }, 
                  ci_wr = {
                    wr = {
                      crw_pos = 0, 
                      crw_count = 0, 
                      crw_nonblock = 0
                    }, 
                    wr_append = 0, 
                    wr_sync = 0
                  }, 
                  ci_rw = {
                    crw_pos = 0, 
                    crw_count = 0, 
                    crw_nonblock = 0
                  }, 
                  ci_setattr = {
                    sa_attr = {
                      lvb_size = 0, 
                      lvb_mtime = 0, 
                      lvb_atime = 0, 
                      lvb_ctime = 0, 
                      lvb_blocks = 0, 
                      lvb_mtime_ns = 0, 
                      lvb_atime_ns = 0, 
                      lvb_ctime_ns = 0, 
                      lvb_padding = 0
                    }, 
                    sa_attr_flags = 0, 
                    sa_valid = 0, 
                    sa_stripe_index = 0, 
                    sa_parent_fid = 0x0
                  }, 
                  ci_data_version = {
                    dv_data_version = 0, 
                    dv_flags = 0
                  }, 
                  ci_fault = {
                    ft_index = 0, 
                    ft_nob = 0, 
                    ft_writable = 0, 
                    ft_executable = 0, 
                    ft_mkwrite = 0, 
                    ft_page = 0x0
                  }, 
                  ci_fsync = {
                    fi_start = 0, 
                    fi_end = 0, 
                    fi_fid = 0x0, 
                    fi_mode = CL_FSYNC_NONE, 
                    fi_nr_written = 0
                  }, 
                  ci_ladvise = {
                    li_start = 0, 
                    li_end = 0, 
                    li_fid = 0x0, 
                    li_advice = LU_LADVISE_INVALID, 
                    li_flags = 0
                  }
                }, 
                ci_queue = {
                 ...
                }, 
                ci_nob = 0, 
                ci_result = 0, 
                ci_continue = 0, 
                ci_no_srvlock = 0, 
                ci_need_restart = 0, 
                ci_ignore_layout = 0, 
                ci_verify_layout = 0, 
                ci_restore_needed = 0, 
                ci_noatime = 0, 
                ci_owned_nr = 0
              }, 
              lis_nr_subios = 0, 
              lis_subs = 0x0, 
              lis_active = {
                next = 0x0, 
                prev = 0x0
              }
            }
            
            crash> struct lov_object 0xffff880070e71e58
            struct lov_object {
              lo_cl = {
                co_lu = {
                  lo_header = 0xffff88001e6e3f08, 
                  lo_dev = 0xffff880015a28f00, 
                  lo_ops = 0xffffffffa08d1320 <lov_lu_obj_ops>, 
                  lo_linkage = {
                    next = 0xffff88001e6e3f48, 
                    prev = 0xffff88001e6e3fb8
                  }, 
                  lo_dev_ref = {<No data fields>}
                }, 
                co_ops = 0xffffffffa08d1360 <lov_ops>, 
                co_slice_off = 144
              }, 
            ...
              lo_type = LLT_RAID0, 
              lo_layout_invalid = false, 
              lo_active_ios = {
                counter = 1
              }, 
            ...
              lo_lsm = 0xffff880089d9b1c0, 
            ...
            }
            
            green Oleg Drokin added a comment -

            It's my private node.
            Email me your SSH public key and I'll send you the instructions for access.

            bobijam Zhenyu Xu added a comment -

            Hi Oleg,

            On which node is /exports/crash/ located?

            pjones Peter Jones added a comment -

            Bobijam

            This seems like a rare issue to hit, but can you see how to address it?

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: green Oleg Drokin