
[LU-3190] Interop 2.3.0<->2.4 Failed on lustre-rsync-test test 3b: ASSERTION( lio->lis_lsm != ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Environment: server: lustre-master tag-2.3.64 build #1411
      client: 2.3.0
    • Severity: 3
    • Rank (Obsolete): 7783

    Description

      Hit the following LBUG when running lustre-rsync-test test 3b. In tag-2.3.62, the same test passed:

      Lustre: DEBUG MARKER: == lustre-rsync-test test 3b: Replicate files created by writemany == 17:57:47 (1366246667)
      LustreError: 6661:0:(lmv_obd.c:850:lmv_iocontrol()) error: iocontrol MDC lustre-MDT0000_UUID on MDTidx 0 cmd c0086696: err = -2
      LustreError: 6661:0:(lmv_obd.c:850:lmv_iocontrol()) Skipped 1415 previous similar messages
      Lustre: DEBUG MARKER: == lustre-rsync-test test 3c: Replicate files created by createmany/unlinkmany == 17:59:17 (1366246757)
      Lustre: DEBUG MARKER: == lustre-rsync-test test 4: Replicate files created by iozone == 17:59:33 (1366246773)
      LustreError: 7489:0:(lcommon_cl.c:1210:cl_file_inode_init()) Failure to initialize cl object [0x20001d0f0:0x340d:0x0]: -95
      LustreError: 7489:0:(lcommon_cl.c:1210:cl_file_inode_init()) Failure to initialize cl object [0x20001d0f0:0x340d:0x0]: -95
      LustreError: 7489:0:(lov_io.c:311:lov_io_slice_init()) ASSERTION( lio->lis_lsm != ((void *)0) ) failed:
      LustreError: 7489:0:(lov_io.c:311:lov_io_slice_init()) LBUG
      Pid: 7489, comm: lustre_rsync
      Call Trace:
       [<ffffffffa0996905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0996f17>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa06b5088>] lov_io_init_raid0+0x6d8/0x810 [lov]
       [<ffffffffa06ac037>] lov_io_init+0x97/0x160 [lov]
       [<ffffffffa0dd1578>] cl_io_init0+0x98/0x160 [obdclass]
       [<ffffffffa0dd4464>] cl_io_init+0x64/0x100 [obdclass]
       [<ffffffffa07e6fed>] cl_glimpse_size0+0x7d/0x190 [lustre]
       [<ffffffffa07a3f32>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
       [<ffffffffa07a4049>] ll_getattr_it+0x49/0x170 [lustre]
       [<ffffffffa07a41a7>] ll_getattr+0x37/0x40 [lustre]
       [<ffffffff81214343>] ? security_inode_getattr+0x23/0x30
       [<ffffffff81180571>] vfs_getattr+0x51/0x80
       [<ffffffffa09a2088>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffff8118082f>] vfs_fstat+0x3f/0x60
       [<ffffffff81180874>] sys_newfstat+0x24/0x40
       [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
       [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

      Kernel panic - not syncing: LBUG
      Pid: 7489, comm: lustre_rsync Not tainted 2.6.32-279.5.1.el6.x86_64 #1
      Call Trace:
       [<ffffffff814fd24a>] ? panic+0xa0/0x168
       [<ffffffffa0996f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa06b5088>] ? lov_io_init_raid0+0x6d8/0x810 [lov]
       [<ffffffffa06ac037>] ? lov_io_init+0x97/0x160 [lov]
       [<ffffffffa0dd1578>] ? cl_io_init0+0x98/0x160 [obdclass]
       [<ffffffffa0dd4464>] ? cl_io_init+0x64/0x100 [obdclass]
       [<ffffffffa07e6fed>] ? cl_glimpse_size0+0x7d/0x190 [lustre]
       [<ffffffffa07a3f32>] ? ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
       [<ffffffffa07a4049>] ? ll_getattr_it+0x49/0x170 [lustre]
       [<ffffffffa07a41a7>] ? ll_getattr+0x37/0x40 [lustre]
       [<ffffffff81214343>] ? security_inode_getattr+0x23/0x30
       [<ffffffff81180571>] ? vfs_getattr+0x51/0x80
       [<ffffffffa09a2088>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffff8118082f>] ? vfs_fstat+0x3f/0x60
       [<ffffffff81180874>] ? sys_newfstat+0x24/0x40
       [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
       [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
      Initializing cgroup subsys cpuset
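
      For readers skimming the trace, here is a minimal sketch of the failure mode, using hypothetical stand-in types rather than the actual lov source: cl_file_inode_init() fails with -95 (-EOPNOTSUPP), leaving the client object without a striping descriptor, and the later stat()-driven glimpse reaches the IO slice setup with a NULL lsm and trips the LASSERT instead of returning an error to the caller.

      /* Illustrative sketch only -- the types and function bodies are
       * hypothetical stand-ins for the Lustre lov code, not the source. */
      #include <assert.h>
      #include <stddef.h>

      struct lov_stripe_md;                        /* opaque layout descriptor */
      struct lov_object { struct lov_stripe_md *lo_lsm; };
      struct lov_io     { struct lov_stripe_md *lis_lsm; };

      #define LASSERT(cond) assert(cond)           /* stand-in for libcfs LASSERT */

      static void io_slice_init_sketch(struct lov_io *lio, struct lov_object *obj)
      {
              lio->lis_lsm = obj->lo_lsm;          /* NULL when layout init failed */
              LASSERT(lio->lis_lsm != NULL);       /* the LBUG in this report */
      }

      int main(void)
      {
              /* Layout init failed earlier (-95), so no stripe MD is set. */
              struct lov_object obj = { .lo_lsm = NULL };
              struct lov_io io;

              io_slice_init_sketch(&io, &obj);     /* aborts, mirroring the LBUG */
              return 0;
      }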
      

    Attachments

    Issue Links

    Activity


            adilger Andreas Dilger added a comment -

            Fixed for 2.4.0, won't fix for 2.3.0 anymore.

            adilger Andreas Dilger added a comment -

            With the http://review.whamcloud.com/6252 patch landed to return -ENOENT for old FIDs (or is that -ESTALE?), does lustre-rsync-test still crash on a 2.3 client? Is change 6167 even needed for b2_3 anymore?

            jlevi Jodi Levi (Inactive) added a comment -

            Lowering priority, as the patch landed on master.
            But http://review.whamcloud.com/#change,6167 still needs to land.
            sarah Sarah Liu added a comment -

            Patch set 7 didn't hit the LBUG, but still failed test_7 as in LU-3279.
            di.wang Di Wang added a comment -

            Hmm, we probably should not put OSD_OII_NOGEN into the oi_cache, which makes us unable to compare the generation at all. I just updated the patch, please have a look.

            yong.fan nasf (Inactive) added a comment -

            In fact, the per-thread "FID => ino#/gen" mapping should already be safe: if someone removes the inode, causing the OI mapping to be deleted from the OI file, the related "inode::i_nlink" will be zero; and if the inode is reused by another object, its "inode::i_generation" will change. So other threads' cached old "FID => ino/gen" mappings become invalid automatically, because nobody can use the old "ino/gen" to find the inode. One special case: if the cached generation is "OSD_OII_NOGEN", then we need to verify against the inode's LMA. A sketch of such a generation-validated lookup follows.
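
            The names below (osd_oi_cache_entry, oic_gen, inode_generation) and the OSD_OII_NOGEN value are illustrative assumptions, not the actual osd-ldiskfs symbols:

            /* Hypothetical sketch of a generation-validated "FID => ino/gen"
             * cache entry check; all names here are illustrative. */
            #include <stdbool.h>
            #include <stdint.h>

            #define OSD_OII_NOGEN 0u        /* assumed "generation unknown" marker */

            struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

            struct osd_oi_cache_entry {
                    struct lu_fid oic_fid;  /* the cached FID */
                    uint64_t      oic_ino;  /* inode number at insertion time */
                    uint32_t      oic_gen;  /* inode generation at insertion time */
            };

            /* Stub: would return the current generation of the on-disk inode. */
            extern uint32_t inode_generation(uint64_t ino);

            /* A cached mapping is trusted only if the inode still carries the
             * same generation; a reused inode changes i_generation, so stale
             * entries fail this check automatically.  An entry inserted with
             * OSD_OII_NOGEN has nothing to compare against and must fall back
             * to verifying the FID stored in the inode's LMA -- which is why
             * caching NOGEN entries defeats the generation comparison. */
            static bool oi_cache_entry_valid(const struct osd_oi_cache_entry *oic)
            {
                    if (oic->oic_gen == OSD_OII_NOGEN)
                            return false;   /* force LMA verification instead */
                    return inode_generation(oic->oic_ino) == oic->oic_gen;
            }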
            di.wang Di Wang added a comment -

            Sigh, removing the oi_cache entry only from the current thread info seems not to be enough; we should remove this oi from the caches of all thread infos (see the sketch below).
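
            A sketch of that cross-thread purge, with a hypothetical registry list and locking rather than the actual osd_thread_info layout:

            /* Hypothetical sketch: purge a stale FID from every thread's
             * oi_cache, not just the current thread's, so no thread can
             * resolve a reused ino/gen through a stale mapping. */
            #include <pthread.h>
            #include <stdint.h>
            #include <string.h>

            #define OI_CACHE_SIZE 4

            struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

            struct osd_thread_info {
                    struct osd_thread_info *oti_next;   /* registry link */
                    struct lu_fid           oti_cache[OI_CACHE_SIZE];
            };

            static struct osd_thread_info *oti_list;    /* all live thread infos */
            static pthread_mutex_t oti_list_lock = PTHREAD_MUTEX_INITIALIZER;

            /* Called on unlink: walk every registered thread info and clear
             * any cache slot that still maps the removed FID. */
            static void oi_cache_invalidate_all(const struct lu_fid *fid)
            {
                    struct osd_thread_info *oti;
                    int i;

                    pthread_mutex_lock(&oti_list_lock);
                    for (oti = oti_list; oti != NULL; oti = oti->oti_next)
                            for (i = 0; i < OI_CACHE_SIZE; i++)
                                    if (memcmp(&oti->oti_cache[i], fid,
                                               sizeof(*fid)) == 0)
                                            memset(&oti->oti_cache[i], 0,
                                                   sizeof(*fid));
                    pthread_mutex_unlock(&oti_list_lock);
            }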
            sarah Sarah Liu added a comment -

            After running patch set 5 three times, I cannot reproduce this issue (previously we would usually hit this bug before sub test_5). All runs failed on sub test_7; please refer to LU-3279.

            https://maloo.whamcloud.com/test_sets/5584e96e-b5de-11e2-9d08-52540035b04c
            di.wang Di Wang added a comment -

            Yes, I totally agree we should not return LinkEA for the object once it is removed from the namespace.


            adilger Andreas Dilger added a comment -

            Jinshan,
            I don't agree with your comment that it is OK to have two inodes with the same pathname at the same time. That isn't possible in the namespace, just as you cannot have two files in the same directory with the same filename at the same time. I'm not objecting to two inodes that had the same name at different times (which seems to be the case here), but only the new one should resolve to the pathname with fid2path(). The old file can return any other pathnames that it still has, or return an ENOENT error if it is unlinked.

            Di,
            it still seems to make sense to not return any pathnames in the case of an open-unlinked inode. In this case, even if the oti_cache was pointing to the unlinked inode, there shouldn't have been any entries in the "link" xattr to return, so either the inode didn't get written to disk after it was unlinked, or the "link" entries are not being written for unlinked inodes. I still think that case needs to be handled properly, and checking the "DEAD" and "ORPHAN" flags is probably the right way to go, as sketched below.
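
            As an illustration of that flag check, here is a minimal sketch; the flag names and object struct are placeholders, not the actual MDT symbols:

            /* Hypothetical sketch: fid2path-style code refusing to return
             * pathnames for an open-unlinked (dead/orphan) object. */
            #include <errno.h>

            #define OBJ_FLAG_DEAD   0x1     /* assumed: unlinked but still open */
            #define OBJ_FLAG_ORPHAN 0x2     /* assumed: kept only as an orphan  */

            struct mdt_object_sketch { unsigned int mof_flags; };

            static int fid2path_check_alive(const struct mdt_object_sketch *obj)
            {
                    /* An unlinked inode has no valid name in the namespace, so
                     * any surviving "link" xattr entries must not be returned. */
                    if (obj->mof_flags & (OBJ_FLAG_DEAD | OBJ_FLAG_ORPHAN))
                            return -ENOENT;
                    return 0;
            }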

            People

      Assignee: bobijam Zhenyu Xu
      Reporter: sarah Sarah Liu
      Votes: 0
      Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: