[LU-6696] ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • Lustre 2.5.3, Lustre 2.8.0
    • None
    • 2

    Description

      LustreError: 11-0: hw_nb-OST0016-osc-MDT0000: Communicating with 10.151.26.55@o2ib, operation ost_connect failed with -114.
      LustreError: 6488:0:(llog_cat.c:866:llog_cat_init_and_process()) hw_nb-OST0024-osc-MDT0000: llog_process() with cat_cancel_cb failed: rc = -5
      LustreError: 6580:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5
      LustreError: 6580:0:(osp_sync.c:874:osp_sync_thread()) LBUG
      Pid: 6580, comm: osp-syn-36-0
      
      Call Trace:
       [<ffffffffa05cf895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa05cfe97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa10d9243>] osp_sync_thread+0x753/0x7d0 [osp]
       [<ffffffff81559b9e>] ? thread_return+0x4e/0x770
       [<ffffffffa10d8af0>] ? osp_sync_thread+0x0/0x7d0 [osp]
      
      Entering kdb (current=0xffff8803b5e04080, pid 6580) on processor 3 Oops: (null)
      due to oops @ 0x0
      kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
      [3]kdb> 
      

          Activity

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/15247
            Subject: LU-6696 llog: tool to fix corrupted llog catalog
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 289722425f1195bc2438cd65dd9bcfa0243fcf46

            gerrit Gerrit Updater added a comment

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/15245
            Subject: LU-6696 llog: tool to fix corrupted llog catalog
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 000ed45ddda77930f00cf3d2791e9af703aab95d

            gerrit Gerrit Updater added a comment

            Upload fixed llog catalog file

            tappro Mikhail Pershin added a comment
            bobijam Zhenyu Xu added a comment

            Mikhail, can you upload the llog?


            My manager asks to raise the priority (currently 3) because the production filesystem is not available.

            jaylan Jay Lan (Inactive) added a comment

            Well, the llog is corrupted in some strange way. Meanwhile, I've found that the llog contained 4 records with indices 61, 62, 63 and 64. The llog itself contains only 3 records: 62, 63 and 64. Everything before those records is just garbage. I've fixed the llog manually, so it looks healthy now and contains those three records:

            # lustre/utils/llog_reader cb2000d_9c396a65_fixed 
            Bit 0 of 3 not set
            rec #62 type=1064553b len=64
            rec #63 type=1064553b len=64
            rec #64 type=1064553b len=64
            Header size : 8192
            Time : Fri Nov  7 09:00:21 2008
            Number of records: 3
            Target uuid :  
            -----------------------
            #62 (064)ogen=0 name=0x3bf:1
            #63 (064)ogen=0 name=0x419:1
            #64 (064)ogen=0 name=0x448:1
            

            That might help revive the MDS, with access to at least those plain llogs.

            tappro Mikhail Pershin added a comment
            tappro Mikhail Pershin added a comment (edited)

            Mahmoud, the llog looks empty. Can you gzip it first and upload it again, please?
            That was my browser issue, false alarm. I have the file.


            Attached the llog file to the LU.

            mhanafi Mahmoud Hanafi added a comment

            It looks like the llog has another (or the same) header written at offset 8192. That is wrong, and I'd like to investigate this to understand how it was possible.

            Andreas, I agree, the OSP code is quite aggressive towards possible IO errors.

            tappro Mikhail Pershin added a comment
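
            For readers who want to check a copy of an llog for the stray second header described above, here is a minimal userspace sketch. It is not the tool from the patches in this ticket. It assumes the llog header record type is 0x10645539, that the type is the third little-endian 32-bit field of the record header (byte offset 8), and that the real header occupies the first 8192 bytes, as llog_reader reports. Verify these values against the Lustre source before relying on them. All sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values; check them against the Lustre headers before use. */
#define SKETCH_LLOG_HDR_MAGIC 0x10645539u /* assumed header record type */
#define SKETCH_LLOG_HDR_SIZE  8192u       /* "Header size : 8192" from llog_reader */
#define SKETCH_TYPE_OFFSET    8u          /* assumed offset of the type field */

/* Decode a little-endian 32-bit value regardless of host byte order. */
static uint32_t sketch_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return 1 if the record starting at byte offset 'off' in 'buf' carries
 * the assumed header type, 0 otherwise (including when the buffer is too
 * short to hold the type field at that offset).
 */
static int sketch_header_at(const unsigned char *buf, size_t len, size_t off)
{
    if (off > len || len - off < SKETCH_TYPE_OFFSET + 4)
        return 0;
    return sketch_le32(buf + off + SKETCH_TYPE_OFFSET) == SKETCH_LLOG_HDR_MAGIC;
}
```

            Reading the file into a buffer and calling sketch_header_at(buf, len, SKETCH_LLOG_HDR_SIZE) would flag the case where a header-type record sits where the first plain record should be.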

            Can you post that llog file here, please?

            tappro Mikhail Pershin added a comment

            It should be possible to improve the error handling in this code so that it isn't an LASSERT(), and instead returns an error to the caller. We shouldn't have LASSERT() checks on data that comes from the disk.

            adilger Andreas Dilger added a comment
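
            The suggestion above could look roughly like the following userspace sketch. It is not the actual osp_sync_thread() code or the landed fix. The struct, the function and the SKETCH_PROC_BREAK value are hypothetical stand-ins chosen only to show the pattern: keep assertions for programmer invariants, and turn a failure that originates from on-disk data into a logged error returned to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for LLOG_PROC_BREAK; the value is illustrative. */
#define SKETCH_PROC_BREAK 0x0001

/* Hypothetical per-thread state kept by the sync thread. */
struct sketch_sync_state {
    int stopped;    /* set once the thread has finished */
    int last_error; /* 0, or the negative errno that ended processing */
};

/*
 * Finish llog processing with result 'rc'.
 * Returns 0 when processing ended normally (rc == 0 or a requested break),
 * otherwise logs the failure and hands the negative errno to the caller.
 */
static int sketch_sync_finish(struct sketch_sync_state *st, int rc)
{
    /* A NULL state would be a programming bug, so asserting here is fine. */
    assert(st != NULL);

    st->stopped = 1;
    if (rc == 0 || rc == SKETCH_PROC_BREAK) {
        st->last_error = 0;
        return 0;
    }

    /* rc reflects what was read from disk: report it, do not assert. */
    fprintf(stderr, "llog processing failed: rc = %d\n", rc);
    st->last_error = rc;
    return rc;
}
```

            With this shape, the rc = -5 (-EIO) seen in the trace would be reported and passed up instead of triggering an LBUG, and the caller could decide whether to skip the damaged llog or stop the device.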

            People

              bobijam Zhenyu Xu
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 12
