Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6696

ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • Lustre 2.5.3, Lustre 2.8.0
    • None
    • 2
    • 9223372036854775807

    Description

      LustreError: 11-0: hw_nb-OST0016-osc-MDT0000: Communicating with 10.151.26.55@o2ib, operation ost_connect failed with -114.
      LustreError: 6488:0:(llog_cat.c:866:llog_cat_init_and_process()) hw_nb-OST0024-osc-MDT0000: llog_process() with cat_cancel_cb failed: rc = -5
      LustreError: 6580:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5
      LustreError: 6580:0:(osp_sync.c:874:osp_sync_thread()) LBUG
      Pid: 6580, comm: osp-syn-36-0
      
      Call Trace:
       [<ffffffffa05cf895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa05cfe97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa10d9243>] osp_sync_thread+0x753/0x7d0 [osp]
       [<ffffffff81559b9e>] ? thread_return+0x4e/0x770
       [<ffffffffa10d8af0>] ? osp_sync_thread+0x0/0x7d0 [osp]
      
      Entering kdb (current=0xffff8803b5e04080, pid 6580) on processor 3 Oops: (null)
      due to oops @ 0x0
      kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
      [3]kdb> 
      

      Attachments

        Issue Links

          Activity

            [LU-6696] ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5

            How do I get the llog ID?

            mhanafi Mahmoud Hanafi added a comment - How do I get the llog ID?
            bobijam Zhenyu Xu added a comment -

            00000004:00000040:1.0:1433912973.463121:0:6221:0:(osp_sync.c:949:osp_sync_llog_init()) hw_nb-OST0024-osc-MDT0000: Init llog for 36 - catid 0xcb2000d:1:9c396a65

            So the llog ID is 0xcb2000d:1:9c396a65 ?

            From code llog_process_thread(), the cur_offset is initialized as LLOG_CHUNK_SIZE, so the block was read from 8192, I don't quite know about llog somehow.

            bobijam Zhenyu Xu added a comment - 00000004:00000040:1.0:1433912973.463121:0:6221:0:(osp_sync.c:949:osp_sync_llog_init()) hw_nb-OST0024-osc-MDT0000: Init llog for 36 - catid 0xcb2000d:1:9c396a65 So the llog ID is 0xcb2000d:1:9c396a65 ? From code llog_process_thread(), the cur_offset is initialized as LLOG_CHUNK_SIZE, so the block was read from 8192, I don't quite know about llog somehow.

            there is no way to stop llog processing, except the llog removal by hands. Meanwhile the record type 0x10645539 is llog_header and it is correct that its lrh_index is 0, so probably llog is not corrupted. The question is why header lies at offset 8192 or maybe block was read from offset 0 somehow? Do you know llog ID? Then it can be found in /O/ directory and analyzed, anyway we have to find it. Don't remove this llog, we need its header with all data in it at first.

            tappro Mikhail Pershin added a comment - there is no way to stop llog processing, except the llog removal by hands. Meanwhile the record type 0x10645539 is llog_header and it is correct that its lrh_index is 0, so probably llog is not corrupted. The question is why header lies at offset 8192 or maybe block was read from offset 0 somehow? Do you know llog ID? Then it can be found in /O/ directory and analyzed, anyway we have to find it. Don't remove this llog, we need its header with all data in it at first.
            bobijam Zhenyu Xu added a comment -

            from the MDT log

            00000040:00000001:3.0:1433912973.463529:0:6221:0:(llog_osd.c:542:llog_osd_next_block()) Process entered
            00000040:00001000:3.0:1433912973.463531:0:6221:0:(llog_osd.c:551:llog_osd_next_block()) looking for log index 61 (cur idx 0 off 8192)
            00000040:00000001:1.0:1433912973.463674:0:6221:0:(llog_osd.c:652:llog_osd_next_block()) Process leaving via out (rc=0 : 0 : 0x0)
            00000040:00000001:1.0:1433912973.463676:0:6221:0:(lustre_log.h:520:llog_next_block()) Process leaving (rc=0 : 0 : 0)
            00000040:00001000:1.0:1433912973.463678:0:6221:0:(llog.c:336:llog_process_thread()) processing rec 0xffff88035dcde000 type 0x10645539
            00000040:00001000:1.0:1433912973.463680:0:6221:0:(llog.c:342:llog_process_thread()) after swabbing, type=0x10645539 idx=0
            00000040:00000001:1.0:1433912973.463682:0:6221:0:(llog.c:347:llog_process_thread()) Process leaving via repeat (rc=0 : 0 : 0x0)
            00000040:00001000:1.0:1433912973.463685:0:6221:0:(llog.c:318:llog_process_thread()) index: 61 last_index 64767
            00000040:00000001:1.0:1433912973.463687:0:6221:0:(lustre_log.h:510:llog_next_block()) Process entered
            00000040:00000001:1.0:1433912973.463688:0:6221:0:(llog_osd.c:542:llog_osd_next_block()) Process entered
            00000040:00001000:1.0:1433912973.463689:0:6221:0:(llog_osd.c:551:llog_osd_next_block()) looking for log index 61 (cur idx 63 off 12224)
            00000040:00000001:1.0:1433912973.463692:0:6221:0:(llog_osd.c:654:llog_osd_next_block()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
            00000040:00000001:1.0:1433912973.463694:0:6221:0:(lustre_log.h:520:llog_next_block()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000040:00000001:1.0:1433912973.463696:0:6221:0:(llog.c:326:llog_process_thread()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
            00000040:00000010:1.0:1433912973.463699:0:6221:0:(llog.c:402:llog_process_thread()) kfreed 'buf': 8192 at ffff88035dcde000.
            00000040:00000010:1.0:1433912973.463701:0:6221:0:(llog.c:480:llog_process_or_fork()) kfreed 'lpi': 80 at ffff88035e1f51c0.
            00000040:00000001:1.0:1433912973.463704:0:6221:0:(llog.c:481:llog_process_or_fork()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000040:00020000:1.0:1433912973.463706:0:6221:0:(llog_cat.c:866:llog_cat_init_and_process()) hw_nb-OST0024-osc-MDT0000: llog_process() with cat_cancel_cb failed: rc = -5
            

            We can see that the llog for this device (OST00024) is corrupted (looking for llog indexed 61, and skipped over to index 63, and find out that the reading offset (12224) is over the llog file size, and -EIO is returned.

            Tappro, how can I disable the llog process of this OST device?

            bobijam Zhenyu Xu added a comment - from the MDT log 00000040:00000001:3.0:1433912973.463529:0:6221:0:(llog_osd.c:542:llog_osd_next_block()) Process entered 00000040:00001000:3.0:1433912973.463531:0:6221:0:(llog_osd.c:551:llog_osd_next_block()) looking for log index 61 (cur idx 0 off 8192) 00000040:00000001:1.0:1433912973.463674:0:6221:0:(llog_osd.c:652:llog_osd_next_block()) Process leaving via out (rc=0 : 0 : 0x0) 00000040:00000001:1.0:1433912973.463676:0:6221:0:(lustre_log.h:520:llog_next_block()) Process leaving (rc=0 : 0 : 0) 00000040:00001000:1.0:1433912973.463678:0:6221:0:(llog.c:336:llog_process_thread()) processing rec 0xffff88035dcde000 type 0x10645539 00000040:00001000:1.0:1433912973.463680:0:6221:0:(llog.c:342:llog_process_thread()) after swabbing, type=0x10645539 idx=0 00000040:00000001:1.0:1433912973.463682:0:6221:0:(llog.c:347:llog_process_thread()) Process leaving via repeat (rc=0 : 0 : 0x0) 00000040:00001000:1.0:1433912973.463685:0:6221:0:(llog.c:318:llog_process_thread()) index: 61 last_index 64767 00000040:00000001:1.0:1433912973.463687:0:6221:0:(lustre_log.h:510:llog_next_block()) Process entered 00000040:00000001:1.0:1433912973.463688:0:6221:0:(llog_osd.c:542:llog_osd_next_block()) Process entered 00000040:00001000:1.0:1433912973.463689:0:6221:0:(llog_osd.c:551:llog_osd_next_block()) looking for log index 61 (cur idx 63 off 12224) 00000040:00000001:1.0:1433912973.463692:0:6221:0:(llog_osd.c:654:llog_osd_next_block()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb) 00000040:00000001:1.0:1433912973.463694:0:6221:0:(lustre_log.h:520:llog_next_block()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb) 00000040:00000001:1.0:1433912973.463696:0:6221:0:(llog.c:326:llog_process_thread()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb) 00000040:00000010:1.0:1433912973.463699:0:6221:0:(llog.c:402:llog_process_thread()) kfreed 'buf': 8192 at ffff88035dcde000. 00000040:00000010:1.0:1433912973.463701:0:6221:0:(llog.c:480:llog_process_or_fork()) kfreed 'lpi': 80 at ffff88035e1f51c0. 00000040:00000001:1.0:1433912973.463704:0:6221:0:(llog.c:481:llog_process_or_fork()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb) 00000040:00020000:1.0:1433912973.463706:0:6221:0:(llog_cat.c:866:llog_cat_init_and_process()) hw_nb-OST0024-osc-MDT0000: llog_process() with cat_cancel_cb failed: rc = -5 We can see that the llog for this device (OST00024) is corrupted (looking for llog indexed 61, and skipped over to index 63, and find out that the reading offset (12224) is over the llog file size, and -EIO is returned. Tappro, how can I disable the llog process of this OST device?

            I uploaded debug logs from mdt
            ftp:/uploads/LU6696/lustre-log.1433912973.6315.txt

            mhanafi Mahmoud Hanafi added a comment - I uploaded debug logs from mdt ftp:/uploads/LU6696/lustre-log.1433912973.6315.txt

            btw, it is only the single OST that is causing the LBUG.

            mhanafi Mahmoud Hanafi added a comment - btw, it is only the single OST that is causing the LBUG.

            The MDS and MGS where located on the same device. We got the errors. as part of debugging I moved the mgs and mdt to different devices. Did a tunefs.lustre --writeconf but got the same error.

            What is the fix for the osp_sync_llog_init()

            mhanafi Mahmoud Hanafi added a comment - The MDS and MGS where located on the same device. We got the errors. as part of debugging I moved the mgs and mdt to different devices. Did a tunefs.lustre --writeconf but got the same error. What is the fix for the osp_sync_llog_init()
            bobijam Zhenyu Xu added a comment -

            the OSS shows that it cannot access MGS service

            25823:10000000:00000001:4.0:1433873407.254887:0:81369:0:(mgc_request.c:1106:mgc_enqueue()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            25824:10000000:01000000:4.0:1433873407.254888:0:81369:0:(mgc_request.c:1881:mgc_process_log()) Can't get cfg lock: -5
            

            Are MDS and MGS different nodes, or MDT and MGT locate on different devices? It seems that it's MGS does not response and the llog process cannot move on.

            Another thing to notice is that llog_cat_init_and_process() just discards the return value of llog_process_or_fork(), which makes the osp_sync_init() ignores the error in its osp_sync_llog_init(), I think this part also need fix.

            bobijam Zhenyu Xu added a comment - the OSS shows that it cannot access MGS service 25823:10000000:00000001:4.0:1433873407.254887:0:81369:0:(mgc_request.c:1106:mgc_enqueue()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb) 25824:10000000:01000000:4.0:1433873407.254888:0:81369:0:(mgc_request.c:1881:mgc_process_log()) Can't get cfg lock: -5 Are MDS and MGS different nodes, or MDT and MGT locate on different devices? It seems that it's MGS does not response and the llog process cannot move on. Another thing to notice is that llog_cat_init_and_process() just discards the return value of llog_process_or_fork(), which makes the osp_sync_init() ignores the error in its osp_sync_llog_init(), I think this part also need fix.

            I uploaded logs (lbug.OSS1.out.gz) to ftp site: /uploads/LU6696/lbug.OSS1.out.gz

            mhanafi Mahmoud Hanafi added a comment - I uploaded logs (lbug.OSS1.out.gz) to ftp site: /uploads/LU6696/lbug.OSS1.out.gz
            bobijam Zhenyu Xu added a comment -

            I want to know the cause of the llog process failure, can you upload logs from OST0024?

            bobijam Zhenyu Xu added a comment - I want to know the cause of the llog process failure, can you upload logs from OST0024?

            the mds crashes so I can't give any debug logs.

            Do you want them from the oss? what specific debug settings?

            mhanafi Mahmoud Hanafi added a comment - the mds crashes so I can't give any debug logs. Do you want them from the oss? what specific debug settings?

            People

              bobijam Zhenyu Xu
              mhanafi Mahmoud Hanafi
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: