[LU-6696] ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.9.0
Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
Labels:
None

Severity:
2

Description

LustreError: 11-0: hw_nb-OST0016-osc-MDT0000: Communicating with 10.151.26.55@o2ib, operation ost_connect failed with -114.
LustreError: 6488:0:(llog_cat.c:866:llog_cat_init_and_process()) hw_nb-OST0024-osc-MDT0000: llog_process() with cat_cancel_cb failed: rc = -5
LustreError: 6580:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5
LustreError: 6580:0:(osp_sync.c:874:osp_sync_thread()) LBUG
Pid: 6580, comm: osp-syn-36-0

Call Trace:
 [<ffffffffa05cf895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa05cfe97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa10d9243>] osp_sync_thread+0x753/0x7d0 [osp]
 [<ffffffff81559b9e>] ? thread_return+0x4e/0x770
 [<ffffffffa10d8af0>] ? osp_sync_thread+0x0/0x7d0 [osp]

Entering kdb (current=0xffff8803b5e04080, pid 6580) on processor 3 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[3]kdb>

Attachments

cb2000d_9c396a65
12 kB
10/Jun/15 9:20 PM
cb2000d_9c396a65_fixed
12 kB
12/Jun/15 8:25 AM

Issue Links

is related to

LU-9068 Hardware problem resulting in bad blocks

Resolved

LU-8252 MDS kernel panic after aborting journal

Resolved

LU-7011 Kernel part of llog subsystem can do self-repairing in some cases

Resolved

is related to

LU-5056 osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 6 changes, 8 in progress, 0 in flight: -5

Resolved

Activity

Mikhail Pershin added a comment - 10/Jun/15 8:42 PM

It looks like the llog has another (or the same) header written at offset 8192. That is wrong, and I'd like to investigate to understand how it was possible.

Andreas, I agree: the OSP code is quite aggressive towards possible I/O errors.

Mikhail Pershin added a comment - 10/Jun/15 8:39 PM

Can you post that llog file here, please?

Andreas Dilger added a comment - 10/Jun/15 6:40 PM

It should be possible to improve the error handling in this code so that it isn't an LASSERT(), and instead returns an error to the caller. We shouldn't have LASSERT() checks on data that comes from the disk.
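The idea of returning an error instead of asserting can be sketched outside the kernel. The following is a minimal Python illustration, not the actual Lustre change: `handle_llog_result` is a hypothetical helper, and the numeric value of `LLOG_PROC_BREAK` is an assumption made only so the example runs.

```python
import errno

# Assumed value, used only for illustration; the real constant lives in the
# Lustre headers.
LLOG_PROC_BREAK = 0x0001


def handle_llog_result(rc: int) -> int:
    """Map the result of llog processing to 0 or a negative errno.

    Hypothetical helper: an assertion on rc would abort on a damaged
    on-disk log, while returning the error lets the caller recover.
    """
    if rc == 0 or rc == LLOG_PROC_BREAK:
        return 0
    # Negative values (for example -EIO from a failed read) are passed up.
    # Any other unexpected positive value is reported as -EINVAL.
    return rc if rc < 0 else -errno.EINVAL
```

With this shape, the `rc = -5` seen in this ticket would reach the caller as `-5` rather than triggering an LBUG.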

Mahmoud Hanafi added a comment - 10/Jun/15 2:37 PM

nbphw-mds /mnt/lustre/hw_mdt/OBJECTS # llog_reader cb2000d:9c396a65
rec #0 type=10645539 len=8192
The log is corrupt (too big at 0)
Could not pack buffer; rc=-22

Mahmoud Hanafi added a comment - 10/Jun/15 2:34 PM

How do I get the llog ID?

Zhenyu Xu added a comment - 10/Jun/15 2:24 PM

00000004:00000040:1.0:1433912973.463121:0:6221:0:(osp_sync.c:949:osp_sync_llog_init()) hw_nb-OST0024-osc-MDT0000: Init llog for 36 - catid 0xcb2000d:1:9c396a65

So the llog ID is 0xcb2000d:1:9c396a65?

In llog_process_thread(), cur_offset is initialized to LLOG_CHUNK_SIZE, so the block was read from offset 8192. Beyond that, I don't know the llog code well.

Mikhail Pershin added a comment - 10/Jun/15 2:00 PM

There is no way to stop llog processing other than removing the llog by hand. Meanwhile, record type 0x10645539 is the llog header, and it is correct that its lrh_index is 0, so the llog is probably not corrupted. The question is why a header lies at offset 8192, or whether the block was somehow read from offset 0. Do you know the llog ID? With it, the file can be found in the /O/ directory and analyzed; either way, we have to find it. Don't remove this llog yet: we first need its header with all the data in it.
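To check where header records sit in a copy of the llog file, something like the following Python sketch could be used offline. It assumes each record starts with four little-endian 32-bit fields (`lrh_len`, `lrh_index`, `lrh_type`, `lrh_id`); that layout and the helper names are assumptions for illustration. Only the magic 0x10645539 and the 8192-byte chunk size come from this ticket.

```python
import struct

LLOG_HDR_MAGIC = 0x10645539  # record type quoted in this ticket
LLOG_CHUNK_SIZE = 8192       # chunk size quoted in this ticket
# Assumed record header layout: lrh_len, lrh_index, lrh_type, lrh_id.
REC_HDR = struct.Struct("<IIII")


def parse_rec_hdr(buf: bytes, offset: int = 0) -> dict:
    """Decode one record header at the given byte offset."""
    lrh_len, lrh_index, lrh_type, _lrh_id = REC_HDR.unpack_from(buf, offset)
    return {
        "len": lrh_len,
        "index": lrh_index,
        "type": lrh_type,
        "is_header": lrh_type == LLOG_HDR_MAGIC,
    }


def header_offsets(buf: bytes) -> list:
    """Return chunk-aligned offsets whose first record looks like a llog header."""
    last_start = len(buf) - REC_HDR.size
    return [
        off
        for off in range(0, last_start + 1, LLOG_CHUNK_SIZE)
        if parse_rec_hdr(buf, off)["is_header"]
    ]
```

For a healthy log one would expect only offset 0 in the result; a second entry at 8192 would match the symptom described above.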

Zhenyu Xu added a comment - 10/Jun/15 6:44 AM

From the MDT log:

00000040:00000001:3.0:1433912973.463529:0:6221:0:(llog_osd.c:542:llog_osd_next_block()) Process entered
00000040:00001000:3.0:1433912973.463531:0:6221:0:(llog_osd.c:551:llog_osd_next_block()) looking for log index 61 (cur idx 0 off 8192)
00000040:00000001:1.0:1433912973.463674:0:6221:0:(llog_osd.c:652:llog_osd_next_block()) Process leaving via out (rc=0 : 0 : 0x0)
00000040:00000001:1.0:1433912973.463676:0:6221:0:(lustre_log.h:520:llog_next_block()) Process leaving (rc=0 : 0 : 0)
00000040:00001000:1.0:1433912973.463678:0:6221:0:(llog.c:336:llog_process_thread()) processing rec 0xffff88035dcde000 type 0x10645539
00000040:00001000:1.0:1433912973.463680:0:6221:0:(llog.c:342:llog_process_thread()) after swabbing, type=0x10645539 idx=0
00000040:00000001:1.0:1433912973.463682:0:6221:0:(llog.c:347:llog_process_thread()) Process leaving via repeat (rc=0 : 0 : 0x0)
00000040:00001000:1.0:1433912973.463685:0:6221:0:(llog.c:318:llog_process_thread()) index: 61 last_index 64767
00000040:00000001:1.0:1433912973.463687:0:6221:0:(lustre_log.h:510:llog_next_block()) Process entered
00000040:00000001:1.0:1433912973.463688:0:6221:0:(llog_osd.c:542:llog_osd_next_block()) Process entered
00000040:00001000:1.0:1433912973.463689:0:6221:0:(llog_osd.c:551:llog_osd_next_block()) looking for log index 61 (cur idx 63 off 12224)
00000040:00000001:1.0:1433912973.463692:0:6221:0:(llog_osd.c:654:llog_osd_next_block()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
00000040:00000001:1.0:1433912973.463694:0:6221:0:(lustre_log.h:520:llog_next_block()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00000040:00000001:1.0:1433912973.463696:0:6221:0:(llog.c:326:llog_process_thread()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
00000040:00000010:1.0:1433912973.463699:0:6221:0:(llog.c:402:llog_process_thread()) kfreed 'buf': 8192 at ffff88035dcde000.
00000040:00000010:1.0:1433912973.463701:0:6221:0:(llog.c:480:llog_process_or_fork()) kfreed 'lpi': 80 at ffff88035e1f51c0.
00000040:00000001:1.0:1433912973.463704:0:6221:0:(llog.c:481:llog_process_or_fork()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00000040:00020000:1.0:1433912973.463706:0:6221:0:(llog_cat.c:866:llog_cat_init_and_process()) hw_nb-OST0024-osc-MDT0000: llog_process() with cat_cancel_cb failed: rc = -5

We can see that the llog for this device (OST0024) is corrupted: while looking for llog index 61, processing skipped ahead to index 63, found that the read offset (12224) is beyond the llog file size, and returned -EIO.

Tappro, how can I disable llog processing for this OST device?
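The debug log above prints each return code three ways: as an unsigned 64-bit decimal, as a signed decimal, and in hex. A small Python sketch shows how the three forms relate. The helper names are hypothetical, and the errno table covers only the codes that appear in this ticket, using Linux numbering.

```python
# Linux errno numbers for the codes seen in this ticket only.
LINUX_ERRNO = {5: "EIO", 22: "EINVAL", 114: "EALREADY"}


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def describe_rc(value: int) -> str:
    """Render a logged rc as a signed number plus its errno name, if known."""
    rc = to_signed64(value)
    if rc >= 0:
        return str(rc)
    return f"{rc} (-{LINUX_ERRNO.get(-rc, 'unknown')})"
```

For example, `describe_rc(18446744073709551611)` and `describe_rc(0xfffffffffffffffb)` both give `-5 (-EIO)`, which is the value that tripped the assertion in osp_sync_thread().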

Mahmoud Hanafi added a comment - 10/Jun/15 5:12 AM

I uploaded debug logs from mdt
ftp:/uploads/LU6696/lustre-log.1433912973.6315.txt

Mahmoud Hanafi added a comment - 10/Jun/15 5:05 AM

btw, it is only the single OST that is causing the LBUG.

Mahmoud Hanafi added a comment - 10/Jun/15 4:46 AM

The MDS and MGS were located on the same device when we got the errors. As part of debugging, I moved the MGS and MDT to different devices and ran tunefs.lustre --writeconf, but got the same error.

What is the fix for osp_sync_llog_init()?

People

Assignee: Zhenyu Xu

Reporter: Mahmoud Hanafi

Votes: 0

Watchers: 12

Dates

Created: 08/Jun/15 6:36 PM

Updated: 18/Jul/18 6:10 PM

Resolved: 13/Jul/16 6:08 PM