[LU-7010] "Local llog found corrupted" during DNE2 recovery Created: 16/Aug/15  Updated: 21/Sep/15  Resolved: 21/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Mikhail Pershin Assignee: Di Wang
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7011 Kernel part of llog subsystem can do ... Open
is related to LU-5716 Improve error handling on llog process Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Recent recovery issues in Maloo show the following:

00000040:00020000:1.0:1439669743.543046:0:5740:0:(llog.c:489:llog_process_thread()) Local llog found corrupted
00000040:00100000:1.0:1439669743.545890:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 1 in log 0x1:1024
00000040:00100000:1.0:1439669743.546205:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64838 in log 0x1:1024
00000040:00100000:1.0:1439669743.546229:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64864 in log 0x1:1024
00000040:00100000:1.0:1439669743.546242:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64896 in log 0x1:1024
00000040:00100000:1.0:1439669743.546254:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64897 in log 0x1:1024
00000040:00100000:1.0:1439669743.546267:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64899 in log 0x1:1024

As far as I can see, the DNE2 'update recovery' may return -EIO if some update was applied with an error. That causes the whole llog processing to stop and all other updates to be cancelled. After that, recovery stops with various errors.

Here is an example, test_70b:
https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2



 Comments   
Comment by Peter Jones [ 14/Sep/15 ]

Di is taking care of this one

Comment by Di Wang [ 15/Sep/15 ]

Hmm, it seems to me that this corruption is related to this unlanded patch, http://review.whamcloud.com/#/c/14912/ , which did change something inside llog, so it is quite possibly caused by this specific patch.
Mike, did you see this failure happen on other patches? If not, I will close this ticket.

Comment by Mikhail Pershin [ 15/Sep/15 ]

I see that this problem is not related to that patch; it already exists in the code. See, llog_process_thread() has old code for corruption handling:

	if (unlikely(rc == -EIO && loghandle->lgh_obj != NULL)) {
		/* something bad happened to the processing of a local
		 * llog file, probably I/O error or the log got corrupted..
		 * to be able to finally release the log we discard any
		 * remaining bits in the header */
		CERROR("Local llog found corrupted\n");
		while (index <= last_index) {
			if (ext2_test_bit(index, LLOG_HDR_BITMAP(llh)) != 0)
				llog_cancel_rec(lpi->lpi_env, loghandle, index);
			index++;
		}
		rc = 0;
	}

Meanwhile, update_recovery.c introduced new callbacks which may return -EIO. Technically that doesn't mean the llog itself is corrupted, and we shouldn't cancel all other llog records. I think we shouldn't use the EIO error code there at all.
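
A minimal sketch of the distinction argued above, not the actual Lustre code: an -EIO raised while reading the log is treated as corruption and discards the remaining records, while an -EIO returned by a processing callback is passed back to the caller and leaves the log untouched. All names here (sketch_log, sketch_process, sketch_fail_at) are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an llog: a small bitmap of live records. */
#define SKETCH_NRECS 8

struct sketch_log {
	bool live[SKETCH_NRECS];       /* true = record still present */
	bool unreadable[SKETCH_NRECS]; /* true = simulated read failure */
};

typedef int (*sketch_cb_t)(int index, void *data);

/* Walk the live records and call cb on each one.
 * A read failure means the log itself is bad: the remaining records
 * are discarded and 0 is returned, mirroring the quoted handling.
 * A callback failure is returned unchanged and nothing is discarded. */
static int sketch_process(struct sketch_log *log, sketch_cb_t cb, void *data)
{
	for (int i = 0; i < SKETCH_NRECS; i++) {
		if (!log->live[i])
			continue;
		if (log->unreadable[i]) {
			for (int j = i; j < SKETCH_NRECS; j++)
				log->live[j] = false;
			return 0;
		}
		int rc = cb(i, data);
		if (rc < 0)
			return rc; /* not a log corruption: keep the records */
	}
	return 0;
}

/* Count the records that are still present in the log. */
static int sketch_count_live(const struct sketch_log *log)
{
	int n = 0;

	for (int i = 0; i < SKETCH_NRECS; i++)
		n += log->live[i] ? 1 : 0;
	return n;
}

/* Example callback: fails with -EIO at the index pointed to by data. */
static int sketch_fail_at(int index, void *data)
{
	const int *bad = data;

	return index == *bad ? -EIO : 0;
}
```

With this separation, a callback that fails on one update no longer causes every other record to be cancelled; whether the callback should use a different error code instead is the open question in this ticket.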

Comment by Di Wang [ 16/Sep/15 ]

Mike: I understand the corruption checking code is already there. But I mean the cause of this corruption is quite likely related to the change, since this "Local llog found corrupted" error seems to happen on every run of that patch. I did not see this on other patches. Am I missing something? Do you know why it returned -EIO? Could you please explain here? Thanks.

Comment by Mikhail Pershin [ 16/Sep/15 ]

It is not only happening with that patch; in fact I don't even see how it can be related. Meanwhile, it looks like a duplicate of LU-6844; some reports there also have the same problem with a corrupted log. E.g. check report https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2, MDS1 console log contains:
20:15:58:LustreError: 5740:0:(llog.c:489:llog_process_thread()) Local llog found corrupted

I am not sure about the reason, but I see that -EIO from llog callbacks will also cause this message and the llog cancelling. Maybe the update llog processing callback returns -EIO?

Comment by Di Wang [ 16/Sep/15 ]

IMHO, this EIO usually comes from llog_osd_next_block(), which means the local llog is indeed corrupted. Hmm, https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2 is also from patch 14912; am I missing something?
Besides, there is an obvious mistake in 14912, which might corrupt the update llog.

Comment by Di Wang [ 16/Sep/15 ]

Hmm, I checked the failures in LU-6844, and I do not think they are related. Most failures there are due to "No space left".

Comment by Mikhail Pershin [ 16/Sep/15 ]

What mistake in 14912 do you mean? Could you explain? Is it about llh_size? If some reports from LU-6844 are also related to patch 14912 then it can be the reason, I agree. Let's wait for the updated patch first.

Comment by Di Wang [ 16/Sep/15 ]
what mistake in 14912 do you mean, could you explain, about llh_size? 

Yes, it cannot use llh_size to calculate the write offset, because for the update log, even though it does NOT have fixed-size update records, llh_size is still > 0. So if the write_offset is wrong, then a new write will ruin the llog anyway.
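
To illustrate the point in general terms (a hypothetical sketch, not the code from patch 14912): if the write offset is computed as header size plus index times a per-record size, it only matches the real position when every record has that size. The header size of 8192 bytes, the 1-based record index, and all function names below are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed header size, for illustration only. */
#define SKETCH_HDR_SIZE 8192u

/* Offset of record `index` (1-based, index >= 1) if every record is
 * assumed to be rec_size bytes long. */
static uint64_t sketch_offset_fixed(uint32_t index, uint32_t rec_size)
{
	return SKETCH_HDR_SIZE + (uint64_t)(index - 1) * rec_size;
}

/* Offset of record `index` (1-based, index >= 1) derived from the real
 * lengths of the records that precede it; rec_len[0] is record 1. */
static uint64_t sketch_offset_actual(const uint32_t *rec_len, uint32_t index)
{
	uint64_t off = SKETCH_HDR_SIZE;

	for (uint32_t i = 0; i + 1 < index; i++)
		off += rec_len[i];
	return off;
}
```

For record lengths of 64, 128 and 96 bytes and an assumed fixed size of 64, the two functions agree for records 1 and 2 but differ for record 3: the fixed-size formula points into the middle of record 2, so a write at that offset would overwrite part of an existing record.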

Comment by Di Wang [ 21/Sep/15 ]

This is clearly caused by patch http://review.whamcloud.com/#/c/14912/ . Since that patch has not landed yet, I will close this one to avoid duplicate effort.

Comment by Di Wang [ 21/Sep/15 ]

Duplicate of LU-6556

Generated at Sat Feb 10 02:05:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.