[LU-7010] "Local llog found corrupted" during DNE2 recovery Created: 16/Aug/15 Updated: 21/Sep/15 Resolved: 21/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mikhail Pershin | Assignee: | Di Wang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Recent recovery issues in Maloo show the following:

00000040:00020000:1.0:1439669743.543046:0:5740:0:(llog.c:489:llog_process_thread()) Local llog found corrupted
00000040:00100000:1.0:1439669743.545890:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 1 in log 0x1:1024
00000040:00100000:1.0:1439669743.546205:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64838 in log 0x1:1024
00000040:00100000:1.0:1439669743.546229:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64864 in log 0x1:1024
00000040:00100000:1.0:1439669743.546242:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64896 in log 0x1:1024
00000040:00100000:1.0:1439669743.546254:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64897 in log 0x1:1024
00000040:00100000:1.0:1439669743.546267:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64899 in log 0x1:1024

As far as I can see, the DNE2 'update recovery' may return -EIO if some update was applied with an error. That causes the whole llog processing to stop and all remaining updates to be cancelled. After that, recovery stops with various errors. Here is an example, test_70b: |
| Comments |
| Comment by Peter Jones [ 14/Sep/15 ] |
|
Di is taking care of this one |
| Comment by Di Wang [ 15/Sep/15 ] |
|
Hmm, it seems to me this corruption is related to this unlanded patch: http://review.whamcloud.com/#/c/14912/ That patch did change something inside llog, so the corruption is quite possibly related to this specific patch. |
| Comment by Mikhail Pershin [ 15/Sep/15 ] |
|
I see that this problem is not related to that patch; it already exists in the code. llog_process_thread() has old code for handling corruption:

if (unlikely(rc == -EIO && loghandle->lgh_obj != NULL)) {
	/* something bad happened to the processing of a local
	 * llog file, probably I/O error or the log got corrupted..
	 * to be able to finally release the log we discard any
	 * remaining bits in the header */
	CERROR("Local llog found corrupted\n");
	while (index <= last_index) {
		if (ext2_test_bit(index, LLOG_HDR_BITMAP(llh)) != 0)
			llog_cancel_rec(lpi->lpi_env, loghandle, index);
		index++;
	}
	rc = 0;
}

Meanwhile, update_recovery.c introduced new callbacks which may return -EIO. Technically, that doesn't mean the llog itself is corrupted, and we shouldn't cancel all the other llog records. I think we shouldn't use the EIO error code there at all. |
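The distinction drawn above, between a failure to read the log and an error returned by a processing callback, can be illustrated with a minimal sketch. This is not Lustre code: `toy_log`, `toy_log_process()`, `toy_log_live()` and `toy_fail_at_cb()` are hypothetical names, and the log is modelled as a plain array of flags. It only shows one way to keep a callback's -EIO from triggering the "cancel everything" path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_LOG_RECS 8

/* Hypothetical, simplified stand-in for an llog (not the Lustre structure). */
struct toy_log {
	bool live[TOY_LOG_RECS];       /* record still present in the log */
	bool unreadable[TOY_LOG_RECS]; /* simulates an I/O error on read */
};

typedef int (*toy_cb_t)(int index, void *data);

/* Count the records that are still live. */
static int toy_log_live(const struct toy_log *log)
{
	int n = 0;

	for (int i = 0; i < TOY_LOG_RECS; i++)
		if (log->live[i])
			n++;
	return n;
}

/*
 * Process live records in order. Only a failure to READ a record is
 * treated as corruption: the remaining records are discarded and 0 is
 * returned. An error from the callback (even -EIO) is passed back to
 * the caller and leaves the remaining records untouched.
 */
static int toy_log_process(struct toy_log *log, toy_cb_t cb, void *data)
{
	assert(log != NULL && cb != NULL);

	for (int i = 0; i < TOY_LOG_RECS; i++) {
		if (!log->live[i])
			continue;
		if (log->unreadable[i]) {
			/* the log itself is damaged: discard the rest */
			for (int j = i; j < TOY_LOG_RECS; j++)
				log->live[j] = false;
			return 0;
		}
		int rc = cb(i, data);
		if (rc != 0)
			return rc; /* callback failure, log is intact */
	}
	return 0;
}

/* Example callback: fails with -EIO at the index stored in *data. */
static int toy_fail_at_cb(int index, void *data)
{
	return index == *(int *)data ? -EIO : 0;
}
```

With this split, a callback returning -EIO at record 2 leaves all eight records in place, while an unreadable record 2 discards records 2 through 7.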
| Comment by Di Wang [ 16/Sep/15 ] |
|
Mike: I understand the corruption-checking code is already there. What I mean is that the cause of this corruption is quite likely related to the change, since this "Local llog found corrupted" error seems to happen on every run of that patch, and I did not see it on other patches. Am I missing something? Do you know why it returned -EIO? Could you please explain here? Thanks. |
| Comment by Mikhail Pershin [ 16/Sep/15 ] |
|
It is not only happening with that patch; in fact, I don't even see how it can be related. Meanwhile, it looks like a duplicate of
I am not sure about the reason, but I see that -EIO from llog callbacks will also cause this message and the llog cancelling. Maybe the update llog processing callback returns -EIO? |
| Comment by Di Wang [ 16/Sep/15 ] |
|
IMHO, this EIO usually comes from llog_osd_next_block(), which means the local llog is indeed corrupted. Hmm, https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2 is also from patch 14912; am I missing something? |
| Comment by Di Wang [ 16/Sep/15 ] |
|
Hmm, I checked the failure in |
| Comment by Mikhail Pershin [ 16/Sep/15 ] |
|
What mistake in 14912 do you mean? Could you explain? Is it about llh_size? If some reports from |
| Comment by Di Wang [ 16/Sep/15 ] |
|
what mistake in 14912 do you mean, could you explain, about llh_size?

Yes, it cannot use llh_size to calculate the write offset, because for an update log llh_size is still > 0 even though the update records are NOT fixed size. So if the write_offset is wrong, a new write will ruin the llog anyway. |
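The arithmetic behind this point can be shown with a small sketch. It is an illustration only, not the code from patch 14912: the function names, the 64-byte header and the record lengths are all assumptions made up for the example. It compares an offset derived from a fixed per-record size with the real end of a log whose records vary in length.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Offset of the next write if every record is assumed to have the same
 * size (hypothetical helper; rec_size plays the role of a fixed
 * per-record size such as llh_size).
 */
static size_t offset_from_fixed_size(size_t hdr_len, size_t rec_size,
				     size_t nr_recs)
{
	return hdr_len + rec_size * nr_recs;
}

/*
 * Offset of the next write computed from the actual length of each
 * record already in the log (hypothetical helper).
 */
static size_t offset_from_lengths(size_t hdr_len, const size_t *rec_lens,
				  size_t nr_recs)
{
	size_t off = hdr_len;

	for (size_t i = 0; i < nr_recs; i++)
		off += rec_lens[i];
	return off;
}
```

For a 64-byte header and three records of 32, 48 and 24 bytes, the fixed-size formula with a record size of 32 gives 160, while the log actually ends at 168. A write placed at 160 would overlap the last 8 bytes of the third record, which is the kind of damage described above.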
| Comment by Di Wang [ 21/Sep/15 ] |
|
This is clearly caused by patch http://review.whamcloud.com/#/c/14912/ . Since that patch has not landed yet, I will close this one to avoid duplicate effort. |
| Comment by Di Wang [ 21/Sep/15 ] |
|
duplicate with |