[LU-16203] zero records and empty plain llogs in update llog catalog Created: 03/Oct/22 Updated: 26/Sep/23 Resolved: 08/Nov/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mikhail Pershin | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
There are reported situations where an llog catalog has zeros in place of a record in the middle of the llog. The reported catalogs also contain plain llogs in the middle that were created but never used, or were partially used but never deleted. The whole problem could be related to 'next' llog handling for remote llogs: it looks like several next llogs could be created at the same time, with only one (the last one) actually used. If some of those creations failed for any reason, the last one would still have all bits set in the header update and would write it down while the records are missing. That would explain the zeroed holes in the middle of the catalog. The details are not clear yet and need to be investigated. Meanwhile, a catalog llog could still be processed even with such corruption, which would help to handle these situations gracefully while the problem is being investigated. |
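To illustrate the "process the catalog despite holes" idea from the description, here is a minimal user-space sketch in C. It is not the Lustre llog code and not the landed patch: the `sketch_*` structures, field names, slot count, and the "len and index both zero" hole test are all assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real Lustre llog structures. */
#define SKETCH_SLOTS 64

struct sketch_rec {
    uint32_t len;    /* record length; 0 in a zeroed hole */
    uint32_t index;  /* record index; 0 in a zeroed hole */
    uint64_t logid;  /* id of the plain llog this slot refers to */
};

struct sketch_catalog {
    uint8_t bitmap[SKETCH_SLOTS / 8];      /* bit i set => slot i claimed in header */
    struct sketch_rec recs[SKETCH_SLOTS];  /* slot 0 is unused in this sketch */
};

static int sketch_bit_set(const struct sketch_catalog *cat, unsigned int i)
{
    return (cat->bitmap[i / 8] >> (i % 8)) & 1;
}

/* A slot whose header bit is set but whose record was never written. */
static int sketch_rec_is_hole(const struct sketch_rec *rec)
{
    return rec->len == 0 && rec->index == 0;
}

/*
 * Walk every slot the header bitmap claims is in use. Valid records have
 * their logid copied to out[] (up to out_max entries); zeroed holes are
 * counted in *skipped and otherwise ignored instead of aborting the walk.
 * Returns the number of logids stored in out[].
 */
static unsigned int sketch_cat_process(const struct sketch_catalog *cat,
                                       uint64_t *out, unsigned int out_max,
                                       unsigned int *skipped)
{
    unsigned int found = 0;

    *skipped = 0;
    for (unsigned int i = 1; i < SKETCH_SLOTS; i++) {
        if (!sketch_bit_set(cat, i))
            continue;
        if (sketch_rec_is_hole(&cat->recs[i])) {
            (*skipped)++;
            continue;
        }
        if (found < out_max)
            out[found++] = cat->recs[i].logid;
    }
    return found;
}
```

The point of the sketch is only the control flow: a set header bit with no record behind it is treated as a skippable hole rather than a fatal error, so the remaining records can still be processed.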
| Comments |
| Comment by Gerrit Updater [ 05/Oct/22 ] |
|
"Mikhail Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48776 |
| Comment by Gerrit Updater [ 17/Oct/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48902 |
| Comment by Mikhail Pershin [ 27/Oct/22 ] |
|
Regarding several crashes during testing: I've checked the dump and found that the list head is corrupted:
struct lu_context {
lc_tags = 256,
lc_state = LCS_FINALIZED,
lc_thread = 0xffff8f0bbb382100,
lc_value = 0xffff8f0b5e215800,
lc_remember = {
next = 0x6400000100, <------ is not expected
prev = 0xffff8f0b8cb4fb98
},
lc_version = 45,
lc_cookie = 0
}
Considering that lc_remember is never used in this context, it looks like a write to a wrong address, or maybe through a stale pointer. I haven't found any further clues yet. |
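As a small illustration of why the `next` value in the dump stands out, the following user-space C sketch mirrors the Linux kernel convention that an initialized, never-linked list head points to itself in both directions. The `sketch_*` names are hypothetical, and it is an assumption here that the head in question was initialized that way and never added to any list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space mirror of a doubly linked list head. */
struct sketch_list_head {
    struct sketch_list_head *next;
    struct sketch_list_head *prev;
};

/* Same convention as the kernel's INIT_LIST_HEAD: an empty head is self-linked. */
static void sketch_init_list_head(struct sketch_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* True only if the head still looks initialized and unused. */
static bool sketch_list_head_unused(const struct sketch_list_head *head)
{
    return head->next == head && head->prev == head;
}
```

A head whose `next` holds something like 0x6400000100 fails this check without any pointer being dereferenced, which is consistent with a stray write over the field rather than normal list manipulation.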
| Comment by Mikhail Pershin [ 27/Oct/22 ] |
|
So far I tend to think this is unrelated to the patch itself; more likely the patch just causes the issue to be seen. I think we can land the patch as is; in the meantime I'd re-check llog_test for possible lu_env/lu_context usage issues. |
| Comment by Gerrit Updater [ 08/Nov/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48776/ |
| Comment by Peter Jones [ 08/Nov/22 ] |
|
Landed for 2.16 |