[LU-16203] zero records and empty plain llogs in update llog catalog Created: 03/Oct/22  Updated: 26/Sep/23  Resolved: 08/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Gantt End to Start
Related
is related to LU-15938 MDT recovery did not finish due to co... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There are reported cases where an llog catalog has zeros in place of a record in the middle of the llog. There are also reports of catalogs containing plain llogs that were created but never used, or partially used but never deleted, in the middle of the catalog.

The whole problem could be related to 'next' llog handling for remote llogs: it looks like several next llogs could be created at the same time, with only one (the last one) actually used. If some of those creations fail for any reason, the last one would still have all bits set in its header update and would write the header out while the records are missing. That would explain zeroed holes in the middle of the catalog. Details are not clear yet and need to be investigated.

Meanwhile, the catalog llog could still be processed even with such corruption, which would help handle these situations gracefully while the problem is being investigated.
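
To illustrate the idea of processing a catalog despite such holes, here is a minimal C sketch. It is not the code from the patch: the structures, sizes, and names (`sketch_catalog`, `sketch_cat_rec`, `sketch_cat_process`, and so on) are hypothetical simplifications, and the validity check (record length and index match the slot) is an assumption made only for illustration. It models a header bitmap that claims a slot is in use while the record itself is zeroed, and skips such slots instead of aborting.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; NOT Lustre's real llog structures. */
#define SKETCH_CAT_SLOTS 8

struct sketch_cat_rec {
	uint32_t len;      /* record length; 0 in a zeroed hole */
	uint32_t index;    /* slot index the record claims to occupy */
	uint64_t plain_id; /* id of the plain llog this record refers to */
};

struct sketch_catalog {
	uint8_t bitmap;    /* bit i set => header claims slot i is in use */
	struct sketch_cat_rec recs[SKETCH_CAT_SLOTS];
};

/* Per-record callback; a negative return value stops processing. */
typedef int (*sketch_cb_t)(const struct sketch_cat_rec *rec, void *data);

/* Assumed sanity check: a good record has the expected length and index. */
static int sketch_rec_is_valid(const struct sketch_cat_rec *rec, unsigned int idx)
{
	return rec->len == sizeof(*rec) && rec->index == idx;
}

/*
 * Walk every slot the header bitmap marks as used. Slots whose record
 * fails the sanity check (for example, a zeroed hole) are counted and
 * skipped rather than treated as a fatal error.
 * Returns the number of skipped records, or a negative callback error.
 */
static int sketch_cat_process(const struct sketch_catalog *cat,
			      sketch_cb_t cb, void *data)
{
	int skipped = 0;
	unsigned int i;

	for (i = 0; i < SKETCH_CAT_SLOTS; i++) {
		int rc;

		if (!(cat->bitmap & (1u << i)))
			continue; /* slot not claimed by the header */

		if (!sketch_rec_is_valid(&cat->recs[i], i)) {
			skipped++; /* bit set but record missing or bad */
			continue;
		}

		rc = cb(&cat->recs[i], data);
		if (rc < 0)
			return rc;
	}
	return skipped;
}

/* Example callback: count the valid records that were visited. */
static int sketch_count_cb(const struct sketch_cat_rec *rec, void *data)
{
	(void)rec;
	(*(int *)data)++;
	return 0;
}
```

In this sketch a caller would fill a `sketch_catalog`, pass `sketch_count_cb` with a pointer to an `int` counter, and treat a positive return value as "processed, but some records were skipped", which leaves room for logging or later repair.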



 Comments   
Comment by Gerrit Updater [ 05/Oct/22 ]

"Mikhail Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48776
Subject: LU-16203 llog: skip bad records in llog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 43bdf80cbf52c4a2fc27aebe49c8996fa3818fa5

Comment by Gerrit Updater [ 17/Oct/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48902
Subject: LU-16203 llog: skip bad records in llog
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 8b5e5ccfa8ff52703ed9c860e5b27f07ccdc2fb1

Comment by Mikhail Pershin [ 27/Oct/22 ]

Regarding the several crashes during testing: I've checked the dump and found that the list head is corrupted:

struct lu_context {
  lc_tags = 256, 
  lc_state = LCS_FINALIZED, 
  lc_thread = 0xffff8f0bbb382100, 
  lc_value = 0xffff8f0b5e215800, 
  lc_remember = {
    next = 0x6400000100, <------ is not expected
    prev = 0xffff8f0b8cb4fb98
  }, 
  lc_version = 45, 
  lc_cookie = 0
} 

Considering that lc_remember is never used in this context, it looks like a write to a wrong address, or maybe through a stale pointer. I haven't found any further clues yet.

Comment by Mikhail Pershin [ 27/Oct/22 ]

So far I tend to think this is unrelated to the patch itself; more likely the patch just causes the issue to be seen. I think we can land the patch as is. In the meantime I'd re-check llog_test for possible lu_env/lu_context usage issues.

Comment by Gerrit Updater [ 08/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48776/
Subject: LU-16203 llog: skip bad records in llog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf121b16685fe2a271b1b3c5e97eabcfe01aac8a

Comment by Peter Jones [ 08/Nov/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:24:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.