Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.5.3
- Labels: None
- Severity: 3
Description
After our last Lustre upgrade, several Lustre filesystems at the tera100 and TGCC sites have hit the same corruption of the changelog_catalog. The Robinhood node panics as in LU-6471, and crash analysis shows that the changelog_catalog file is corrupted: the file is larger than the maximum size allowed for this type of file, and the record that triggers the panic is not where it should be.
Attachments
Issue Links
- is related to
  - LU-7138 LBUG: (osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed: (Resolved)
  - LU-6634 (osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed: when reaching Catalog full condition (Resolved)
  - LU-7241 sanity test_60a: there are no more free slots in catalog (Resolved)
  - LU-6954 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5 (Closed)
- is related to
  - LU-4528 osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 (Resolved)
  - LU-7329 sanity test_60a timeouts with "* invoking oom-killer" (Resolved)
Comments
I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (the default behaviour), or to block all MDS operations until the ChangeLog can be written (if a /proc tunable is set to make ChangeLog updates mandatory). Unregistration should start with the oldest watcher, on the assumption that the oldest watcher was forgotten while newer ones are still running, and that evicting it will release the most space. Unregistering a watcher should cancel records up to the next watcher's position, or all remaining records if no other watchers are left.
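A minimal sketch of that eviction pass, under stated assumptions: the types and helpers below (cl_watcher, changelog, changelog_cancel_upto, changelog_handle_full) are simplified stand-ins invented for illustration, not the actual Lustre mdd/llog structures or APIs.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered ChangeLog watcher ("cl1", ...). */
struct cl_watcher {
        uint32_t           cw_id;      /* user id */
        uint64_t           cw_endrec;  /* last record this watcher consumed */
        uint64_t           cw_regtime; /* registration time; smaller = older */
        struct cl_watcher *cw_next;    /* singly linked list of watchers */
};

struct changelog {
        struct cl_watcher *cl_users;     /* registered watchers */
        uint64_t           cl_last_rec;  /* newest record index */
        bool               cl_mandatory; /* tunable: block ops instead of evicting */
};

/* Stand-in for cancelling all records up to and including 'endrec'. */
static void changelog_cancel_upto(struct changelog *cl, uint64_t endrec)
{
        printf("cancel records <= %llu\n", (unsigned long long)endrec);
}

/*
 * Called when the catalog is full.  Returns true if space was released,
 * false if the caller must block the operation until records drain.
 */
static bool changelog_handle_full(struct changelog *cl)
{
        struct cl_watcher **pp, **oldest = NULL;

        if (cl->cl_mandatory)
                return false;   /* mandatory mode: never drop a watcher */

        /* Find the oldest watcher: assumed forgotten, pins the most space. */
        for (pp = &cl->cl_users; *pp != NULL; pp = &(*pp)->cw_next)
                if (oldest == NULL || (*pp)->cw_regtime < (*oldest)->cw_regtime)
                        oldest = pp;

        if (oldest == NULL)
                return false;   /* no watcher registered, nothing to evict */

        struct cl_watcher *victim = *oldest;
        *oldest = victim->cw_next;      /* unlink = unregister */

        /*
         * Cancel up to the minimum position of the surviving watchers,
         * or everything up to the newest record if none are left.
         */
        uint64_t endrec = cl->cl_last_rec;
        for (struct cl_watcher *w = cl->cl_users; w != NULL; w = w->cw_next)
                if (w->cw_endrec < endrec)
                        endrec = w->cw_endrec;

        changelog_cancel_upto(cl, endrec);
        free(victim);
        return true;
}

int main(void)
{
        struct changelog cl = { NULL, 500, false };
        struct cl_watcher *w2 = calloc(1, sizeof(*w2));
        struct cl_watcher *w1 = calloc(1, sizeof(*w1));

        /* Watcher 1 registered first (oldest) and lags behind watcher 2. */
        w2->cw_id = 2; w2->cw_endrec = 120; w2->cw_regtime = 2000;
        w1->cw_id = 1; w1->cw_endrec = 40;  w1->cw_regtime = 1000;
        w1->cw_next = w2;
        cl.cl_users = w1;

        changelog_handle_full(&cl);     /* evicts watcher 1, cancels <= 120 */
        free(w2);
        return 0;
}

The point of the sketch is that cancellation is bounded by the minimum read position (cw_endrec) of the surviving watchers, so no record still needed by a live consumer is dropped; only when the evicted watcher was the last one does it cancel everything up to cl_last_rec.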