Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.5.3
- None
- 3
- 9223372036854775807
Description
After our last Lustre upgrade, some Lustre file systems at the tera100
and TGCC sites hit the same corruption of the changelog_catalog file.
The robinhood node panics as in LU-6471, and the crash analysis
shows that the changelog_catalog file is corrupted.
The file is larger than the maximum size for this type of file, and
the record that triggers the panic is not in the right place.
Issue Links
- is related to
  - LU-7138 LBUG: (osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed: (Resolved)
  - LU-6634 (osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed: when reaching Catalog full condition (Resolved)
  - LU-7241 sanity test_60a: there are no more free slots in catalog (Resolved)
  - LU-6954 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5 (Closed)
- is related to
  - LU-4528 osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 (Resolved)
  - LU-7329 sanity test_60a timeouts with “* invoking oom-killer” (Resolved)
I have created LU-6634 because, while testing my patch for this ticket (LU-6556), I wanted to check what happens (an ENOSPC return is expected!) after the Catalog has looped back and filled up, but I got a "(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:" LBUG instead. I have identified the cause: in the error path of llog_cat_new_log(), llog_destroy() is called to destroy the new plain LLOG whose reference cannot be recorded in the Catalog because no slot is available. This triggers the assertion because llog_cat_add() has already started a transaction when llog_destroy() tries to start its own.
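
To make the nesting problem described above concrete, here is a minimal user-space sketch in C. It is not Lustre code: every toy_* name is hypothetical, the global journal_info pointer only stands in for the per-thread get_current()->journal_info field, and -EBUSY is returned at the point where the real osd_trans_start() would hit the assertion. It only models the call order reported in this comment.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for get_current()->journal_info (assumption: one thread). */
static void *journal_info = NULL;
static int handle_token;

/* Start a transaction. Returns -EBUSY where the real code would LBUG,
 * i.e. when a transaction is already open in this context. */
static int toy_trans_start(void)
{
    if (journal_info != NULL)
        return -EBUSY;
    journal_info = &handle_token;
    return 0;
}

static void toy_trans_stop(void)
{
    journal_info = NULL;
}

/* Hypothetical destroy helper: like llog_destroy() in the report,
 * it wants to start its own transaction. */
static int toy_llog_destroy(void)
{
    int rc = toy_trans_start();

    if (rc != 0)
        return rc;
    toy_trans_stop();
    return 0;
}

/* Hypothetical catalog add: opens a transaction, and when the catalog
 * has no free slot it takes the error path and destroys the new plain
 * log while its own transaction is still open. */
static int toy_cat_add(int free_slots, int *destroy_rc)
{
    int rc = toy_trans_start();

    if (rc != 0)
        return rc;

    *destroy_rc = 0;
    if (free_slots == 0) {
        *destroy_rc = toy_llog_destroy(); /* nested start attempt */
        rc = -ENOSPC;                     /* the expected result */
    }

    toy_trans_stop();
    return rc;
}
```

Calling toy_cat_add(0, &rc) returns -ENOSPC but leaves rc set to -EBUSY, which is the nested start that the real assertion catches. Calling toy_llog_destroy() outside any open transaction succeeds.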