[LU-6556] changelog catalog corruption if all possible records are defined Created: 02/May/15  Updated: 07/Jan/16  Resolved: 28/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Antoine Percher Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

redhat kernel 2.6.32_504.8.1.el6
lustre 2.5.3.90 + LU-5740 LU-4582 LU-5678 LU-5393 LU-3727 LU-4528 LU-5522 LU-5264 LU-6049 LU-6084 LU-5764


Attachments: Text File log_lu-6556_b.txt    
Issue Links:
Related
is related to LU-4528 osd_trans_exec_op()) ASSERTION( oti->... Resolved
is related to LU-7329 sanity test_60a timeouts with “* invo... Resolved
is related to LU-7138 LBUG: (osd_handler.c:1017:osd_trans_s... Resolved
is related to LU-6634 (osd_handler.c:901:osd_trans_start())... Resolved
is related to LU-7241 sanity test_60a: there are no more fr... Resolved
is related to LU-6954 LustreError: 12934:0:(mdd_device.c:30... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After our last Lustre upgrade, several Lustre file systems at the
tera100 and tgcc sites hit the same corruption of the changelog_catalog.
The robinhood node panics as in LU-6471, and the crash analysis
shows that the changelog catalog file is corrupted.
The file is larger than the maximum size for this type of file, and
the record that produces the panic is not in the right place.



 Comments   
Comment by Peter Jones [ 02/May/15 ]

Bruno

Can you please confirm whether this does indeed meet the profile of LU-6471?

Thanks

Peter

Comment by Bruno Faccini (Inactive) [ 02/May/15 ]

Assigning to myself since I worked on this issue while on-site with Antoine.

In fact, after our joint analysis, it seems that the crash occurred shortly after the upgrade and reboot of all nodes because of two combined factors:
1) a ChangeLog Catalog that had already looped back
2) running with the LU-4528 patch, which seems to have introduced a regression: looped-back Catalogs are not handled correctly, and Catalogs are expected only to grow, which is not the case.

I will push a patch soon to master, where the problem also seems to be present (but is still undetected, because it needs a looped-back Catalog to trigger, which should not occur very often ...).
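
As a rough illustration of the loop-back behaviour described above, here is a toy model of a fixed-size catalog whose write index wraps around instead of growing. It is only a sketch and not the Lustre llog code: the class name ToyCatalog, the slot layout, and the choice of reserving slot 0 for a header are all assumptions made for the example.

```python
import errno


class ToyCatalog:
    """Toy fixed-size catalog. Slot 0 is reserved for a header (assumption)."""

    def __init__(self, nslots):
        if nslots < 2:
            raise ValueError("need a header slot and at least one record slot")
        self.slots = [None] * nslots
        self.slots[0] = "header"
        self.last = 0  # index of the most recently written slot

    def next_index(self):
        # Wrap back to slot 1 instead of growing past the last slot.
        nxt = self.last + 1
        return 1 if nxt == len(self.slots) else nxt

    def add(self, rec):
        idx = self.next_index()
        if self.slots[idx] is not None:
            # Looped back onto a slot that is still in use: the catalog is full.
            raise OSError(errno.ENOSPC, "catalog full")
        self.slots[idx] = rec
        self.last = idx
        return idx

    def cancel(self, idx):
        # Free a slot so that a later wrap-around can reuse it.
        self.slots[idx] = None
```

In this model the slot list never gets longer. A writer that assumed the index could only increase would instead place records past the last slot, which is the kind of oversized file reported in the description.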

Comment by Bruno Faccini (Inactive) [ 02/May/15 ]

Peter, no. In this particular case the LBUG, although similar to the one in LU-6471, is valid, since it is triggered by the ChangeLog Catalog corruption I explained in my previous comment.

Comment by Antoine Percher [ 11/May/15 ]

Please find attached the customer trace analysis file (log_lu-6556_b.txt),
which helps to understand the changelog catalog corruption.

Comment by Gerrit Updater [ 21/May/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14912
Subject: LU-6556 obdclass: re-allow catalog loopback
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8112eb1a2d3dd5beed00c75d76295ae1362ab19a

Comment by Bruno Faccini (Inactive) [ 23/May/15 ]

I have created LU-6634 because, while testing my patch for this ticket (LU-6556), I wanted to check what happens (ENOSPC return expected!) after the Catalog has looped back and filled up, but instead I got a "(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:" LBUG. I have identified the cause: in the error path of llog_cat_new_log(), llog_destroy() is called to destroy the new plain LLOG whose reference cannot be recorded in the Catalog because no slot is available. This triggers the assertion because a transaction has already been started from llog_cat_add() when llog_destroy() tries to start its own.
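
The nesting problem described above can be sketched with a toy model. This is not the osd or llog code: ToyThread, toy_cat_add, and NestedTransactionError are hypothetical names, and the single journal_info attribute only stands in for the per-thread handle that the assertion checks.

```python
class NestedTransactionError(RuntimeError):
    """Raised where the real code would hit the assertion (toy stand-in)."""


class ToyThread:
    """Stands in for get_current(): holds at most one journal handle."""

    def __init__(self):
        self.journal_info = None

    def trans_start(self, owner):
        # Mirrors the intent of the assertion: no transaction may be open yet.
        if self.journal_info is not None:
            raise NestedTransactionError(
                f"{owner} started a transaction inside {self.journal_info}")
        self.journal_info = owner

    def trans_stop(self):
        self.journal_info = None


def toy_cat_add(thread, catalog_full):
    thread.trans_start("llog_cat_add")
    try:
        if catalog_full:
            # Error path: destroying the new plain log opens its own
            # transaction while the outer one is still open, so this raises.
            thread.trans_start("llog_destroy")
            thread.trans_stop()
            return "ENOSPC"  # the intended result, never reached in this model
        return "ok"
    finally:
        thread.trans_stop()
```

Calling toy_cat_add(ToyThread(), catalog_full=True) raises NestedTransactionError instead of returning the expected ENOSPC result, which matches the shape of the failure reported in LU-6634.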

Comment by Andreas Dilger [ 14/Sep/15 ]

I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (by default), or block all MDS operations until the ChangeLog can be written (if /proc tunable is set to make ChangeLog updates mandatory). It should unregister starting with the oldest watcher on the assumption that the older watcher was forgotten and newer ones are still running, and that this will release the most space. The unregistration should cancel records up to the next watcher, or all remaining records if no other watchers are left.
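
The unregistration policy proposed above could look roughly like the following sketch. It is a toy model under stated assumptions, not an existing Lustre interface: watchers are a dict mapping a numeric id to the last record index that watcher has consumed, the lowest id is taken to be the oldest watcher, and unregister_oldest is a hypothetical helper name.

```python
def unregister_oldest(watchers, records):
    """Drop the oldest watcher and cancel the records nobody else needs.

    watchers: dict of id -> last record index consumed (lowest id = oldest,
              an assumption of this sketch).
    records:  list of record indices currently retained.
    Returns (removed_id, remaining_watchers, remaining_records).
    """
    if not watchers:
        return None, {}, list(records)
    oldest = min(watchers)
    remaining = {uid: pos for uid, pos in watchers.items() if uid != oldest}
    if remaining:
        # Cancel up to the next watcher: keep only records that some
        # remaining watcher has not consumed yet.
        floor = min(remaining.values())
        kept = [r for r in records if r > floor]
    else:
        # No watcher left: all remaining records can be cancelled.
        kept = []
    return oldest, remaining, kept
```

For example, with watchers {1: 10, 2: 50, 3: 80} and records 11 to 100 retained, removing watcher 1 leaves records 51 to 100, since watcher 2 has already consumed everything up to 50.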

Comment by Robert Read (Inactive) [ 14/Sep/15 ]

I suggest going a step further and proactively removing stale watchers after a configurable period or when hitting a max watermark, to try to avoid running out of space. Also, being unregistered is a reasonable notification to the application that it has lost its changelog feed and needs to resync.

Comment by Gerrit Updater [ 17/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14912/
Subject: LU-6556 obdclass: re-allow catalog to wrap around
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4691290f6d39bffaa3e463697fbc3ac351015e76

Comment by Bruno Faccini (Inactive) [ 23/Oct/15 ]

Andreas, Robert,
I also think that your concerns are really good points for further ChangeLog-related enhancements, but that they should be addressed in a separate ticket, so that this ticket can now be closed.
Do you agree?

Comment by Andreas Dilger [ 25/Oct/15 ]

Bruno, that is fine. Please file a separate bug and copy over relevant comments before closing this one, so that they are not forgotten.

Comment by Bruno Faccini (Inactive) [ 26/Oct/15 ]

LU-7340 has been created to address the ChangeLog-related enhancements discussed above and more graceful handling of ENOSPC conditions.

Comment by Joseph Gmitter (Inactive) [ 28/Oct/15 ]

Landed to 2.8

Comment by Andreas Dilger [ 10/Dec/15 ]

Backports of the http://review.whamcloud.com/14912 patch also need the http://review.whamcloud.com/17052 patch, "LU-7329 obdclass: sync device to flush journal callbacks", to avoid introducing test failures in sanity test_60a.

Generated at Sat Feb 10 02:01:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.