
LU-6556: changelog catalog corruption if all possible records are defined


    Description

      After our last Lustre upgrade, on the Tera100 and TGCC sites, some
      Lustre filesystems have hit the same corruption of the changelog_catalog.
      The Robinhood node panics as in LU-6471, and the crash analysis
      shows that it is the changelog_catalog file that is corrupted:
      the file is bigger than the maximum size for this type of file,
      and the record that produces the panic is not in the right place.

    Activity

            rread Robert Read added a comment -

            I suggest going a step further and proactively removing stale watchers after a configurable period, or when hitting a max watermark, to try to avoid running out of space. Also, being unregistered is a reasonable notification to the application that it has lost its changelog feed and needs to resync.

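            A minimal user-space sketch of such a policy follows; every name and constant in it is hypothetical, not taken from the Lustre code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical watcher record: the real ChangeLog user state lives in
 * the MDD layer and does not look like this. */
struct cl_watcher {
	int    cw_id;        /* registered user id */
	time_t cw_last_read; /* time of last changelog read/clear */
	bool   cw_active;
};

#define MAX_WATCHERS     2               /* illustrative watermark */
#define STALE_PERIOD_SEC (7 * 24 * 3600) /* configurable staleness period */

/* Deregister watchers idle longer than STALE_PERIOD_SEC; if still over
 * the watermark, evict the oldest remaining reader so its records can
 * be cancelled and space reclaimed before the log fills up. */
static void gc_stale_watchers(struct cl_watcher *w, int n, time_t now)
{
	int active = 0;
	struct cl_watcher *oldest = NULL;

	for (int i = 0; i < n; i++) {
		if (!w[i].cw_active)
			continue;
		if (now - w[i].cw_last_read > STALE_PERIOD_SEC) {
			w[i].cw_active = false; /* stale: feed is lost */
			continue;
		}
		active++;
		if (oldest == NULL || w[i].cw_last_read < oldest->cw_last_read)
			oldest = &w[i];
	}
	if (active > MAX_WATCHERS && oldest != NULL)
		oldest->cw_active = false; /* watermark eviction */
}

int main(void)
{
	time_t now = time(NULL);
	struct cl_watcher w[] = {
		{ 1, now - 30 * 24 * 3600, true }, /* idle a month: stale */
		{ 2, now - 60,             true },
		{ 3, now - 10,             true },
	};

	gc_stale_watchers(w, 3, now);
	for (int i = 0; i < 3; i++)
		printf("watcher %d active=%d\n", w[i].cw_id, w[i].cw_active);
	return 0;
}
```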

            adilger Andreas Dilger added a comment -

            I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (by default), or to block all MDS operations until the ChangeLog can be written (if a /proc tunable is set to make ChangeLog updates mandatory). It should unregister starting with the oldest watcher, on the assumption that the oldest watcher was forgotten while newer ones are still running, and that this will release the most space. The unregistration should cancel records up to the next watcher, or all remaining records if no other watchers are left.
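            A sketch of that eviction rule, again in hypothetical standalone C rather than the actual MDD ChangeLog-user code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-user bookmark into the ChangeLog: each registered
 * user remembers the highest record index it has consumed/cleared. */
struct cl_user {
	int      cu_id;
	uint64_t cu_endrec;
	bool     cu_active;
};

/* On changelog-full/ENOSPC: auto-unregister the user with the lowest
 * bookmark (the oldest, presumably forgotten, watcher), then return
 * the index up to which records may be cancelled: the bookmark of the
 * next-oldest watcher, or UINT64_MAX if no watcher remains (meaning
 * all remaining records can go). */
static uint64_t evict_oldest_user(struct cl_user *u, int n)
{
	struct cl_user *oldest = NULL;
	uint64_t next_min = UINT64_MAX;

	for (int i = 0; i < n; i++)
		if (u[i].cu_active &&
		    (oldest == NULL || u[i].cu_endrec < oldest->cu_endrec))
			oldest = &u[i];
	if (oldest == NULL)
		return 0; /* nobody registered, nothing to release */
	oldest->cu_active = false; /* the auto-unregistration step */

	for (int i = 0; i < n; i++) /* bookmark of the next-oldest watcher */
		if (u[i].cu_active && u[i].cu_endrec < next_min)
			next_min = u[i].cu_endrec;
	return next_min;
}

int main(void)
{
	struct cl_user users[] = {
		{ 1,   100, true }, /* oldest bookmark: gets unregistered */
		{ 2, 90000, true },
		{ 3, 95000, true },
	};

	/* Records up to index 90000 can now be cancelled. */
	printf("purge up to %llu\n",
	       (unsigned long long)evict_oldest_user(users, 3));
	return 0;
}
```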

            bfaccini Bruno Faccini (Inactive) added a comment -

            I have created LU-6634 because, during my testing of my patch for this LU-6556 ticket, I wanted to check what happens (an ENOSPC return is expected!) after the Catalog has looped back and fills up, but I got a "(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:" LBUG. I have identified the cause: in the error path of llog_cat_new_log(), llog_destroy() is called to destroy the new plain LLOG whose reference cannot be recorded in the Catalog because no slot is available, and this triggers the assertion because the transaction started by llog_cat_add() is still open when llog_destroy() wants to start its own.
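            A toy standalone model of why that assertion fires (purely illustrative; not the osd/llog code itself):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the osd_trans_start() assertion: the per-thread
 * journal_info must be NULL when a transaction starts, i.e.
 * transactions must not nest. */
static void *journal_info; /* stand-in for get_current()->journal_info */

static void trans_start(void)
{
	assert(journal_info == NULL); /* the LBUG fires here */
	journal_info = &journal_info; /* mark a transaction as running */
}

static void trans_stop(void)
{
	journal_info = NULL;
}

/* llog_destroy() starts its own transaction... */
static void toy_llog_destroy(void)
{
	trans_start();
	trans_stop();
}

int main(void)
{
	trans_start();      /* transaction opened for llog_cat_add() */
	/* error path: no Catalog slot is free to record the new plain
	 * llog, so the caller tries to destroy it again... */
	toy_llog_destroy(); /* ...and the nested trans_start() asserts */
	trans_stop();       /* never reached */
	return 0;
}
```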

            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14912
            Subject: LU-6556 obdclass: re-allow catalog loopback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8112eb1a2d3dd5beed00c75d76295ae1362ab19a

            apercher Antoine Percher added a comment -

            Please find attached the customer trace analysis file (log_lu-6556_b.txt),
            which explains the changelog_catalog corruption.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Peter, no: in this particular case the LBUG, although similar to the one in LU-6471, is valid, since it is triggered by the ChangeLog Catalog corruption I explained in my previous comment.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Assigning this to me, since I have been working on the issue while on-site with Antoine.

            In fact, after our joint analysis, it seems the crash occurred a short time after the upgrade+reboot of all nodes because of two combined factors:
            1) a ChangeLog Catalog that had already looped back;
            2) running with the LU-4528 patch, which seems to have introduced a regression where looped-back Catalogs are not handled correctly: they are expected to only grow, which is no longer the case after loop-back.

            I will push a patch to master soon, where the problem also seems to be present (but is still undetected, since triggering it requires a looped-back Catalog, which should not occur very frequently...).

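            A hypothetical sketch of wrap-aware index handling in a fixed-size catalog; the constant and helper below are illustrative, not the actual llog implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative catalog capacity; the real bound comes from the llog
 * header bitmap, not from this constant. */
#define CAT_MAX_IDX 64767u

/* Once the catalog has looped back, indices wrap: the slot after
 * CAT_MAX_IDX is 1 again (slot 0 conventionally holds the header), so
 * any code assuming "next index > current index" breaks exactly in
 * the looped-back situation described above. */
static uint32_t cat_next_idx(uint32_t cur)
{
	return cur < CAT_MAX_IDX ? cur + 1 : 1; /* wrap, skip header slot */
}

int main(void)
{
	uint32_t cur = CAT_MAX_IDX;

	/* Prints "next after 64767 is 1": a monotonic comparison would
	 * wrongly treat this as corruption or never reuse freed slots. */
	printf("next after %u is %u\n", cur, cat_next_idx(cur));
	return 0;
}
```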
            pjones Peter Jones added a comment -

            Bruno

            Can you please confirm whether this does indeed meet the profile of LU-6471?

            Thanks

            Peter


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: apercher Antoine Percher
              Votes: 0
              Watchers: 18
