[LU-6556] changelog catalog corruption if all possible records are defined Created: 02/May/15 Updated: 07/Jan/16 Resolved: 28/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Antoine Percher | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
redhat kernel 2.6.32_504.8.1.el6 |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After our last Lustre upgrade, on the tera100 and tgcc sites, some
| Comments |
| Comment by Peter Jones [ 02/May/15 ] |
|
Bruno, can you please confirm whether this does indeed meet the profile of [...]? Thanks, Peter
| Comment by Bruno Faccini (Inactive) [ 02/May/15 ] |
|
Assigning to myself since I worked on this issue while on-site with Antoine. In fact, after our joint analysis, it seems that the crash occurred some time after the upgrade and reboot of all nodes because of two combined things: Will push a patch soon to master, where the problem also seems to be present (but still undetected, because triggering it requires a looped-back Catalog situation, which should not occur very frequently ...).
| Comment by Bruno Faccini (Inactive) [ 02/May/15 ] |
|
Peter, no in this particular case the similar LBUG than for |
| Comment by Antoine Percher [ 11/May/15 ] |
|
Please find attached the customer trace analysis file (log_lu-6556_b.txt).
| Comment by Gerrit Updater [ 21/May/15 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14912 |
| Comment by Bruno Faccini (Inactive) [ 23/May/15 ] |
|
I have created |
| Comment by Andreas Dilger [ 14/Sep/15 ] |
|
I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (by default), or block all MDS operations until the ChangeLog can be written (if /proc tunable is set to make ChangeLog updates mandatory). It should unregister starting with the oldest watcher on the assumption that the older watcher was forgotten and newer ones are still running, and that this will release the most space. The unregistration should cancel records up to the next watcher, or all remaining records if no other watchers are left. |
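The eviction policy proposed in this comment can be sketched as follows. This is only an illustrative model, not Lustre code: the `Watcher` and `ChangeLog` classes, the `capacity` limit, the `mandatory` flag (standing in for the /proc tunable) and the `ChangelogFull` exception are all hypothetical names. "Oldest" is assumed to mean earliest registered.

```python
from dataclasses import dataclass, field


class ChangelogFull(Exception):
    """Hypothetical error: the log is full and updates are mandatory."""


@dataclass
class Watcher:
    name: str           # hypothetical identifier
    registered_at: int  # registration order; smaller means older
    cleared_upto: int   # highest record index this watcher has acknowledged


@dataclass
class ChangeLog:
    capacity: int                  # assumed to be at least 1
    first_index: int = 1           # index of the oldest retained record
    next_index: int = 1            # index the next record will receive
    watchers: list = field(default_factory=list)

    def used(self):
        return self.next_index - self.first_index

    def purge(self):
        """Cancel records that no remaining watcher still needs."""
        if self.watchers:
            keep_from = min(w.cleared_upto for w in self.watchers) + 1
        else:
            keep_from = self.next_index  # no watchers left: cancel everything
        self.first_index = max(self.first_index,
                               min(keep_from, self.next_index))

    def unregister_oldest(self):
        """Drop the earliest-registered watcher, then cancel up to the next one."""
        oldest = min(self.watchers, key=lambda w: w.registered_at)
        self.watchers.remove(oldest)
        self.purge()
        return oldest.name

    def append(self, mandatory=False):
        """Add one record; on a full log either evict watchers or refuse."""
        dropped = []
        self.purge()
        while self.used() >= self.capacity and self.watchers:
            if mandatory:
                # Stand-in for "block all MDS operations until space is free".
                raise ChangelogFull("changelog full; operation must wait")
            dropped.append(self.unregister_oldest())
        index = self.next_index
        self.next_index += 1
        return index, dropped
```

In this model the caller learns which watchers were dropped from the second element of the value returned by `append`, which is where a real implementation would notify or log the unregistration.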
| Comment by Robert Read (Inactive) [ 14/Sep/15 ] |
|
I suggest going a step further and proactively removing stale watchers after a configurable period or when hitting a max watermark to try to avoid running out of space. Also, being unregistered is a reasonable notification to the application that it has lost its changelog feed and needs to resync.
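A minimal sketch of how such a proactive check might select watchers to remove is shown below. The names `WatcherState`, `select_stale`, `max_idle` and `high_watermark` are hypothetical. Treating the least recently active watcher as the one to drop at the watermark is an assumption made here for illustration, not something stated in the ticket.

```python
from dataclasses import dataclass


@dataclass
class WatcherState:
    name: str           # hypothetical identifier
    last_active: float  # time (seconds) the watcher last acknowledged records


def select_stale(watchers, now, max_idle, used, capacity, high_watermark=0.9):
    """Return the names of watchers that should be unregistered.

    A watcher is stale if it has been idle longer than max_idle. If none is
    stale but usage has reached the watermark, the least recently active
    watcher is chosen (an assumed tie-breaking policy).
    """
    stale = [w.name for w in watchers if now - w.last_active > max_idle]
    if not stale and watchers and used >= high_watermark * capacity:
        least_recent = min(watchers, key=lambda w: w.last_active)
        stale.append(least_recent.name)
    return stale
```

A periodic task could call `select_stale` and unregister the returned watchers. As noted in the comment, the unregistration itself then tells the application to resync.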
| Comment by Gerrit Updater [ 17/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14912/ |
| Comment by Bruno Faccini (Inactive) [ 23/Oct/15 ] |
|
Andreas, Robert, |
| Comment by Andreas Dilger [ 25/Oct/15 ] |
|
Bruno, that is fine. Please file a separate bug and copy over relevant comments before closing this one, so that they are not forgotten. |
| Comment by Bruno Faccini (Inactive) [ 26/Oct/15 ] |
|
|
| Comment by Joseph Gmitter (Inactive) [ 28/Oct/15 ] |
|
Landed to 2.8 |
| Comment by Andreas Dilger [ 10/Dec/15 ] |
|
Backports of the http://review.whamcloud.com/14912 patch also need the patch http://review.whamcloud.com/17052 "