[LU-3110] Disable osd declaration tracking for 2.4 release Created: 04/Apr/13 Updated: 12/Sep/13 Resolved: 13/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 7560 | ||||||||
| Description |
|
Currently OSD code is very strict about op declaration matching actual number of operations and crashes if they don't match (e.g. It seems there is a compile-time switch OSD_TRACK_DECLARES, but I think it would be even better to make it into a proc variable, disabled by default, that we then can turn on at runtime during e.g. testing (to better highlight the issues), but actual customers don't get any crashes. |
| Comments |
| Comment by Peter Jones [ 04/Apr/13 ] |
|
Bruno Could you please take care of this one? Thanks Peter |
| Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ] |
|
Sure, I already fight against OSD_TRACK_DECLARE as part of |
| Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ] |
|
Oleg, |
| Comment by Oleg Drokin [ 09/Apr/13 ] |
|
Well, basically check any code that does a binary switch like sync_journal variable. Then in the place that does assertion on op credit mismatch, convert assertion to CERROR, and then if (crash_on_op_mismatch) LBUG(); |
| Comment by Andreas Dilger [ 09/Apr/13 ] |
|
There is already a check in lustre/osd-ldiskfs/osd_internal.h to disable this code after 2.3.90: #if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 3, 90, 0)
# define OSD_TRACK_DECLARES
#endif
It would still be nice to be able to enable this at runtime, but it also has some extra memory overhead per transaction, so we might consider moving the tracking structure into a separate allocation only when this tracking is enabled. Bruno, in the short term, can you verify that changing the build version to 2.3.90 in lustre/autoconf/lustre_version.ac causes this code to actually be disabled and does not cause other problems? |
| Comment by Bruno Faccini (Inactive) [ 10/Apr/13 ] |
|
Andreas, during I am in the process to double-check, as you requested. On the other hand, I am working on a patch to allow disable/enable of transaction ops tracking, and will try to integrate the dynamic allocation you pointed. |
| Comment by Bruno Faccini (Inactive) [ 10/Apr/13 ] |
|
Humm, no way, trying to undefine OSD_TRACK_DECLARES by setting build version to 2.3.90 causes further build to fail when it triggers the following sequence in lustre/obdclass/local_storage.c : 720 #if LUSTRE_VERSION_CODE >= OBD_OCD_VERSION(2, 3, 90, 0) 721 #error "fix this before release" 722 #endif thus if we want to build with OSD_TRACK_DECLARES undefined, actually the best way seems to delete the following lines in lustre/osd-ldiskfs/osd_internal.h : 322 #if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 3, 90, 0) 323 # define OSD_TRACK_DECLARES 324 #endif as we did with Minh during our debugging work for |
| Comment by Bruno Faccini (Inactive) [ 10/Apr/13 ] |
|
BTW, patch is under local-testing now, hope to be able to push it soon. |
| Comment by Andreas Dilger [ 10/Apr/13 ] |
|
I'm going to file a separate bug for the LUSTRE_VERSION_CODE check. |
| Comment by Andreas Dilger [ 10/Apr/13 ] |
|
Filed |
| Comment by Bruno Faccini (Inactive) [ 11/Apr/13 ] |
|
Patch for master pushed at http://review.whamcloud.com/6032. I tried to implement the dynamic way you wanted Andreas. So basically here is what I did : _ removed compile-time OSD_TRACK_DECLARES way. I did local intensive tests/checks of the patch running racer and concurrent enable/disable of tracking, also verified no mem-leak was introduced. Thank's to Johann to discuss with me on possible implementation ways and have refreshed me on the transaction's life-cycle !!.. This ticket+patch is a direct follow-on of |
| Comment by Bruno Faccini (Inactive) [ 22/Apr/13 ] |
|
Patch-set #3 tries to implement more strictly what you requested, and also tries to take care of reviewers comments : _ again use of old define/compile-time OSD_TRACK_DECLARES is fully removed. Again I did local+intensive testing of the patch in normal conditions, will try to inject some code to induce tracking declares overflow/disfunction, and see what happen ... |
| Comment by Bruno Faccini (Inactive) [ 24/Apr/13 ] |
|
Since I am not really proud with my main mistake in patch-set #3 ..., I am currently in the process to better check/test the new patch-set #4 where I try to implement all my reviewers very constructive and helpful comments. |
| Comment by Bruno Faccini (Inactive) [ 25/Apr/13 ] |
|
Here is patch-set #4, I re-wrote all conditional statements in osd_internal.h according to reviewers advices, only assuming in addition that LASSERTs are no-return (ie, panic() or schedule() for ever, depending on panic_on_lbug). The env variable add in test-famework.sh/check_and_setup_lustre() may also need some filtering (MDSs/OSSs devices of ldiskfs type only ...), but we can assume that people using this know what they want to do. I also did some intensive testing error-injection allowing either CWARNs or LBUGs to occur ... |
| Comment by Bruno Faccini (Inactive) [ 26/Apr/13 ] |
|
I am stuck with the fact that checkpatch.pl complains an "ERROR: do not initialise globals to 0 or NULL" when I initialize my global ldiskfs_track_declares_assert to the 0/OFF value during its declaration, because to initialize it later conflicts with its also possible setting as a module-parameter. This comes from changes in checkpatch.pl from commit 0b554ff2/I26fc5a5c for I may try to revert the meaning of my flag to something like ldiskfs_track_declares_noassert and be able to initialize it to 1, and then change allrelated tests, but is it really a rule we need to follow ?? |
| Comment by Bruno Faccini (Inactive) [ 26/Apr/13 ] |
|
Ok, I did it finally ... So patch-set #5 just submitted with a reverted flag ( [ldiskfs_]track_declares_noassert) to make checkpatch.pl and Kernel rules happy !! BTW, I submitted it without exposure to my own/local testing and rely on reviewers to check the reverted logic ;-} |
| Comment by Andreas Dilger [ 26/Apr/13 ] |
|
Bruno, just for future reference, the reason that checkpatch complains about initializing the parameter to zero is because all global variables are automatically initialized to zero already. If they are explicitly initialized to zero, then they make the binary larger (they are stored in a special code segment), while the automatically-initialized-to-zero variables do not need this. |
| Comment by Bruno Faccini (Inactive) [ 27/Apr/13 ] |
|
Andreas, thanks for the explanation. In fact, when trying to understand, I found some contradictory (seems it is an old subject/debate) infos about this and thus was unsure if I can assume the automatic zero initialization or not. Thus what should I do keep the current and reverted logic or back to the original and assume zero initialisation ?? |
| Comment by Andreas Dilger [ 27/Apr/13 ] |
|
I found the new inverted logic to be more confusing to read, and hasn't read the comments here about why it was being done this way. My preference would be to go back to the original "default zero" logic. It is definitely good to keep this checking enabled in the test framework, but default to off for regular users. Please also fix the /proc save/restore problem I mentioned. It would be fine to just use "0" and "1" for the output if you don't have time to add support to the write handler to accept "on" and "off" in addition to "1" and "0". |
| Comment by Bruno Faccini (Inactive) [ 28/Apr/13 ] |
|
Patch-set #6 just submitted, back to original logic (and zero initialization of globals assumption!), with cosmetic fixes to comply with Andreas last comments. |
| Comment by Bruno Faccini (Inactive) [ 29/Apr/13 ] |
|
Patch-set #7 submitted with print of stack-trace upon warning. I did not put some rate-limiting code since it is assumed that these tracking of declares errors are very unlikely to occur. |
| Comment by Andreas Dilger [ 29/Apr/13 ] |
|
Patch set #7 is going to be landed, so this bug is no longer a blocker. However, another patch should be submitted so the checking is enabled by default during testing, and only for osd-ldiskfs. |
| Comment by Bruno Faccini (Inactive) [ 02/May/13 ] |
|
A follow-on patch will implement latest comments from Andreas with the following add-ons : _ set OSD_TRACK_DECLARES_LBUG by default in the test-framework.sh init_test_env() so that this is always being tested. _ in test-famework.sh/check_and_setup_lustre(), set the track_declares_assert tunable only be set for osd-ldiskfs, and not osd-*, since this functionality is not present in osd-zfs. This will need some filtering (MDSs/OSSs devices of ldiskfs type only ...) to be implemented to prevent warnings. |
| Comment by Bruno Faccini (Inactive) [ 07/May/13 ] |
|
Follow-on patch just pushed at http://review.whamcloud.com/6280. |
| Comment by Bruno Faccini (Inactive) [ 07/May/13 ] |
|
Oops sorry, I did not expose 1st patch version of http://review.whamcloud.com/6280 change to a multi-node configuration, and thus I missed the fact the nodes-list I provided to do_nodes was without commas. |
| Comment by Andreas Dilger [ 13/Jun/13 ] |
|
The test-framework change was landed for 2.5.0 to enable track_declares_assert. However, this caused |