[LU-4023] wrong type used for ldlm flag - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Labels:
None
Environment:
clang 3.4

Severity:
3
Rank (Obsolete):
10815

Description

While porting lustre to macos i found a some strange behavior - likely wrong type used for variable.

osc_request.c:2503:41: error: implicit conversion from 'unsigned long long' to 'int' changes value from 2199023255552 to 0 [-Werror,-Wconstant-conversion]
        int match_lvb = (agl != 0 ? 0 : LDLM_FL_LVB_READY);
            ~~~~~~~~~                   ^~~~~~~~~~~~~~~~~
/Users/shadow/work/lustre/work/WC-review/mac-os/lustre/include/lustre_dlm_flags.h:244:41: note: expanded from macro 'LDLM_FL_LVB_READY'
#define LDLM_FL_LVB_READY               0x0000020000000000ULL // bit  41

so if we assign 64bit data to 32bit variable - we will be have zero always and that logic newer work.

Attachments

Activity

[LU-4023] wrong type used for ldlm flag

Andreas Dilger added a comment - 21/Oct/13 9:58 PM

I don't think it is a big deal either way, but the reason I point at the b58a810 commit ("LU-1684 ldlm: move ldlm flags not sent through wire to upper 32bits") is because it moved the LDLM_FL_LVB_READY value from fitting in an int to not fitting in an int, which is what affected osc_enqueue_base():

-#define LDLM_FL_LVB_READY      0x100000
+#define LDLM_FL_LVB_READY      0x20000000000ULL

It still isn't clear what effect this had (if any?) on the behaviour of osc_enqueue_base(). I might guess there was some kind of race that might lead "ls -l" and statahead requests from returning data from OST extent locks that hadn't actually been granted yet (because ldlm_lock_match() was not sleeping until LVB_READY was set on the lock)? In any case, I'm glad that this problem was found and fixed.

Shadow, if you think there is still a problem in osc_enqueue_base->ldlm_lock_match() because of this change back to the pre-2.3.64 behaviour (which has existed since v2_1_53_0-40-g6f5813d then please file a separate bug with details on how the current code is broken. I agree that it is possible that something broke by fixing osc_enqueue_base() to pass LVB_READY down to ldlm_lock_match(), but a quick look doesn't show anything obvious.

Andreas Dilger added a comment - 21/Oct/13 9:58 PM I don't think it is a big deal either way, but the reason I point at the b58a810 commit (" LU-1684 ldlm: move ldlm flags not sent through wire to upper 32bits ") is because it moved the LDLM_FL_LVB_READY value from fitting in an int to not fitting in an int, which is what affected osc_enqueue_base(): -#define LDLM_FL_LVB_READY 0x100000 +#define LDLM_FL_LVB_READY 0x20000000000ULL It still isn't clear what effect this had (if any?) on the behaviour of osc_enqueue_base(). I might guess there was some kind of race that might lead "ls -l" and statahead requests from returning data from OST extent locks that hadn't actually been granted yet (because ldlm_lock_match() was not sleeping until LVB_READY was set on the lock)? In any case, I'm glad that this problem was found and fixed. Shadow, if you think there is still a problem in osc_enqueue_base->ldlm_lock_match() because of this change back to the pre-2.3.64 behaviour (which has existed since v2_1_53_0-40-g6f5813d then please file a separate bug with details on how the current code is broken. I agree that it is possible that something broke by fixing osc_enqueue_base() to pass LVB_READY down to ldlm_lock_match(), but a quick look doesn't show anything obvious.

Vitaly Fertman added a comment - 14/Oct/13 12:15 PM

Andreas Dilger added a comment - 01/Oct/13 4:28 PM
> This code has been broken since Vitaly landed the patch b58a810 in 2.3.54.
> It isn't clear why we didn't notice this failure,

probably just because it WAS NOT broken since Vitaly landed that patch and the failure was introduced later by Intel

Vitaly Fertman added a comment - 14/Oct/13 12:15 PM Andreas Dilger added a comment - 01/Oct/13 4:28 PM > This code has been broken since Vitaly landed the patch b58a810 in 2.3.54. > It isn't clear why we didn't notice this failure, probably just because it WAS NOT broken since Vitaly landed that patch and the failure was introduced later by Intel

Andreas Dilger added a comment - 04/Oct/13 8:16 PM

I think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers.

In my opinion (I can't speak for Dmitry) your question was more of a feature request, and not an objection to the patch. I agree that replying to inspector comments is important for everyone, regardless of where they are working. Maybe at another time this patch would have taken longer to be inspected and landed, but we are trying to finish all of the blocker bugs with enough time to test the code enough before release. I also agree that it makes sense for you to file a new bug with your feature request and attach the patch to that bug, so that this one can be closed for 2.5.0.

i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see.

Nowhere in your comments did you mention testing the patch with clang, and I can't read your mind. I'm definitely interested in static code analysis (see ~~LU-871~~ and ~~LU-2675~~ regarding clang, and many patches from Bull landed under ~~LU-2753~~ from Coverity). I will be happy to land any patches you submit for defects you find with clang or sparse or other static analysis tools.

i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results.

This is something that happens fairly regularly with our own inspectors - someone marks a patch -1 if they think there is an issue, and the change it to +1 if their issue is explained. It would be even better if some code is unclear to add a comment in the code to explain it rather than just a note in Gerrit. I think it is untrue to say that nobody is interested with review results just because of this minor issue. I would be happy to have more review from Xyratex developers, even on patches that are not their own. Reviewing patches is as much work as developing them in the first place, and if more people review a patch we can hope that they will find more bugs.

that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match?

I agree, though the code was (presumably) working before Vitaly's patch http://review.whamcloud.com/3494 broke it and we can hope it is working again. I'm happy to have your analysis of what was broken by that patch, and a new test that exercises this code better.

To be honest, it makes me wonder what was broken/fixed with this change? Something related to async locking for statahead, but I wonder how it would be visible to userspace? Would it just mean that statahead needs to enqueue the lock again? Is a race when accessing lock LVB data? Is this code doing nothing useful and can be deleted?

Andreas Dilger added a comment - 04/Oct/13 8:16 PM I think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers. In my opinion (I can't speak for Dmitry) your question was more of a feature request, and not an objection to the patch. I agree that replying to inspector comments is important for everyone, regardless of where they are working. Maybe at another time this patch would have taken longer to be inspected and landed, but we are trying to finish all of the blocker bugs with enough time to test the code enough before release. I also agree that it makes sense for you to file a new bug with your feature request and attach the patch to that bug, so that this one can be closed for 2.5.0. i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see. Nowhere in your comments did you mention testing the patch with clang, and I can't read your mind. I'm definitely interested in static code analysis (see LU-871 and LU-2675 regarding clang, and many patches from Bull landed under LU-2753 from Coverity). I will be happy to land any patches you submit for defects you find with clang or sparse or other static analysis tools. i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results. This is something that happens fairly regularly with our own inspectors - someone marks a patch -1 if they think there is an issue, and the change it to +1 if their issue is explained. It would be even better if some code is unclear to add a comment in the code to explain it rather than just a note in Gerrit. I think it is untrue to say that nobody is interested with review results just because of this minor issue. I would be happy to have more review from Xyratex developers, even on patches that are not their own. Reviewing patches is as much work as developing them in the first place, and if more people review a patch we can hope that they will find more bugs. that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match? I agree, though the code was (presumably) working before Vitaly's patch http://review.whamcloud.com/3494 broke it and we can hope it is working again. I'm happy to have your analysis of what was broken by that patch, and a new test that exercises this code better. To be honest, it makes me wonder what was broken/fixed with this change? Something related to async locking for statahead, but I wonder how it would be visible to userspace? Would it just mean that statahead needs to enqueue the lock again? Is a race when accessing lock LVB data? Is this code doing nothing useful and can be deleted?

Alexey Lyashkov added a comment - 04/Oct/13 7:12 PM

Andreas,

first, i think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers.
If Dmitry was post a reply and create a new ticket with that cleanup it's make different response from my side.
I understand about code freeze, but i don't have guaranteed we don't have other places with same bugs, as i don't able to build kernel part with clang now. It's main objection if we have such bug in two releases - we don't think it's big regression in past. so have some changes in that area?

second - i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see.

third - i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results.

as about patch - that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match?

Alexey Lyashkov added a comment - 04/Oct/13 7:12 PM Andreas, first, i think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers. If Dmitry was post a reply and create a new ticket with that cleanup it's make different response from my side. I understand about code freeze, but i don't have guaranteed we don't have other places with same bugs, as i don't able to build kernel part with clang now. It's main objection if we have such bug in two releases - we don't think it's big regression in past. so have some changes in that area? second - i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see. third - i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results. as about patch - that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match?

Andreas Dilger added a comment - 04/Oct/13 6:14 PM

Shadow, I agree that changing the LDLM flags to be an enum is a good code cleanup and will help avoid bugs like this in the future, but I don't think this should prevent the current fix from landing. That is something that should be done as a separate patch, since it will touch a huge amount of code, and that would hide the actual important fix that is in http://review.whamcloud.com/7799. The master branch is currently in code freeze, so only critical bugs are being fixed in the branch, not huge code cleanups.

As for the patch landing to master without all reviewers confirmed - that would make it impossible to land any patch. It isn't possible to remove reviewers from a patch unless they do it themselves, and while we invite many people to inspect patches it is rare that all of the reviewers review every patch (Oleg and I are the only lucky ones that have to review all of the patches). We require two inspections for any patch (at least one must be a more senior developer, as has always been documented at https://wiki.hpdd.intel.com/display/PUB/Submitting+Changes).

If you have a serious objection to the correctness of a patch, it should be marked -1. If needed it is always possible for you to change your review to +1 after discussing the issue with the developer. To me your comment was more of a feature request than an objection to the patch, especially since you didn't mark it -1. In any case it makes sense to separate out patches so each one is just fixing one thing at a time. If you are interested to change the LDLM flags to an enum, please note the comment at the top of lustre_dlm_flags.h that Bruce Korb from Xyratex has changed this to be automatically generated from ./contrib/bit-masks/lustre_dlm_flags.def, so this should be updated instead of editing lustre_dlm_flags.h directly.

Andreas Dilger added a comment - 04/Oct/13 6:14 PM Shadow, I agree that changing the LDLM flags to be an enum is a good code cleanup and will help avoid bugs like this in the future, but I don't think this should prevent the current fix from landing. That is something that should be done as a separate patch, since it will touch a huge amount of code, and that would hide the actual important fix that is in http://review.whamcloud.com/7799 . The master branch is currently in code freeze, so only critical bugs are being fixed in the branch, not huge code cleanups. As for the patch landing to master without all reviewers confirmed - that would make it impossible to land any patch. It isn't possible to remove reviewers from a patch unless they do it themselves, and while we invite many people to inspect patches it is rare that all of the reviewers review every patch (Oleg and I are the only lucky ones that have to review all of the patches). We require two inspections for any patch (at least one must be a more senior developer, as has always been documented at https://wiki.hpdd.intel.com/display/PUB/Submitting+Changes ). If you have a serious objection to the correctness of a patch, it should be marked -1. If needed it is always possible for you to change your review to +1 after discussing the issue with the developer. To me your comment was more of a feature request than an objection to the patch, especially since you didn't mark it -1. In any case it makes sense to separate out patches so each one is just fixing one thing at a time. If you are interested to change the LDLM flags to an enum, please note the comment at the top of lustre_dlm_flags.h that Bruce Korb from Xyratex has changed this to be automatically generated from ./contrib/bit-masks/lustre_dlm_flags.def, so this should be updated instead of editing lustre_dlm_flags.h directly.

Alexey Lyashkov added a comment - 04/Oct/13 2:04 PM

stop talking about blockers. we need proper patch at first place and don't need rewrite a single fix for different branches it's add many problems with supporting that code.

current patch don't fix all similar issues in future and fix just one instance. if we have lost one point in code review where 32bit -> 64bit conversion is lost - you really think we don't have similar bugs in different places? because you review a code? but as you see review isn't guaranteed to kill all bugs.

changes define to enum and use correct type for a ldlm flags avoid bugs at all, because introduce are compile time checking and allocating a property type for all variables and that is provide better code and more optimization for compiler if need.

Alexey Lyashkov added a comment - 04/Oct/13 2:04 PM stop talking about blockers. we need proper patch at first place and don't need rewrite a single fix for different branches it's add many problems with supporting that code. current patch don't fix all similar issues in future and fix just one instance. if we have lost one point in code review where 32bit -> 64bit conversion is lost - you really think we don't have similar bugs in different places? because you review a code? but as you see review isn't guaranteed to kill all bugs. changes define to enum and use correct type for a ldlm flags avoid bugs at all, because introduce are compile time checking and allocating a property type for all variables and that is provide better code and more optimization for compiler if need.

Dmitry Eremin (Inactive) added a comment - 04/Oct/13 12:07 PM

It's blocker and required immediate action for 2.5 release. So, the current patch fix the issue and then we can elaborate other solution.

Dmitry Eremin (Inactive) added a comment - 04/Oct/13 12:07 PM It's blocker and required immediate action for 2.5 release. So, the current patch fix the issue and then we can elaborate other solution.

Alexey Lyashkov added a comment - 04/Oct/13 11:02 AM - edited

I isn't really understand - it's new feature to ignore a requests one of reviewers and push patch anyway?

i put comment in gerrit before new patch test start

Alexey Lyashkov		Oct 3 2:15 PM
Patch Set 4:
may we converted #define to enum to avoid such errors in feature ?

but none response - and patch landed to master without all reviewers confirmed - that patch is OK.

PS. sorry for english - autocompletion was rewrite a future to feature.

Alexey Lyashkov added a comment - 04/Oct/13 11:02 AM - edited I isn't really understand - it's new feature to ignore a requests one of reviewers and push patch anyway? i put comment in gerrit before new patch test start Alexey Lyashkov Oct 3 2:15 PM Patch Set 4: may we converted #define to enum to avoid such errors in feature ? but none response - and patch landed to master without all reviewers confirmed - that patch is OK. PS. sorry for english - autocompletion was rewrite a future to feature.

Dmitry Eremin (Inactive) added a comment - 04/Oct/13 9:56 AM

Patch was landed to master. New patch is for b2_4 http://review.whamcloud.com/#/c/7849/

Dmitry Eremin (Inactive) added a comment - 04/Oct/13 9:56 AM Patch was landed to master. New patch is for b2_4 http://review.whamcloud.com/#/c/7849/

Andreas Dilger added a comment - 01/Oct/13 4:28 PM

This code has been broken since Vitaly landed the patch b58a810 in 2.3.54. It isn't clear why we didn't notice this failure, but it seems that the "obvious" fix is introducing some kind of breakage and we need a closer look at what is going wrong.

Andreas Dilger added a comment - 01/Oct/13 4:28 PM This code has been broken since Vitaly landed the patch b58a810 in 2.3.54. It isn't clear why we didn't notice this failure, but it seems that the "obvious" fix is introducing some kind of breakage and we need a closer look at what is going wrong.

People

Assignee:: Dmitry Eremin (Inactive)

Reporter:: Alexey Lyashkov

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 29/Sep/13 7:22 AM

Updated:: 03/Jun/14 12:42 PM

Resolved:: 21/Oct/13 9:58 PM