[LU-4023] wrong type used for ldlm flag - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Labels:
None
Environment:
clang 3.4

Severity:
3
Rank (Obsolete):
10815

Description

While porting lustre to macos i found a some strange behavior - likely wrong type used for variable.

osc_request.c:2503:41: error: implicit conversion from 'unsigned long long' to 'int' changes value from 2199023255552 to 0 [-Werror,-Wconstant-conversion]
        int match_lvb = (agl != 0 ? 0 : LDLM_FL_LVB_READY);
            ~~~~~~~~~                   ^~~~~~~~~~~~~~~~~
/Users/shadow/work/lustre/work/WC-review/mac-os/lustre/include/lustre_dlm_flags.h:244:41: note: expanded from macro 'LDLM_FL_LVB_READY'
#define LDLM_FL_LVB_READY               0x0000020000000000ULL // bit  41

so if we assign 64bit data to 32bit variable - we will be have zero always and that logic newer work.

Attachments

Activity

[LU-4023] wrong type used for ldlm flag

Jodi Levi (Inactive) made changes - 03/Jun/14 12:42 PM

Fix Version/s

New: Lustre 2.6.0 [ 10595 ]

Peter Jones made changes - 31/Oct/13 1:14 AM

Labels

Original: mq413

Andreas Dilger made changes - 21/Oct/13 9:58 PM

Resolution		New: Fixed [ 1 ]
Status	Original: Open [ 1 ]	New: Resolved [ 5 ]

Andreas Dilger added a comment - 21/Oct/13 9:58 PM

I don't think it is a big deal either way, but the reason I point at the b58a810 commit ("LU-1684 ldlm: move ldlm flags not sent through wire to upper 32bits") is because it moved the LDLM_FL_LVB_READY value from fitting in an int to not fitting in an int, which is what affected osc_enqueue_base():

-#define LDLM_FL_LVB_READY      0x100000
+#define LDLM_FL_LVB_READY      0x20000000000ULL

It still isn't clear what effect this had (if any?) on the behaviour of osc_enqueue_base(). I might guess there was some kind of race that might lead "ls -l" and statahead requests from returning data from OST extent locks that hadn't actually been granted yet (because ldlm_lock_match() was not sleeping until LVB_READY was set on the lock)? In any case, I'm glad that this problem was found and fixed.

Shadow, if you think there is still a problem in osc_enqueue_base->ldlm_lock_match() because of this change back to the pre-2.3.64 behaviour (which has existed since v2_1_53_0-40-g6f5813d then please file a separate bug with details on how the current code is broken. I agree that it is possible that something broke by fixing osc_enqueue_base() to pass LVB_READY down to ldlm_lock_match(), but a quick look doesn't show anything obvious.

Andreas Dilger added a comment - 21/Oct/13 9:58 PM I don't think it is a big deal either way, but the reason I point at the b58a810 commit (" LU-1684 ldlm: move ldlm flags not sent through wire to upper 32bits ") is because it moved the LDLM_FL_LVB_READY value from fitting in an int to not fitting in an int, which is what affected osc_enqueue_base(): -#define LDLM_FL_LVB_READY 0x100000 +#define LDLM_FL_LVB_READY 0x20000000000ULL It still isn't clear what effect this had (if any?) on the behaviour of osc_enqueue_base(). I might guess there was some kind of race that might lead "ls -l" and statahead requests from returning data from OST extent locks that hadn't actually been granted yet (because ldlm_lock_match() was not sleeping until LVB_READY was set on the lock)? In any case, I'm glad that this problem was found and fixed. Shadow, if you think there is still a problem in osc_enqueue_base->ldlm_lock_match() because of this change back to the pre-2.3.64 behaviour (which has existed since v2_1_53_0-40-g6f5813d then please file a separate bug with details on how the current code is broken. I agree that it is possible that something broke by fixing osc_enqueue_base() to pass LVB_READY down to ldlm_lock_match(), but a quick look doesn't show anything obvious.

Andreas Dilger made changes - 21/Oct/13 8:53 PM

Fix Version/s		New: Lustre 2.5.0 [ 10295 ]
Fix Version/s	Original: Lustre 2.5.1 [ 10608 ]

Andreas Dilger made changes - 21/Oct/13 8:51 PM

Labels

New: mq413

Vitaly Fertman added a comment - 14/Oct/13 12:15 PM

Andreas Dilger added a comment - 01/Oct/13 4:28 PM
> This code has been broken since Vitaly landed the patch b58a810 in 2.3.54.
> It isn't clear why we didn't notice this failure,

probably just because it WAS NOT broken since Vitaly landed that patch and the failure was introduced later by Intel

Vitaly Fertman added a comment - 14/Oct/13 12:15 PM Andreas Dilger added a comment - 01/Oct/13 4:28 PM > This code has been broken since Vitaly landed the patch b58a810 in 2.3.54. > It isn't clear why we didn't notice this failure, probably just because it WAS NOT broken since Vitaly landed that patch and the failure was introduced later by Intel

Andreas Dilger added a comment - 04/Oct/13 8:16 PM

I think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers.

In my opinion (I can't speak for Dmitry) your question was more of a feature request, and not an objection to the patch. I agree that replying to inspector comments is important for everyone, regardless of where they are working. Maybe at another time this patch would have taken longer to be inspected and landed, but we are trying to finish all of the blocker bugs with enough time to test the code enough before release. I also agree that it makes sense for you to file a new bug with your feature request and attach the patch to that bug, so that this one can be closed for 2.5.0.

i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see.

Nowhere in your comments did you mention testing the patch with clang, and I can't read your mind. I'm definitely interested in static code analysis (see ~~LU-871~~ and ~~LU-2675~~ regarding clang, and many patches from Bull landed under ~~LU-2753~~ from Coverity). I will be happy to land any patches you submit for defects you find with clang or sparse or other static analysis tools.

i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results.

This is something that happens fairly regularly with our own inspectors - someone marks a patch -1 if they think there is an issue, and the change it to +1 if their issue is explained. It would be even better if some code is unclear to add a comment in the code to explain it rather than just a note in Gerrit. I think it is untrue to say that nobody is interested with review results just because of this minor issue. I would be happy to have more review from Xyratex developers, even on patches that are not their own. Reviewing patches is as much work as developing them in the first place, and if more people review a patch we can hope that they will find more bugs.

that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match?

I agree, though the code was (presumably) working before Vitaly's patch http://review.whamcloud.com/3494 broke it and we can hope it is working again. I'm happy to have your analysis of what was broken by that patch, and a new test that exercises this code better.

To be honest, it makes me wonder what was broken/fixed with this change? Something related to async locking for statahead, but I wonder how it would be visible to userspace? Would it just mean that statahead needs to enqueue the lock again? Is a race when accessing lock LVB data? Is this code doing nothing useful and can be deleted?

Andreas Dilger added a comment - 04/Oct/13 8:16 PM I think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers. In my opinion (I can't speak for Dmitry) your question was more of a feature request, and not an objection to the patch. I agree that replying to inspector comments is important for everyone, regardless of where they are working. Maybe at another time this patch would have taken longer to be inspected and landed, but we are trying to finish all of the blocker bugs with enough time to test the code enough before release. I also agree that it makes sense for you to file a new bug with your feature request and attach the patch to that bug, so that this one can be closed for 2.5.0. i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see. Nowhere in your comments did you mention testing the patch with clang, and I can't read your mind. I'm definitely interested in static code analysis (see LU-871 and LU-2675 regarding clang, and many patches from Bull landed under LU-2753 from Coverity). I will be happy to land any patches you submit for defects you find with clang or sparse or other static analysis tools. i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results. This is something that happens fairly regularly with our own inspectors - someone marks a patch -1 if they think there is an issue, and the change it to +1 if their issue is explained. It would be even better if some code is unclear to add a comment in the code to explain it rather than just a note in Gerrit. I think it is untrue to say that nobody is interested with review results just because of this minor issue. I would be happy to have more review from Xyratex developers, even on patches that are not their own. Reviewing patches is as much work as developing them in the first place, and if more people review a patch we can hope that they will find more bugs. that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match? I agree, though the code was (presumably) working before Vitaly's patch http://review.whamcloud.com/3494 broke it and we can hope it is working again. I'm happy to have your analysis of what was broken by that patch, and a new test that exercises this code better. To be honest, it makes me wonder what was broken/fixed with this change? Something related to async locking for statahead, but I wonder how it would be visible to userspace? Would it just mean that statahead needs to enqueue the lock again? Is a race when accessing lock LVB data? Is this code doing nothing useful and can be deleted?

Alexey Lyashkov added a comment - 04/Oct/13 7:12 PM

Andreas,

first, i think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers.
If Dmitry was post a reply and create a new ticket with that cleanup it's make different response from my side.
I understand about code freeze, but i don't have guaranteed we don't have other places with same bugs, as i don't able to build kernel part with clang now. It's main objection if we have such bug in two releases - we don't think it's big regression in past. so have some changes in that area?

second - i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see.

third - i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results.

as about patch - that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match?

Alexey Lyashkov added a comment - 04/Oct/13 7:12 PM Andreas, first, i think lack a response to inspector question - it's not a good style. you have a different opinion? if yes - please elaborate why Intel developers allowed to do it, but you strongly agains it from other developers. If Dmitry was post a reply and create a new ticket with that cleanup it's make different response from my side. I understand about code freeze, but i don't have guaranteed we don't have other places with same bugs, as i don't able to build kernel part with clang now. It's main objection if we have such bug in two releases - we don't think it's big regression in past. so have some changes in that area? second - i have plan to test patch with clang to see none breakage introduced, but none interested with verification i see. third - i know about -1 in gerrit, but i don't think i need set to -1 if i have any questions to the author, but now i will set -1 for any patch with any questions and clear only when finished a review - it's easy when see - none interested with review results. as about patch - that patch is not a small change as you think. previously we have drop working any logic related to the alg as match_lvb always zero in past. did you inspect all related code - how it work with LDLM_FL_LVB_READY passed in down call? so that patch may change flag to correct mask but introduce a new bugs due LDLM_FL_LVB_READY will passed to argument of ldlm_lock_match?

Andreas Dilger made changes - 04/Oct/13 6:14 PM

Summary

Original: wrong type used

New: wrong type used for ldlm flag

People

Assignee:: Dmitry Eremin (Inactive)

Reporter:: Alexey Lyashkov

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 29/Sep/13 7:22 AM

Updated:: 03/Jun/14 12:42 PM

Resolved:: 21/Oct/13 9:58 PM