[LU-5423] Test failure sanity-sec test_4: setgroups (2) Created: 28/Jul/14  Updated: 03/Apr/16  Resolved: 29/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 15080

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/067e60a0-091d-11e4-b800-5254006e85c2
https://testing.hpdd.intel.com/test_sets/967a9b9a-14b9-11e4-aae8-5254006e85c2
https://testing.hpdd.intel.com/test_sets/78786676-146f-11e4-88ed-5254006e85c2

The sub-test test_4 failed with the following error:

setgroups (2)

Info required for matching: sanity-sec 4



 Comments   
Comment by Nathaniel Clark [ 28/Jul/14 ]

This shows up in Maloo on 7/11/14

Comment by Andreas Dilger [ 29/Jul/14 ]

Is it possible to track this down to a single patch?

Comment by Zhenyu Xu [ 05/Aug/14 ]

commit db5abf4b2b3fadc054a4e7d0d5a6b2fd9a99023c makes this test fail.

w/o this patch, ll_revalidate_dentry() will validate the local directory dentry (lookup_flags contains LOOKUP_OPEN), and do the permission check locally.

if (lookup_flags & (LOOKUP_PARENT | LOOKUP_OPEN | LOOKUP_CREATE))
          return 1;

As the patch get rid of the line, ll_revalidate_dentry() does not validate the dentry, and llite calls ll_lookup_it(), which expectably does not carry the child's inode hence no suppgid[1], so that mds permission check fails.

+       if (lookup_flags & (LOOKUP_OPEN | LOOKUP_CREATE))
+               return 0;
Comment by Peter Jones [ 05/Aug/14 ]

Oleg

How do you propose to best address this situation?

Thanks

Peter

Comment by Nathaniel Clark [ 05/Aug/14 ]

This seems to be caused by LU-4367 http://review.whamcloud.com/11062

Patch to revert change:
http://review.whamcloud.com/11323

Comment by Oleg Drokin [ 07/Aug/14 ]

The problem here is just broken functionality, I suspect.
Even with the patch reverted the test should fail if the dentry to be revalidated is not there - a hole in the test logic.

If we always revalidate for open - we lose our control of opencache on the other hand. The permission check really needs to be able to work on mds side somehow.

Whenever we disable this test as broken meanwhile or revert the patch - I don't have a strong opinion either way as long as the underlying issue is fixed in the end.

Comment by Oleg Drokin [ 28/Aug/14 ]

I landed a temporary patch to disable this test meanwhile: http://review.whamcloud.com/11631

Comment by Jodi Levi (Inactive) [ 23/Sep/14 ]

Fan Yong,
Are you able to fix this test so that it can be re-enabled?
Thank you!

Comment by nasf (Inactive) [ 27/Sep/14 ]

There is long store about Lustre user check permission. Generally:
1) Lustre requires client and MDS to share the same user/group database.

2) MDT gets the RPC user information from the RPC header. In theory, only fsuid is enough, MDT can get other fsgid, suppgid information according to the given fsuid via identity_upcall if 1) is guaranteed. But because of history features (such as remote client, and so on) reason, we allow the administrator to specify the upcall handler, and even disable the upcall, although it is NOT the recommended.

3) If the identity_upcall is disabled, then the client needs to transfer fsuid/fsgid/suppgid information via RPC header directly. Because of the RPC header is space limited, it is impossible to transfer all suppgid information from client to MDS, then the client only transfers the needed (for permission check) suppgid to MDS.

For most of operations, case 3) works, because the client knows which suppgid may be used (according to the target object's attribute), but it based on guessing, because if there is chown/chgrp from others by race, the client given suppgid may become useless. More further, if the client does not know which suppgid may be used, such as the case of this failure, then the client may get permission deny. It is a known defect, but not easy to be resolved because of our current RPC/permission framework. But disable identity_upcall is NOT recommended Lustre configure, and NOT normal Lustre usage, so I do not think this ticket should be a Lustre-2.7 blocker.

Comment by nasf (Inactive) [ 21/Oct/14 ]

I am thinking that whether we should allow the admin to disable the identity_upcall on MDS or not. Currently, it seems not a good idea. It will bring some security risks to the whole system since client is non-trustable, and also causes some permission issues like this one. I am not sure whether there are some users who really use such feature in their product system. If nobody care about that, why not forbid the admin to do that?

Andreas, what's your suggestion about that?

Comment by Andreas Dilger [ 24/Oct/14 ]

I think we definitely still need to allow the identity_upcall to be replaced. I don't know that we need to be able to turn it off.

Maybe it makes sense to add a console warning if the upcall is set to NONE, referring to this bug, and if nobody complains in a few releases we can remove the ability to disable the upcall entirely at some later time.

Comment by nasf (Inactive) [ 29/Oct/14 ]

According to our current security framework, all permission check will be done on MDS side, in spite of whether client has checked or not. So the key issue for this ticket is how to pack the supplementary group IDs from client to MDT. Before linux-2.6.4, the system allow at most 32 supplementary group IDs, such number has been increased to 65536 since linux-2.6.4. So it is impossible to pack all the supplementary group IDs from client to MDS within current Lustre RPC framework:
1) There is not enough space in RPC for that.
2) compared with the UID/GIG, supplementary group are seldom used in the permission checking. Transferring a lot of seldom used information is much overhead.

To resolve current trouble, a workable but not perfect solution is that: when the llite get -EACCES failure under above case, we can ask the llite to get_attr_by_name on the target object firstly, then pack the suppgid according to the target inode->i_gid to try open again.

Comment by nasf (Inactive) [ 29/Oct/14 ]

http://review.whamcloud.com/12476

Comment by Gerrit Updater [ 18/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12476/
Subject: LU-5423 llite: pack suppgid to MDS correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2bc5bcb7efa247fcd8cc65d013ffc9f6c33dd788

Comment by Andreas Dilger [ 05/Feb/15 ]

So, it seems that sanity-sec test_4 is in the ALWAYS_EXCEPT list in the test framework (envdefinitions=SANITY_SEC_EXCEPT="4") so this test was never run for the patch that was landed:

https://testing.hpdd.intel.com/test_sets/7f4b4654-8070-11e4-ac36-5254006e85c2

and is skipped in all currently running tests.

I've submitted another patch to add test 4 to sanity-sec.sh ALWAYS_EXCEPT list. Once this is landed, please submit a new TEI ticket to remove the SANITY_SEC_EXCEPT="4" exception in the test environment (link new TEI ticket to TEI-2447 which was filed to disable the test), then remove this exception again along with a Test-Parameters: line that runs sanity-sec several times to verify it is working.

Comment by Andreas Dilger [ 05/Feb/15 ]

Patch to exclude test_4 again: http://review.whamcloud.com/13657

Comment by nasf (Inactive) [ 05/Feb/15 ]

Why not remove SANITY_SEC_EXCEPT="4" from the auto-test configs directly? I have verified "ONLY=4 sh sanity-sec.sh", it works well.

Comment by Andreas Dilger [ 06/Feb/15 ]

The reason not to immediately remove test 4 from SANITY_SEC_EXCEPT in the test framework is in case it has broken since the last time it was tested. This test hasn't been run in several months, since it was disabled in August, and even the fix patch wasn't actually running this test. Even though you tested it on your own, it may behave differently in our test system, or it may have broken again since your patch was last tested.

By pushing the patch to add test 4 to ALWAYS_EXCEPT in sanity-sec first then we could have safely removed the SANITY_SEC_EXCEPT a week later, and then run a few sanity-sec tests on master to verify it was still working properly before it was re-enabled for everyone. We've already had a series of other recent test problems that interrupted testing for everyone, and since it wasn't very important to enable this test immediately (it has already off for 5 months) it would have been better to do things in the safe manner.

Now that the patch to re-enable the test has already landed to autotest, we can only wait and see if there will be test failures or not, and possibly have to revert that change again after other people's patches have started failing. Let's hope it works out ok.

Comment by Andreas Dilger [ 12/Feb/15 ]

I see that there were 6 test failures in sanity-sec test_4 in the past week due to "setgroups (2)", so unless they are all caused by patches being tested on a branch that does not have this fix, it is likely that this problem was not fixed by the original patch and it was just never noticed because this test was not being run:
https://testing.hpdd.intel.com/sub_tests/cf8eea16-aec5-11e4-bd9a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/5dd8fad8-b1f4-11e4-bf90-5254006e85c2
https://testing.hpdd.intel.com/test_sets/4f314698-b0f0-11e4-865f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/95b93f46-affe-11e4-971a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c81c005e-afc7-11e4-971a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f60cecf4-af53-11e4-91ac-5254006e85c2

It appears all of these failures are on "full" runs on master.

Comment by Gerrit Updater [ 16/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13778
Subject: LU-5423 mdt: debug patch for open permission deny
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c8a5c49cad07b9da549852ccdb52dc581bc427a6

Comment by nasf (Inactive) [ 29/Apr/15 ]

Yujian, is there any new full test result after your back-ported patch landed?

Comment by nasf (Inactive) [ 29/Apr/15 ]

After patch back-ported, s-s test_4 passed for full test.

Comment by Gerrit Updater [ 05/Jun/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/15153
Subject: LU-5423 tests: add version check to sanity-sec test 4
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bce4159f675cf226562234a1c774b1b1bd2573cc

Comment by Gerrit Updater [ 19/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15153/
Subject: LU-5423 tests: add version check to sanity-sec test 4
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c4440c3d40caa8b5c7ef750b9516255e9efe109e

Generated at Sat Feb 10 01:51:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.