LU-8813

Kerberos: sanity and sanity-krb5 test suites fail on non-root user trying to touch file

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.9.0

    Description

      I’m having problems running some test suites with Kerberos enabled. For example, when I run sanity.sh on 2.8.60 plus the patch http://review.whamcloud.com/#/c/23600/, the test fails with

      # ./auster -v -k sanity --only 0a
      Started at Tue Nov  8 14:39:14 UTC 2016
      eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
      Checking servers environments
      Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
      Logging to local directory: /tmp/test_logs/2016-11-08/143914
      Client: Lustre version: 2.8.60_1_g35d09c7
      MDS: Lustre version: 2.8.60_1_g35d09c7
      OSS: Lustre version: 2.8.60_1_g35d09c7
      running: sanity ONLY=0a 
      run_suite sanity /usr/lib64/lustre/tests/sanity.sh
      -----============= acceptance-small: sanity ============----- Tue Nov  8 14:39:21 UTC 2016
      Running: bash /usr/lib64/lustre/tests/sanity.sh
      eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
      Checking servers environments
      Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
      Using TIMEOUT=20
      disable quota as required
      osd-ldiskfs.track_declares_assert=1
      osd-ldiskfs.track_declares_assert=1
      debug=-1
      running as uid/gid/euid/egid 500/500/500/500, groups:
       [touch] [/lustre/scratch/d0_runas_test/f7025]
      touch: cannot touch `/lustre/scratch/d0_runas_test/f7025': Permission denied
       sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
              Please set RUNAS_ID to some UID which exists on MDS and client or
              add user 500:500 on these nodes. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5670:check_runas_id()
        = /usr/lib64/lustre/tests/sanity.sh:126:main()
      Dumping lctl log to /tmp/test_logs/2016-11-08/143914/sanity..*.1478615970.log
      eagle-48vm1: Host key verification failed.
      eagle-48vm1: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      eagle-48vm1: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
      pdsh@eagle-48vm6: eagle-48vm1: ssh exited with exit code 12
      eagle-48vm2: Host key verification failed.
      eagle-48vm2: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      eagle-48vm2: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
      pdsh@eagle-48vm6: eagle-48vm2: ssh exited with exit code 12
      sanity returned 0
      Finished at Tue Nov  8 14:39:31 UTC 2016 in 17s
      ./auster: completed with rc 0
      

      The code that is failing in sanity.sh is

      # $RUNAS_ID may get set incorrectly somewhere else
      [ $UID -eq 0 -a $RUNAS_ID -eq 0 ] && error "\$RUNAS_ID set to 0, but \$UID is also 0!"
      
      check_runas_id $RUNAS_ID $RUNAS_GID $RUNAS
      

      UID/GID 500 belongs to sanityusr, who requested a Kerberos ticket before sanity.sh was run:

      # su sanityusr
      bash-4.1$ klist
      Ticket cache: FILE:/tmp/krb5cc_500
      Default principal: sanityusr@CO.CFS
      
      Valid starting     Expires            Service principal
      11/08/16 14:38:48  11/09/16 14:38:48  krbtgt/CO.CFS@CO.CFS
      

      Note that CO.CFS is the realm being used.

      Since Lustre is not failing, it’s not surprising that there is nothing of interest in dmesg. For example, from the MGS/MDS:

      Lustre: DEBUG MARKER: -----============= acceptance-small: sanity ============----- Tue Nov 8 14:39:21 UTC 2016
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
      

      Logs for this run are at https://testing.hpdd.intel.com/test_sets/b87e2568-a5f8-11e6-964e-5254006e85c2.
      More logs will be attached to this ticket.

      The last tag at which I tested Kerberos and the above tests ran was 2.8.54. Since then, some flags have been added to lsvcgssd. I tried starting lsvcgssd two different ways: the way that worked in 2.8.54, '/usr/sbin/lsvcgssd' on all Lustre servers, and the new recommended way, '/usr/sbin/lsvcgssd -m -g -k -vvv' on the MGS/MDS and '/usr/sbin/lsvcgssd -o -k -vvv' on the OSS (verbosity is optional). All tests were run with RHEL 6.8 (for some reason, Maloo reports it as el6.7).
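The two "new recommended" invocations quoted above can be captured in a small helper so that each node starts the daemon with the flags for its role. This is only a sketch: the function name `lsvcgssd_cmd` and the role labels `mgs-mds` and `oss` are hypothetical, and the flags are copied from this ticket rather than checked against the lsvcgssd man page. The function prints the command line and does not run it.

```shell
# Print the lsvcgssd command line for a server role, using the flags
# quoted in this ticket. Role names are illustrative, not Lustre terms.
lsvcgssd_cmd() {
    case "$1" in
        mgs-mds) echo "/usr/sbin/lsvcgssd -m -g -k" ;;   # combined MGS/MDS node
        oss)     echo "/usr/sbin/lsvcgssd -o -k" ;;      # OSS node
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

Appending `-vvv` to the printed command adds the optional verbosity mentioned above.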

      Confirmation that others are or are not experiencing this problem with these test suites and Kerberos would be helpful.

          Activity

            adilger Andreas Dilger added a comment - - edited

            James, a few related issues here.

            • Is there an update to the manual and the man pages for the changes to the required arguments?
            • Does the Kerberos code still work without the new arguments? That would be desirable, since it avoids the need to change the daemon startup together with doing an upgrade/downgrade.
            • It would be nice to add a "-d|--debug" option to l_getidentity, like there was in the 1.8 l_getgroups, to run in debug mode instead of having to set L_GETIDENTITY_TEST=1 first, which is really awkward. The "-d" option was used instead of having to specify the mdtname argument when testing from userspace.

            It looks like this problem was introduced by patch http://review.whamcloud.com/19789 "LU-6971 cleanup: not support remote client anymore". It seems we need to add back some compatibility to ignore the "rmtacl" line in parse_perm_line().

            yong.fan nasf (Inactive) added a comment -

            James,

            How is /etc/lustre/perm.conf generated? Would you please remove the lines with "rmtacl" from that configuration? Thanks!
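One way to try the suggested workaround without editing the file by hand is to filter it. The sketch below assumes each perm.conf line has the form "NID UID PERMS", as in the listing from this ticket, and additionally assumes PERMS may be a comma-separated list and that lines starting with "#" are comments; neither assumption is confirmed here. The function name `strip_rmtacl` is hypothetical. It writes the filtered configuration to stdout and leaves the input file untouched.

```shell
# Drop the "rmtacl" permission from a perm.conf-style file given as $1.
# Lines left with no permissions are removed; blank and "#" lines pass through.
strip_rmtacl() {
    awk '
    /^[ \t]*(#|$)/ { print; next }
    {
        n = split($3, perms, ",")
        kept = ""
        for (i = 1; i <= n; i++) {
            if (perms[i] != "rmtacl" && perms[i] != "")
                kept = (kept == "" ? perms[i] : kept "," perms[i])
        }
        if (kept != "")
            print $1, $2, kept
    }' "$1"
}
```

For example, `strip_rmtacl /etc/lustre/perm.conf > /tmp/perm.conf.new` produces a candidate file that can be reviewed before it replaces the original on the MDS.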
            pjones Peter Jones added a comment -

            Fan Yong

            Could you please look into this issue?

            Thanks

            Peter

            jamesanunez James Nunez (Inactive) added a comment - - edited

            Looks like http://review.whamcloud.com/19789 removed 'rmtacl' from the list of known/allowed perm_types, which is causing l_getidentity to fail. Seems like 'rmtacl' is still used:

            # more /etc/lustre/perm.conf 
            * 500 rmtacl
            * 501 rmtacl
            * 0 setgid
            
            jamesanunez James Nunez (Inactive) added a comment -

            Thanks for the comment, Andrew. It looks like the identity upcall is not working. On the MDS, for master, I get:

            # L_GETIDENTITY_TEST=1 l_getidentity scratch-MDT0000 500
            unkown type: rmtacl
            l_getidentity[14812]: invalid perm rmtacl
            l_getidentity[14812]: parse line * 500 rmtacl
             failed!
            l_getidentity[14812]: failed to get identity for uid 500: Invalid argument
            

            For 2.8.54, I see:

            # L_GETIDENTITY_TEST=1 l_getidentity scratch-MDT0000 500
            uid=500 gid=500
            permissions:
              nid			perm
              0xffffffffffffffff	0x8

            panda Andrew Perepechko added a comment -

            Based on the logs, I would first check if the identity upcall works fine, e.g.
            mds # L_GETIDENTITY_TEST=1 l_getidentity scratch-MDT0000 500
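The upcall check can be scripted by scanning the command's output for the failure messages reported in this ticket. This is a rough sketch: the function name `identity_upcall_failed` is hypothetical, and the two patterns are taken only from the output quoted in this thread, so other failure modes would not be detected. The function reads the output on stdin and returns success when a failure message is present.

```shell
# Return 0 if l_getidentity output (read from stdin) contains one of the
# failure messages seen in this ticket, non-zero otherwise.
identity_upcall_failed() {
    grep -q -e 'failed to get identity' -e 'invalid perm'
}
```

On the MDS it could be used as `L_GETIDENTITY_TEST=1 l_getidentity scratch-MDT0000 500 2>&1 | identity_upcall_failed && echo "identity upcall is broken"`.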

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)