
Kerberos: sanity and sanity-krb5 test suites fail on non-root user trying to touch file

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0
    • Severity: 3
    • 9223372036854775807

    Description

      I’m having problems running some test suites with Kerberos enabled. For example, when running sanity.sh on 2.8.60 plus the patch http://review.whamcloud.com/#/c/23600/, the test fails with:

      # ./auster -v -k sanity --only 0a
      Started at Tue Nov  8 14:39:14 UTC 2016
      eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
      Checking servers environments
      Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
      Logging to local directory: /tmp/test_logs/2016-11-08/143914
      Client: Lustre version: 2.8.60_1_g35d09c7
      MDS: Lustre version: 2.8.60_1_g35d09c7
      OSS: Lustre version: 2.8.60_1_g35d09c7
      running: sanity ONLY=0a 
      run_suite sanity /usr/lib64/lustre/tests/sanity.sh
      -----============= acceptance-small: sanity ============----- Tue Nov  8 14:39:21 UTC 2016
      Running: bash /usr/lib64/lustre/tests/sanity.sh
      eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
      Checking servers environments
      Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
      Using TIMEOUT=20
      disable quota as required
      osd-ldiskfs.track_declares_assert=1
      osd-ldiskfs.track_declares_assert=1
      debug=-1
      running as uid/gid/euid/egid 500/500/500/500, groups:
       [touch] [/lustre/scratch/d0_runas_test/f7025]
      touch: cannot touch `/lustre/scratch/d0_runas_test/f7025': Permission denied
       sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
              Please set RUNAS_ID to some UID which exists on MDS and client or
              add user 500:500 on these nodes. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5670:check_runas_id()
        = /usr/lib64/lustre/tests/sanity.sh:126:main()
      Dumping lctl log to /tmp/test_logs/2016-11-08/143914/sanity..*.1478615970.log
      eagle-48vm1: Host key verification failed.
      eagle-48vm1: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      eagle-48vm1: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
      pdsh@eagle-48vm6: eagle-48vm1: ssh exited with exit code 12
      eagle-48vm2: Host key verification failed.
      eagle-48vm2: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      eagle-48vm2: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
      pdsh@eagle-48vm6: eagle-48vm2: ssh exited with exit code 12
      sanity returned 0
      Finished at Tue Nov  8 14:39:31 UTC 2016 in 17s
      ./auster: completed with rc 0
      

      The code that is failing in sanity.sh is

      # $RUNAS_ID may get set incorrectly somewhere else
      [ $UID -eq 0 -a $RUNAS_ID -eq 0 ] && error "\$RUNAS_ID set to 0, but \$UID is also 0!"
      
      check_runas_id $RUNAS_ID $RUNAS_GID $RUNAS
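The failing step in the log is a plain touch inside a scratch directory, so the same probe can be repeated by hand while debugging. The following is a minimal sketch only: probe_write is a hypothetical helper, not a function from test-framework.sh, and it does not switch users itself.

```shell
# Hypothetical helper (not part of test-framework.sh): try to create and
# remove a scratch file in the given directory, similar to the touch that
# failed in the log above. Prints "writable" or "denied".
probe_write() {
    local dir=$1
    local file="$dir/f$$"

    if touch "$file" 2>/dev/null; then
        rm -f "$file"
        echo "writable"
    else
        echo "denied"
        return 1
    fi
}
```

Run as the RUNAS user (UID 500 in the log above) against a directory that user owns on the Lustre mount. Seeing "denied" while the user holds a valid ticket would suggest looking at the server-side GSS or identity setup rather than at the test script.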
      

      UID/GID 500 belongs to sanityusr, who requested a Kerberos ticket before running sanity.sh:

      # su sanityusr
      bash-4.1$ klist
      Ticket cache: FILE:/tmp/krb5cc_500
      Default principal: sanityusr@CO.CFS
      
      Valid starting     Expires            Service principal
      11/08/16 14:38:48  11/09/16 14:38:48  krbtgt/CO.CFS@CO.CFS
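A pre-flight check can confirm that the ticket cache belongs to the expected principal before the suite starts. This is a sketch under the assumption that klist prints a "Default principal:" line as in the output above; klist_principal is a hypothetical helper name.

```shell
# Hypothetical helper: read klist output on stdin and print the default
# principal, or nothing if no "Default principal:" line is present.
klist_principal() {
    sed -n 's/^Default principal: //p'
}
```

For example, `klist | klist_principal` run as sanityusr should print sanityusr@CO.CFS for the cache shown above. MIT Kerberos also provides `klist -s`, which exits non-zero when the cache is missing or expired.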
      

      Note that CO.CFS is the realm being used.

      Since Lustre is not failing, it’s not surprising that there is nothing of interest in dmesg. For example, from the MGS/MDS:

      Lustre: DEBUG MARKER: -----============= acceptance-small: sanity ============----- Tue Nov 8 14:39:21 UTC 2016
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
      

      Logs for this run are at https://testing.hpdd.intel.com/test_sets/b87e2568-a5f8-11e6-964e-5254006e85c2.
      More logs will be attached to this ticket.

      The last time I tested Kerberos and the above tests ran was at tag 2.8.54. Since then, some flags have been added to lsvcgssd. I tried starting lsvcgssd two different ways: the way that worked in 2.8.54, '/usr/sbin/lsvcgssd' on all Lustre servers, and the new recommended way, '/usr/sbin/lsvcgssd -m -g -k -vvv' on the MGS/MDS and '/usr/sbin/lsvcgssd -o -k -vvv' on the OSS (the verbosity flag is optional). All tests were run on RHEL 6.8 (for some reason, Maloo reports it as el6.7).
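The two recommended invocations differ only by server role, which a small wrapper can make explicit. This is a sketch: lsvcgssd_cmd and its role names are hypothetical, and the flags are taken only from the command lines quoted above (the optional -vvv is left out).

```shell
# Hypothetical helper: print the lsvcgssd command line for a server role,
# using only the flags quoted in this ticket.
lsvcgssd_cmd() {
    local role=$1

    case "$role" in
        mgs|mds) echo "/usr/sbin/lsvcgssd -m -g -k" ;;
        oss)     echo "/usr/sbin/lsvcgssd -o -k" ;;
        *)       echo "unknown role: $role" >&2; return 1 ;;
    esac
}
```

Keeping the role-to-flags mapping in one place avoids restarting the daemon with an incomplete option set on one class of server.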

      Confirmation that others are or are not experiencing this problem with these test suites and Kerberos would be helpful.

          Activity

            [LU-8813] Kerberos: sanity and sanity-krb5 test suites fail on non-root user trying to touch file

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25584/
            Subject: LU-8813 gss: limit the number of error messages in logs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4ed67efd13cddd7ec41d29e853601ce862aaae9e

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25584
            Subject: LU-8813 gss: limit the number of error messages in logs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a04aec1e2692cf32aedadee0bf745b657724012

            pjones Peter Jones added a comment -

            Landed for 2.10

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23925/
            Subject: LU-8813 gss: allow svcgssd to start without "-k"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: faf53524cdb90eee45e9425e529a7a6868679c56

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/23925
            Subject: LU-8813 gss: allow svcgssd to start without "-k"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6f63a934d3f771c479f296b203f0f717cfae5313

            adilger Andreas Dilger added a comment -

            Reopening to fix compatibility with existing Kerberos configurations in which svcgssd is not passed the -k option.

            yong.fan nasf (Inactive) added a comment -

            The patch has been landed to master.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23667/
            Subject: LU-8813 utils: l_getidentity compatibility
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d385685a92668241d8c802f33f2e5497d9a7ea5a

            adilger Andreas Dilger added a comment -

            James, please be aware of test interop issues when adding the new flags to the lsvcgssd startup. While I can understand that this may be necessary for SSK testing, it isn't clear to me what the need is for KRB testing. This also highlights that existing user deployments with Kerberos may start to fail in a similar manner, so I'm glad we caught this before release.

            The addition of the "-k" option should not be required for Kerberos configurations, to maintain compatibility with existing environments, IMHO. My thought was to have the lack of "-k" set krb_enabled if neither sk_enabled nor null_enabled is yet set, but print a warning like:

             svcgssd: no "-k", "-s", or "-z" option given.  Assume "-k" for compatibility reasons.

            or similar to avoid confusing error messages.
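The fallback proposed in this comment can be sketched as follows. The real lsvcgssd is a C program, so this shell version only illustrates the decision logic. The variable names and warning text come from the comment; the mapping of "-k", "-s" and "-z" to krb_enabled, sk_enabled and null_enabled is an assumption, and select_gss_mechs is a hypothetical name.

```shell
# Illustration only: not the lsvcgssd implementation.
# Assumed mapping: -k -> krb_enabled, -s -> sk_enabled, -z -> null_enabled.
select_gss_mechs() {
    local krb_enabled=0 sk_enabled=0 null_enabled=0 opt

    OPTIND=1   # reset so the function can be called more than once
    while getopts "ksz" opt; do
        case "$opt" in
            k) krb_enabled=1 ;;
            s) sk_enabled=1 ;;
            z) null_enabled=1 ;;
            *) return 1 ;;
        esac
    done

    # No mechanism requested: fall back to Kerberos and warn on stderr.
    if [ "$krb_enabled" -eq 0 ] && [ "$sk_enabled" -eq 0 ] &&
       [ "$null_enabled" -eq 0 ]; then
        echo 'svcgssd: no "-k", "-s", or "-z" option given.  Assume "-k" for compatibility reasons.' >&2
        krb_enabled=1
    fi

    echo "krb=$krb_enabled sk=$sk_enabled null=$null_enabled"
}
```

With this logic, a pre-existing startup line that passes no mechanism flag still enables Kerberos, while an explicit "-s" or "-z" alone does not.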
            jamesanunez James Nunez (Inactive) added a comment - edited

            I understand the remaining issues, and they are minor issues in the sanity-krb5 test suite. Some tests in sanity-krb5 stop lsvcgssd and restart it with only the '-v' option. Once I added the correct flags to these calls for the MDS and the OSS, the full test suite ran with only one failure. I will upload a patch to modify the calls to lsvcgssd in sanity-krb5.

            Thank you Fan Yong and Andrew (and Andreas) for your quick reaction and comments. Your time and willingness to help are appreciated.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)