Lustre / LU-8813

Kerberos: sanity and sanity-krb5 test suites fail on non-root user trying to touch file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.10.0
    • Severity: 3

      Description

      I’m having problems running some test suites with Kerberos enabled. For example, running sanity.sh on 2.8.60 plus the patch http://review.whamcloud.com/#/c/23600/ fails with:

      # ./auster -v -k sanity --only 0a
      Started at Tue Nov  8 14:39:14 UTC 2016
      eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
      Checking servers environments
      Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
      Logging to local directory: /tmp/test_logs/2016-11-08/143914
      Client: Lustre version: 2.8.60_1_g35d09c7
      MDS: Lustre version: 2.8.60_1_g35d09c7
      OSS: Lustre version: 2.8.60_1_g35d09c7
      running: sanity ONLY=0a 
      run_suite sanity /usr/lib64/lustre/tests/sanity.sh
      -----============= acceptance-small: sanity ============----- Tue Nov  8 14:39:21 UTC 2016
      Running: bash /usr/lib64/lustre/tests/sanity.sh
      eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
      Checking servers environments
      Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
      Using TIMEOUT=20
      disable quota as required
      osd-ldiskfs.track_declares_assert=1
      osd-ldiskfs.track_declares_assert=1
      debug=-1
      running as uid/gid/euid/egid 500/500/500/500, groups:
       [touch] [/lustre/scratch/d0_runas_test/f7025]
      touch: cannot touch `/lustre/scratch/d0_runas_test/f7025': Permission denied
       sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
              Please set RUNAS_ID to some UID which exists on MDS and client or
              add user 500:500 on these nodes. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5670:check_runas_id()
        = /usr/lib64/lustre/tests/sanity.sh:126:main()
      Dumping lctl log to /tmp/test_logs/2016-11-08/143914/sanity..*.1478615970.log
      eagle-48vm1: Host key verification failed.
      eagle-48vm1: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      eagle-48vm1: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
      pdsh@eagle-48vm6: eagle-48vm1: ssh exited with exit code 12
      eagle-48vm2: Host key verification failed.
      eagle-48vm2: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      eagle-48vm2: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
      pdsh@eagle-48vm6: eagle-48vm2: ssh exited with exit code 12
      sanity returned 0
      Finished at Tue Nov  8 14:39:31 UTC 2016 in 17s
      ./auster: completed with rc 0
      

      The code that is failing in sanity.sh is

      # $RUNAS_ID may get set incorrectly somewhere else
      [ $UID -eq 0 -a $RUNAS_ID -eq 0 ] && error "\$RUNAS_ID set to 0, but \$UID is also 0!"
      
      check_runas_id $RUNAS_ID $RUNAS_GID $RUNAS
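
      Judging from the error message in the log, the failing check comes down to creating a file in a scratch directory as the RUNAS user. The following is a minimal sketch of that kind of probe, useful for reproducing the failure by hand. It is not the check_runas_id implementation from test-framework.sh; the function name probe_write_as and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: try to create a file in DIR, optionally through a
# command prefix (for example "sudo -u sanityusr"). This is NOT the
# check_runas_id implementation from test-framework.sh.
probe_write_as() {
    local dir=$1
    shift
    local file="$dir/probe.$$"

    # "$@" holds the optional prefix; with no prefix, touch runs as the caller.
    if "$@" touch "$file" 2>/dev/null; then
        rm -f "$file"
        echo "write to $dir: ok"
        return 0
    fi
    echo "write to $dir: denied or failed"
    return 1
}
```

      For example, probe_write_as /lustre/scratch/d0_runas_test sudo -u sanityusr would repeat the probe as that user, assuming sudo is configured on the client.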
      

      UID/GID 500 belongs to sanityusr, which obtained a Kerberos ticket before sanity.sh was run:

      # su sanityusr
      bash-4.1$ klist
      Ticket cache: FILE:/tmp/krb5cc_500
      Default principal: sanityusr@CO.CFS
      
      Valid starting     Expires            Service principal
      11/08/16 14:38:48  11/09/16 14:38:48  krbtgt/CO.CFS@CO.CFS
      

      Note that CO.CFS is the realm being used.
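
      When checking several nodes, it can help to pull the realm out of the klist output mechanically. The sketch below assumes the MIT klist layout shown above, with a "Default principal:" line; the function name klist_default_realm is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `klist` output on stdin and print the realm of
# the default principal (the text after the last "@").
klist_default_realm() {
    awk -F': ' '/^Default principal:/ {
        n = split($2, parts, "@")
        print parts[n]
        exit
    }'
}
```

      Running klist | klist_default_realm as sanityusr should print CO.CFS for the cache shown above. With MIT Kerberos, klist -s can additionally be used to test, via its exit status, that the cache is readable and not expired.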

      Since Lustre itself is not failing, it’s not surprising that there is nothing of interest in dmesg. For example, from the MGS/MDS:

      Lustre: DEBUG MARKER: -----============= acceptance-small: sanity ============----- Tue Nov 8 14:39:21 UTC 2016
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
      

      Logs for this run are at https://testing.hpdd.intel.com/test_sets/b87e2568-a5f8-11e6-964e-5254006e85c2.
      More logs will be attached to this ticket.

      The last tag at which I tested Kerberos and the above tests ran was 2.8.54. Since then, some flags have been added to lsvcgssd. I tried starting lsvcgssd two different ways: the way that worked in 2.8.54, ‘/usr/sbin/lsvcgssd’ on all Lustre servers, and the new recommended way, ‘/usr/sbin/lsvcgssd -m -g -k -vvv’ on the MGS/MDS and ‘/usr/sbin/lsvcgssd -o -k -vvv’ on the OSS (the verbosity flag is optional). All tests were run on RHEL 6.8 (for some reason, Maloo reports it as el6.7).
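
      To keep the two invocation styles straight when switching between them, the per-role command lines can be captured in a small helper. This is only a sketch: the function name lsvcgssd_cmd and the role labels are hypothetical, and the flag combinations are the ones quoted in this report (without the optional verbosity flag).

```shell
#!/bin/bash
# Hypothetical helper: print the lsvcgssd command line for a node role,
# using only the flag combinations quoted in this report.
lsvcgssd_cmd() {
    local role=$1
    case "$role" in
        mgs_mds) echo "/usr/sbin/lsvcgssd -m -g -k" ;;  # new style, MGS/MDS
        oss)     echo "/usr/sbin/lsvcgssd -o -k" ;;     # new style, OSS
        legacy)  echo "/usr/sbin/lsvcgssd" ;;           # 2.8.54 style, any server
        *)       echo "unknown role: $role" >&2; return 1 ;;
    esac
}
```

      For example, $(lsvcgssd_cmd oss) expands to the OSS command line, so the same script can start the daemon on each server type.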

      It would be helpful if others could confirm whether or not they see this problem with these test suites and Kerberos.

              People

              • Assignee:
                yong.fan nasf (Inactive)
                Reporter:
                jamesanunez James Nunez