Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version: Lustre 2.9.0
Description
I'm having problems running some test suites with Kerberos enabled. For example, running sanity.sh on 2.8.60 plus the patch http://review.whamcloud.com/#/c/23600/ fails with:
# ./auster -v -k sanity --only 0a
Started at Tue Nov 8 14:39:14 UTC 2016
eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
Checking servers environments
Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
Logging to local directory: /tmp/test_logs/2016-11-08/143914
Client: Lustre version: 2.8.60_1_g35d09c7
MDS: Lustre version: 2.8.60_1_g35d09c7
OSS: Lustre version: 2.8.60_1_g35d09c7
running: sanity ONLY=0a
run_suite sanity /usr/lib64/lustre/tests/sanity.sh
-----============= acceptance-small: sanity ============----- Tue Nov 8 14:39:21 UTC 2016
Running: bash /usr/lib64/lustre/tests/sanity.sh
eagle-48vm6.eagle.hpdd.intel.com: Checking config lustre mounted on /lustre/scratch
Checking servers environments
Checking clients eagle-48vm6.eagle.hpdd.intel.com environments
Using TIMEOUT=20
disable quota as required
osd-ldiskfs.track_declares_assert=1
osd-ldiskfs.track_declares_assert=1
debug=-1
running as uid/gid/euid/egid 500/500/500/500, groups:
 [touch] [/lustre/scratch/d0_runas_test/f7025]
touch: cannot touch `/lustre/scratch/d0_runas_test/f7025': Permission denied
 sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
 Please set RUNAS_ID to some UID which exists on MDS and client or
 add user 500:500 on these nodes.
 Trace dump:
 = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
 = /usr/lib64/lustre/tests/test-framework.sh:5670:check_runas_id()
 = /usr/lib64/lustre/tests/sanity.sh:126:main()
Dumping lctl log to /tmp/test_logs/2016-11-08/143914/sanity..*.1478615970.log
eagle-48vm1: Host key verification failed.
eagle-48vm1: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
eagle-48vm1: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
pdsh@eagle-48vm6: eagle-48vm1: ssh exited with exit code 12
eagle-48vm2: Host key verification failed.
eagle-48vm2: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
eagle-48vm2: rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
pdsh@eagle-48vm6: eagle-48vm2: ssh exited with exit code 12
sanity returned 0
Finished at Tue Nov 8 14:39:31 UTC 2016 in 17s
./auster: completed with rc 0
The code that is failing in sanity.sh is:
# $RUNAS_ID may get set incorrectly somewhere else
[ $UID -eq 0 -a $RUNAS_ID -eq 0 ] &&
	error "\$RUNAS_ID set to 0, but \$UID is also 0!"
check_runas_id $RUNAS_ID $RUNAS_GID $RUNAS
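For reference, the failing check can be reproduced standalone with a simplified sketch of what check_runas_id effectively does: attempt a write under the mountpoint as the RUNAS user. This is a hypothetical simplification, not the actual test-framework.sh code; MOUNT defaults to a local stand-in path here, whereas on the real setup it would be /lustre/scratch.

```shell
# Simplified, hypothetical version of the check_runas_id write test.
MOUNT=${MOUNT:-/tmp/lustre-scratch}   # stand-in for /lustre/scratch
RUNAS_ID=${RUNAS_ID:-500}
testdir=$MOUNT/d0_runas_test

mkdir -p "$testdir"
chmod 0755 "$testdir"
# In the framework this touch runs as UID 500 via $RUNAS; with Kerberos
# enabled it is this write that comes back "Permission denied".
if sudo -u "#$RUNAS_ID" touch "$testdir/f$$" 2>/dev/null; then
    echo "check_runas_id: OK for UID $RUNAS_ID"
else
    echo "check_runas_id: FAIL for UID $RUNAS_ID"
fi
```

With Kerberos enabled, this write fails even though the ticket cache for UID 500 is valid, which is the behavior reported above.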
UID/GID 500 belongs to the user sanityusr, and I requested a Kerberos ticket for that user before running sanity.sh:
# su sanityusr
bash-4.1$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: sanityusr@CO.CFS

Valid starting     Expires            Service principal
11/08/16 14:38:48  11/09/16 14:38:48  krbtgt/CO.CFS@CO.CFS
Note that CO.CFS is the realm being used.
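As an illustration only: a Kerberos principal has the form user@REALM, so the default principal above identifies user sanityusr in realm CO.CFS, and the write must be authorized for that identity. Plain shell parameter expansion is enough to split a principal into its parts:

```shell
# Split a Kerberos principal (user@REALM) into its components.
principal="sanityusr@CO.CFS"
user=${principal%@*}    # everything before the last '@'
realm=${principal#*@}   # everything after the first '@'
echo "user=$user realm=$realm"
```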
Since Lustre itself is not failing, it is not surprising that there is nothing of interest in dmesg. For example, from the MGS/MDS:
Lustre: DEBUG MARKER: -----============= acceptance-small: sanity ============----- Tue Nov 8 14:39:21 UTC 2016
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /lustre/scratch/d0_runas_test as UID 500.
Logs for this run are at https://testing.hpdd.intel.com/test_sets/b87e2568-a5f8-11e6-964e-5254006e85c2.
More logs will be attached to this ticket.
The last time I tested Kerberos and these tests passed was at tag 2.8.54. Since then, some flags have been added to lsvcgssd. I tried calling lsvcgssd two different ways: the way that worked in 2.8.54, '/usr/sbin/lsvcgssd' on all Lustre servers, and the newly recommended way, '/usr/sbin/lsvcgssd -m -g -k -vvv' on the MGS/MDS and '/usr/sbin/lsvcgssd -o -k -vvv' on the OSS (verbosity is optional). All tests were run on RHEL 6.8 (for some reason, Maloo reports it as el6.7).
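To keep the two invocations straight per node role, a small helper like the following could be used. This is a sketch only: the ROLE variable and the echo are illustrative (on a real node you would execute the printed command), and the flag sets are taken verbatim from the invocations described above.

```shell
# Pick the lsvcgssd invocation by node role (sketch; ROLE is assumed).
ROLE=${ROLE:-mds}
case $ROLE in
    mds) flags="-m -g -k" ;;  # MGS/MDS invocation from the text
    oss) flags="-o -k" ;;     # OSS invocation from the text
    *)   flags="" ;;          # pre-2.8.54 style: no role flags
esac
echo "would run: /usr/sbin/lsvcgssd ${flags:+$flags }-vvv"
```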
Confirmation that others are or are not experiencing this problem with these test suites and Kerberos would be helpful.