[LU-14443] review-dne-ssk test session failed: Error checking ski of cli2mdt Created: 18/Feb/21  Updated: 25/Feb/21  Resolved: 25/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14424 write performance regression in Lustr... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/86c32e36-1722-48ae-a3e8-efee318e647c

checking cli2mdt...found 0/8 ski connections
checking cli2mdt...found 0/8 ski connections
checking cli2mdt...found 0/8 ski connections
Error checking ski of cli2mdt: expect 8, actual 0
CMD: trevis-211vm7,trevis-211vm8,trevis-211vm9 keyctl show
Session Keyring
 949457656 --alswrv      0     0  keyring: _ses
 739709307 ----s-rv      0     0   \_ user: invocation_id
Session Keyring
 751217567 --alswrv      0     0  keyring: _ses
 890051499 ----s-rv      0     0   \_ user: invocation_id
Session Keyring
 234519632 --alswrv      0     0  keyring: _ses
 157615560 ----s-rv      0     0   \_ user: invocation_id


 Comments   
Comment by Jian Yu [ 18/Feb/21 ]

The failure occurred consistently today. It's blocking patch testing on the master branch.

Comment by Jian Yu [ 18/Feb/21 ]

The review-dne-ssk test session passed on 2021-02-13. There are no Maloo reports between 2021-02-13 and 2021-02-17.
On the master branch, only one commit has landed since 2021-02-13:

  • LU-14424 Revert "LU-9679 osc: simplify osc_extent_find()" (details / gitweb)
Comment by Andreas Dilger [ 18/Feb/21 ]

Is it possible that the clocks are out of sync in the test cluster since the reboot?

LustreError: 25556:0:(gss_keyring.c:1445:gss_kt_update()) negotiation: rpc err 0, gss err d0000
LustreError: 25556:0:(gss_keyring.c:1445:gss_kt_update()) Skipped 520 previous similar messages
Lustre: 25556:0:(sec_gss.c:315:cli_ctx_expire()) ctx 000000002d43236a(0->lustre-OST0004_UUID) get expired: 1613618660(+37s)
Lustre: 25556:0:(sec_gss.c:315:cli_ctx_expire()) Skipped 520 previous similar messages
Lustre: 7722:0:(sec_gss.c:1228:gss_cli_ctx_fini_common()) gss.keyring@00000000e20fbcfe: destroy ctx 000000002d43236a(0->lustre-OST0004_UUID)
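
One quick way to check for skew, as a sketch (node names taken from the keyctl command in the description; adjust as needed):

pdsh -w trevis-211vm[7-9] date +%s

Any spread of more than a few seconds between the returned epoch values would point at clock skew large enough to break GSS context negotiation.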
Comment by Andreas Dilger [ 18/Feb/21 ]

And later in the client logs:

LustreError: 46837:0:(file.c:4747:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
LustreError: 46837:0:(file.c:4747:ll_inode_revalidate_fini()) Skipped 3 previous similar messages
LustreError: 46851:0:(gss_keyring.c:864:gss_sec_lookup_ctx_kr()) failed request key: -126
LustreError: 46851:0:(gss_keyring.c:864:gss_sec_lookup_ctx_kr()) Skipped 7 previous similar messages
LustreError: 46851:0:(sec.c:452:sptlrpc_req_get_ctx()) req 000000009c749cf6: fail to get context
LustreError: 46851:0:(sec.c:452:sptlrpc_req_get_ctx()) 
Comment by Gerrit Updater [ 19/Feb/21 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41695
Subject: LU-14443 test: run review-dne-ssk
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0513a57039e62636dbf9ef29dc15e1d0f33ea294

Comment by Sebastien Buisson [ 19/Feb/21 ]

I have pushed a test patch to see if the failures are due to commit

b592f75446 LU-14424 Revert "LU-9679 osc: simplify osc_extent_find()"

But that is quite unlikely. The failures seem to be environmental and started occurring after the lab power failure. However, I checked on trevis and cannot see out-of-sync clocks.

Comment by James Nunez (Inactive) [ 19/Feb/21 ]

I’ve run review-dne-ssk sessions for patch https://review.whamcloud.com/#/c/40884/ over the past few months as part of validating Lustre on RHEL8.3. The testing shows/implies:

1. The review-dne-ssk issue is not caused by the Linux distro, because the tests run on Feb 4 didn't hit this issue while the tests from Feb 19 do, and both used the same RHEL8.3 kernel version.
Yet, there is a lustre-mr test run that passed review-dne-ssk recently. The MR patch is based on 2.14.0-RC3, but runs on RHEL7.8; https://testing.whamcloud.com/test_sessions/686d5e5a-5798-4403-a6e3-c8445ee8b177.

2. The review-dne-ssk issue is not a Lustre issue because the 40884 patch uses the same parent/version of Lustre from January 27; I did not rebase between Feb 4 and Feb 19. As noted, review-dne-ssk passed on Feb 4 and failed on Feb 19.

That's the data I have and my, possibly faulty, thoughts on this issue.

Comment by Andreas Dilger [ 19/Feb/21 ]

The cause of the failure seems fairly clear in the following Maloo search:
https://testing.whamcloud.com/test_sessions?test_groups%5B%5D=review-dne-ssk&start_date=2021-02-12&end_date=2021-02-17&source=test_sessions#redirect

The test has not passed on RHEL8.3 since 2021-02-17, but passed consistently on RHEL7.8 up to that date.

Comment by Andreas Dilger [ 19/Feb/21 ]

It looks like there were some el8.3 passes until 2021-02-04, but none since then:

https://testing.whamcloud.com/test_sessions?client_distribution_type_id=309da983-628c-4d7a-bd36-5aeee0b55610&test_groups%5B%5D=review-dne-ssk&start_date=2020-12-01&end_date=2021-02-17&source=test_sessions#redirect

Comment by James Nunez (Inactive) [ 20/Feb/21 ]

Update:
review-dne-ssk with RHEL8.3 servers and RHEL7.8 or 7.9 clients failed in both cases; https://testing.whamcloud.com/test_sessions/f1fa92d8-bec3-427f-8662-9b37d3880c81 and https://testing.whamcloud.com/test_sessions/19386d1f-9133-46c8-b47f-a3500c99b168.

RHEL7.8 server/client and RHEL7.9 server/client test sessions both passed with pre-RC Lustre code; https://testing.whamcloud.com/test_sessions/985a61e8-d552-4a69-a9df-76a04a677d4f and https://testing.whamcloud.com/test_sessions/3f446025-55f9-4e7b-95c4-e6de9b772905

I haven't tested 2.14.0 GA, but will submit those tests.

Comment by Andreas Dilger [ 22/Feb/21 ]

From James:

RHEL8.3 clients with 7.8 and 7.9 servers performed better than 8.3 clients with 8.3 servers, but all test suites still fail. With 8.3/8.3, all test suites fail all tests (sanity, recovery-small, sanity-sec); in fact, the test suites don't even run any of the individual tests. All 8.3/7.8 or 8.3/7.9 test suites fail, but they run some individual tests and some of those tests pass. I haven't looked at these results yet, but see https://testing.whamcloud.com/test_sessions/c5ee2ab6-8bf2-4a68-a851-745a371beb55 and https://testing.whamcloud.com/test_sessions/f36771b6-2e6f-4bdd-a8e2-1d9bb5a0dfdb

Comment by Sebastien Buisson [ 22/Feb/21 ]

Thank you guys, this is helpful.

It is as if something in the el8 distro used on the test nodes was "activated" only after the power outage. I still do not know what it is; at the very least, the kernel version before and after is exactly the same.

Thanks to the livedebug test parameter I used for patch https://review.whamcloud.com/41695, I managed to have nodes allocated on Trevis for me, so I am going to try to reproduce this issue. Looking at the logs you pointed to, the first concerning message for me is the client's inability to use the SSK key:

LustreError: 46851:0:(gss_keyring.c:864:gss_sec_lookup_ctx_kr()) failed request key: -126

given that

#define ENOKEY          126     /* Required key not available */
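
For reference, the errno name/message mapping can be confirmed on any of the test nodes with a one-liner (a sketch; any Python 3 on Linux):

python3 -c 'import errno, os; print(errno.errorcode[126], os.strerror(126))'
# ENOKEY Required key not available
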
Comment by Sebastien Buisson [ 22/Feb/21 ]

As a quick update, I tried to reproduce manually with nodes allocated on trevis (trevis-209vm[1-5]), but I have not managed to so far. I was able to properly set up a Lustre file system with SHARED_KEY enabled, using llmount.sh with a cfg file that I have been using regularly on this cluster.

Is there a way to get the cfg file used by Maloo in the lustre-initialization phase of review-dne-ssk test group?

It is also interesting to note that although review-dne-ssk has been failing consistently on RHEL8.3 since the power outage in the lab, review-dne-selinux-ssk is passing without any issue. And as far as I know, the only difference between those 2 test groups is that SELinux is enforced in addition to SSK.

Comment by James Nunez (Inactive) [ 22/Feb/21 ]

> Is there a way to get the cfg file used by Maloo in the lustre-initialization phase of review-dne-ssk test group?

I was talking to Charlie about this and ... Yes, you can view the configuration. The configuration can change based on what test group is running. So, it's best to look at the lustre-initialization test session results in the lustre-initialization-1.autotest log. The environment parameters are displayed in that log starting with the line:

cat /root/autotest_config.sh
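
On a downloaded copy of that log, everything from that line to the end of the config dump can be pulled out with, for example (a sketch; the file name is the autotest log mentioned above):

sed -n '/cat \/root\/autotest_config.sh/,$p' lustre-initialization-1.autotest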

For example, looking at the lustre-initialization autotest log for a failed review-dne-ssk test session (https://testing.whamcloud.com/test_sessions/816df1e1-ea66-4ef9-b388-7e6b41ab67fc), we see:

2021-02-19T15:00:45 cat /root/autotest_config.sh
2021-02-19T15:00:45 #!/bin/bash
2021-02-19T15:00:45 #Auto Generated By Whamcloud Autotest
2021-02-19T15:00:45 #Key Exports
2021-02-19T15:00:45 export mgs_HOST=onyx-44vm4
2021-02-19T15:00:45 export mds_HOST=onyx-44vm4
2021-02-19T15:00:45 export MGSDEV=/dev/lvm-Role_MDS/P1
2021-02-19T15:00:45 export MDSDEV=/dev/lvm-Role_MDS/P1
2021-02-19T15:00:45 export mds1_HOST=onyx-44vm4
2021-02-19T15:00:45 export MDSDEV1=/dev/lvm-Role_MDS/P1
2021-02-19T15:00:45 export mds2_HOST=onyx-44vm5
2021-02-19T15:00:45 export MDSDEV2=/dev/lvm-Role_MDS/P2
2021-02-19T15:00:45 export mds3_HOST=onyx-44vm4
2021-02-19T15:00:45 export MDSDEV3=/dev/lvm-Role_MDS/P3
2021-02-19T15:00:45 export mds4_HOST=onyx-44vm5
2021-02-19T15:00:45 export MDSDEV4=/dev/lvm-Role_MDS/P4
2021-02-19T15:00:45 export MDSCOUNT=4
2021-02-19T15:00:45 export MDSSIZE=2097152
2021-02-19T15:00:45 export MGSSIZE=2097152
2021-02-19T15:00:45 export MDSFSTYPE=ldiskfs
2021-02-19T15:00:45 export MGSFSTYPE=ldiskfs
2021-02-19T15:00:45 export ost_HOST=onyx-44vm3
2021-02-19T15:00:45 export ost1_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV1=/dev/lvm-Role_OSS/P1
2021-02-19T15:00:45 export ost2_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV2=/dev/lvm-Role_OSS/P2
2021-02-19T15:00:45 export ost3_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV3=/dev/lvm-Role_OSS/P3
2021-02-19T15:00:45 export ost4_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV4=/dev/lvm-Role_OSS/P4
2021-02-19T15:00:45 export ost5_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV5=/dev/lvm-Role_OSS/P5
2021-02-19T15:00:45 export ost6_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV6=/dev/lvm-Role_OSS/P6
2021-02-19T15:00:45 export ost7_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV7=/dev/lvm-Role_OSS/P7
2021-02-19T15:00:45 export ost8_HOST=onyx-44vm3
2021-02-19T15:00:45 export OSTDEV8=/dev/lvm-Role_OSS/P8
2021-02-19T15:00:45 # some setup for conf-sanity test 24a, 24b, 33a
2021-02-19T15:00:45 export fs2mds_DEV=/dev/lvm-Role_MDS/S1
2021-02-19T15:00:45 export fs2ost_DEV=/dev/lvm-Role_OSS/S1
2021-02-19T15:00:45 export fs3ost_DEV=/dev/lvm-Role_OSS/S2
2021-02-19T15:00:45 export RCLIENTS="onyx-44vm2"
2021-02-19T15:00:45 export OSTCOUNT=8
2021-02-19T15:00:45 export NETTYPE=tcp
2021-02-19T15:00:45 export OSTSIZE=10051911
2021-02-19T15:00:45 export OSTFSTYPE=ldiskfs
2021-02-19T15:00:45 export FSTYPE=ldiskfs
2021-02-19T15:00:45 export LOGDIR=/autotest/autotest-1/2021-02-19/lustre-reviews_review-dne-ssk_79237_1_103_816df1e1-ea66-4ef9-b388-7e6b41ab67fc
2021-02-19T15:00:45 export SHARED_DIRECTORY=/autotest/autotest-1/2021-02-19/lustre-reviews_review-dne-ssk_79237_1_103_816df1e1-ea66-4ef9-b388-7e6b41ab67fc/shared_dir
2021-02-19T15:00:45 export SHARED_KEY=true
2021-02-19T15:00:45 export PDSH="pdsh -t 120 -S -Rmrsh -w"
2021-02-19T15:00:45 # Adding contents of /opt/autotest/releases/02_08_2021/external/mecturk/mecturk-ncli.sh
2021-02-19T15:00:45 # Entries above here are created by configure_cluster.rb
2021-02-19T15:00:45 # Entries below here come from mecturk-ncli.sh
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # This config file should only contain entries for non-default
2021-02-19T15:00:45 # values that override settings in ncli.sh or local.sh.
2021-02-19T15:00:45 
2021-02-19T15:00:45 VERBOSE=true
2021-02-19T15:00:45 
2021-02-19T15:00:45 # override local.sh as it does not point to the powerman host
2021-02-19T15:00:45 POWER_DOWN=${POWER_DOWN:-"powerman -h powerman --off"}
2021-02-19T15:00:45 POWER_UP=${POWER_UP:-"powerman -h powerman --on"}
2021-02-19T15:00:45 
2021-02-19T15:00:45 # non-standard ports for liblustre TCP connections
2021-02-19T15:00:45 export LNET_ACCEPT_PORT=7988
2021-02-19T15:00:45 export ACCEPTOR_PORT=7988
2021-02-19T15:00:45 
2021-02-19T15:00:45 # Check for wide striping.  This was added to local.sh for 2.6+
2021-02-19T15:00:45 [ $OSTCOUNT -gt 160 -a $MDSFSTYPE = "ldiskfs" ] &&
2021-02-19T15:00:45 	MDSOPT=$MDSOPT" --mkfsoptions='-O large_xattr -J size=4096'"
2021-02-19T15:00:45 
2021-02-19T15:00:45 # TT-430
2021-02-19T15:00:45 SERVER_FAILOVER_PERIOD=$((60 * 20))
2021-02-19T15:00:45 
2021-02-19T15:00:45 export RSYNC_RSH=rsh
2021-02-19T15:00:45 
2021-02-19T15:00:45 cbench_DIR=/usr/bin
2021-02-19T15:00:45 cnt_DIR=/opt/connectathon
2021-02-19T15:00:45 
2021-02-19T15:00:45 # Set-up shell environment for openmpi
2021-02-19T15:00:45 [ -r /etc/profile.d/openmpi.sh ] && . /etc/profile.d/openmpi.sh
2021-02-19T15:00:45 MPIRUN_OPTIONS="-mca boot ssh"
2021-02-19T15:00:45 [ "${NETTYPE}" = 'tcp' ] &&
2021-02-19T15:00:45     MPIRUN_OPTIONS="--mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh"
2021-02-19T15:00:45 
2021-02-19T15:00:45 # the ncli.sh config script includes local.sh in turn.
2021-02-19T15:00:45 . $LUSTRE/tests/cfg/ncli.sh
2021-02-19T15:00:45 (./run_test.sh:53): main
2021-02-19T15:00:45 echo '**************************************************************************************************************'
2021-02-19T15:00:45 **************************************************************************************************************
2021-02-19T15:00:45 (./run_test.sh:54): main
2021-02-19T15:00:45 echo ncli.sh
2021-02-19T15:00:45 ncli.sh
2021-02-19T15:00:45 (./run_test.sh:55): main
2021-02-19T15:00:45 echo '**************************************************************************************************************'
2021-02-19T15:00:45 **************************************************************************************************************
2021-02-19T15:00:45 (./run_test.sh:56): main
2021-02-19T15:00:45 cat /usr/lib64/lustre/tests/cfg/ncli.sh
2021-02-19T15:00:45 . $LUSTRE/tests/cfg/local.sh
2021-02-19T15:00:45 
2021-02-19T15:00:45 # For multiple clients testing, we need use the cfg/ncli.sh config file, and
2021-02-19T15:00:45 # only need specify the "RCLIENTS" variable. The "CLIENTS" and "CLIENTCOUNT"
2021-02-19T15:00:45 # variables are defined in init_clients_lists(), called from cfg/ncli.sh.
2021-02-19T15:00:45 CLIENT1=${CLIENT1:-$(hostname)}
2021-02-19T15:00:45 SINGLECLIENT=$CLIENT1
2021-02-19T15:00:45 RCLIENTS=${RCLIENTS:-""}
2021-02-19T15:00:45 
2021-02-19T15:00:45 init_clients_lists
2021-02-19T15:00:45 
2021-02-19T15:00:45 [ -n "$RCLIENTS" -a "$PDSH" = "no_dsh" ] &&
2021-02-19T15:00:45 	error "tests for remote clients $RCLIENTS needs pdsh != do_dsh " || true
2021-02-19T15:00:45 
2021-02-19T15:00:45 [ -n "$FUNCTIONS" ] && . $FUNCTIONS || true
2021-02-19T15:00:45 
2021-02-19T15:00:45 # for recovery scale tests
2021-02-19T15:00:45 # default boulder cluster iozone location
2021-02-19T15:00:45 export PATH=/opt/iozone/bin:$PATH
2021-02-19T15:00:45 
2021-02-19T15:00:45 LOADS=${LOADS:-"dd tar dbench iozone"}
2021-02-19T15:00:45 for i in $LOADS; do
2021-02-19T15:00:45 	[ -f $LUSTRE/tests/run_${i}.sh ] || error "incorrect load: $i"
2021-02-19T15:00:45 done
2021-02-19T15:00:45 CLIENT_LOADS=($LOADS)
2021-02-19T15:00:45 
2021-02-19T15:00:45 # This is used when testing on SLURM environment.
2021-02-19T15:00:45 # Test will use srun when SRUN_PARTITION is set
2021-02-19T15:00:45 SRUN=${SRUN:-$(which srun 2>/dev/null || true)}
2021-02-19T15:00:45 SRUN_PARTITION=${SRUN_PARTITION:-""}
2021-02-19T15:00:45 SRUN_OPTIONS=${SRUN_OPTIONS:-"-W 1800 -l -O"}
2021-02-19T15:00:45 (./run_test.sh:57): main
2021-02-19T15:00:45 echo '**************************************************************************************************************'
2021-02-19T15:00:45 **************************************************************************************************************
2021-02-19T15:00:45 (./run_test.sh:58): main
2021-02-19T15:00:45 echo local.sh
2021-02-19T15:00:45 local.sh
2021-02-19T15:00:45 (./run_test.sh:59): main
2021-02-19T15:00:45 echo '**************************************************************************************************************'
2021-02-19T15:00:45 **************************************************************************************************************
2021-02-19T15:00:45 (./run_test.sh:60): main
2021-02-19T15:00:45 cat /usr/lib64/lustre/tests/cfg/local.sh
2021-02-19T15:00:45 FSNAME=${FSNAME:-lustre}
2021-02-19T15:00:45 
2021-02-19T15:00:45 # facet hosts
2021-02-19T15:00:45 mds_HOST=${mds_HOST:-$(hostname)}
2021-02-19T15:00:45 mdsfailover_HOST=${mdsfailover_HOST}
2021-02-19T15:00:45 mgs_HOST=${mgs_HOST:-$mds_HOST}
2021-02-19T15:00:45 ost_HOST=${ost_HOST:-$(hostname)}
2021-02-19T15:00:45 ostfailover_HOST=${ostfailover_HOST}
2021-02-19T15:00:45 CLIENTS=""
2021-02-19T15:00:45 # FILESET variable is used by sanity.sh to verify fileset
2021-02-19T15:00:45 # feature, tests should pass even under subdirectory namespace.
2021-02-19T15:00:45 FILESET=${FILESET:-""}
2021-02-19T15:00:45 [[ -z "$FILESET" ]] || [[ "${FILESET:0:1}" = "/" ]] || FILESET="/$FILESET"
2021-02-19T15:00:45 
2021-02-19T15:00:45 TMP=${TMP:-/tmp}
2021-02-19T15:00:45 
2021-02-19T15:00:45 DAEMONSIZE=${DAEMONSIZE:-500}
2021-02-19T15:00:45 MDSCOUNT=${MDSCOUNT:-1}
2021-02-19T15:00:45 MDSDEVBASE=${MDSDEVBASE:-$TMP/${FSNAME}-mdt}
2021-02-19T15:00:45 MDSSIZE=${MDSSIZE:-250000}
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # Format options of facets can be specified with these variables:
2021-02-19T15:00:45 #
2021-02-19T15:00:45 #   - <facet_type>OPT
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # Arguments for "--mkfsoptions" shall be specified with these
2021-02-19T15:00:45 # variables:
2021-02-19T15:00:45 #
2021-02-19T15:00:45 #   - <fstype>_MKFS_OPTS
2021-02-19T15:00:45 #   - <facet_type>_FS_MKFS_OPTS
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # A number of other options have their own specific variables.  See
2021-02-19T15:00:45 # mkfs_opts().
2021-02-19T15:00:45 #
2021-02-19T15:00:45 MDSOPT=${MDSOPT:-}
2021-02-19T15:00:45 MDS_FS_MKFS_OPTS=${MDS_FS_MKFS_OPTS:-}
2021-02-19T15:00:45 MDS_MOUNT_OPTS=${MDS_MOUNT_OPTS:-}
2021-02-19T15:00:45 # <facet_type>_MOUNT_FS_OPTS is the mount options specified when formatting
2021-02-19T15:00:45 # the underlying device by argument "--mountfsoptions"
2021-02-19T15:00:45 MDS_MOUNT_FS_OPTS=${MDS_MOUNT_FS_OPTS:-}
2021-02-19T15:00:45 
2021-02-19T15:00:45 MGSSIZE=${MGSSIZE:-$MDSSIZE}
2021-02-19T15:00:45 MGSOPT=${MGSOPT:-}
2021-02-19T15:00:45 MGS_FS_MKFS_OPTS=${MGS_FS_MKFS_OPTS:-}
2021-02-19T15:00:45 MGS_MOUNT_OPTS=${MGS_MOUNT_OPTS:-}
2021-02-19T15:00:45 MGS_MOUNT_FS_OPTS=${MGS_MOUNT_FS_OPTS:-}
2021-02-19T15:00:45 
2021-02-19T15:00:45 OSTCOUNT=${OSTCOUNT:-2}
2021-02-19T15:00:45 OSTDEVBASE=${OSTDEVBASE:-$TMP/${FSNAME}-ost}
2021-02-19T15:00:45 OSTSIZE=${OSTSIZE:-400000}
2021-02-19T15:00:45 OSTOPT=${OSTOPT:-}
2021-02-19T15:00:45 OST_FS_MKFS_OPTS=${OST_FS_MKFS_OPTS:-}
2021-02-19T15:00:45 OST_MOUNT_OPTS=${OST_MOUNT_OPTS:-}
2021-02-19T15:00:45 OST_MOUNT_FS_OPTS=${OST_MOUNT_FS_OPTS:-}
2021-02-19T15:00:45 OST_INDEX_LIST=${OST_INDEX_LIST:-}
2021-02-19T15:00:45 # Can specify individual ost devs with
2021-02-19T15:00:45 # OSTDEV1="/dev/sda"
2021-02-19T15:00:45 # on specific hosts with
2021-02-19T15:00:45 # ost1_HOST="uml2"
2021-02-19T15:00:45 # ost1_JRN="/dev/sdb1"
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # For ZFS, ost devices can be specified via either or both of the following:
2021-02-19T15:00:45 # OSTZFSDEV1="${FSNAME}-ost1/ost1"
2021-02-19T15:00:45 # OSTDEV1="/dev/sdb1"
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # OST indices can be specified as follows:
2021-02-19T15:00:45 # OSTINDEX1="1"
2021-02-19T15:00:45 # OSTINDEX2="2"
2021-02-19T15:00:45 # OSTINDEX3="4"
2021-02-19T15:00:45 # ......
2021-02-19T15:00:45 # or
2021-02-19T15:00:45 # OST_INDEX_LIST="[1,2,4-6,8]"	# [n-m,l-k,...], where n < m and l < k, etc.
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # The default index value of an individual OST is its facet number minus 1.
2021-02-19T15:00:45 # More specific ones override more general ones. See facet_index().
2021-02-19T15:00:45 
2021-02-19T15:00:45 NETTYPE=${NETTYPE:-tcp}
2021-02-19T15:00:45 MGSNID=${MGSNID:-$(h2nettype $mgs_HOST)}
2021-02-19T15:00:45 
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # Back end file system type(s) of facets can be specified with these
2021-02-19T15:00:45 # variables:
2021-02-19T15:00:45 #
2021-02-19T15:00:45 #   1. <facet>_FSTYPE
2021-02-19T15:00:45 #   2. <facet_type>FSTYPE
2021-02-19T15:00:45 #   3. FSTYPE
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # More specific ones override more general ones.  See facet_fstype().
2021-02-19T15:00:45 #
2021-02-19T15:00:45 FSTYPE=${FSTYPE:-ldiskfs}
2021-02-19T15:00:45 
2021-02-19T15:00:45 LDISKFS_MKFS_OPTS=${LDISKFS_MKFS_OPTS:-}
2021-02-19T15:00:45 ZFS_MKFS_OPTS=${ZFS_MKFS_OPTS:-}
2021-02-19T15:00:45 
2021-02-19T15:00:45 LOAD_MODULES_REMOTE=${LOAD_MODULES_REMOTE:-false}
2021-02-19T15:00:45 
2021-02-19T15:00:45 DEF_STRIPE_SIZE=${DEF_STRIPE_SIZE:-}   # filesystem default stripe size in bytes
2021-02-19T15:00:45 DEF_STRIPE_COUNT=${DEF_STRIPE_COUNT:-} # filesystem default stripe count
2021-02-19T15:00:45 TIMEOUT=${TIMEOUT:-20}
2021-02-19T15:00:45 PTLDEBUG=${PTLDEBUG:-"vfstrace rpctrace dlmtrace neterror ha config \
2021-02-19T15:00:45 		      ioctl super lfsck"}
2021-02-19T15:00:45 SUBSYSTEM=${SUBSYSTEM:-"all"}
2021-02-19T15:00:45 
2021-02-19T15:00:45 # promise 2MB for every cpu
2021-02-19T15:00:45 if [ -f /sys/devices/system/cpu/possible ]; then
2021-02-19T15:00:45     _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*2))
2021-02-19T15:00:45 else
2021-02-19T15:00:45     _debug_mb=$(($(getconf _NPROCESSORS_CONF)*2))
2021-02-19T15:00:45 fi
2021-02-19T15:00:45 
2021-02-19T15:00:45 DEBUG_SIZE=${DEBUG_SIZE:-$_debug_mb}
2021-02-19T15:00:45 
2021-02-19T15:00:45 ENABLE_QUOTA=${ENABLE_QUOTA:-""}
2021-02-19T15:00:45 QUOTA_TYPE=${QUOTA_TYPE:-"ug3"}
2021-02-19T15:00:45 QUOTA_USERS=${QUOTA_USERS:-"quota_usr quota_2usr sanityusr sanityusr1"}
2021-02-19T15:00:45 # "error: conf_param: No such device" issue in every test suite logs
2021-02-19T15:00:45 # sanity-quota test_32 hash_lqs_cur_bits is not set properly
2021-02-19T15:00:45 LQUOTAOPTS=${LQUOTAOPTS:-"hash_lqs_cur_bits=3"}
2021-02-19T15:00:45 
2021-02-19T15:00:45 #client
2021-02-19T15:00:45 MOUNT=${MOUNT:-/mnt/${FSNAME}}
2021-02-19T15:00:45 MOUNT1=${MOUNT1:-$MOUNT}
2021-02-19T15:00:45 MOUNT2=${MOUNT2:-${MOUNT}2}
2021-02-19T15:00:45 MOUNT3=${MOUNT3:-${MOUNT}3}
2021-02-19T15:00:45 # Comma-separated option list used as "mount [...] -o $MOUNT_OPTS [...]"
2021-02-19T15:00:45 MOUNT_OPTS=${MOUNT_OPTS:-"user_xattr,flock"}
2021-02-19T15:00:45 # Mount flags (e.g. "-n") used as "mount [...] $MOUNT_FLAGS [...]"
2021-02-19T15:00:45 MOUNT_FLAGS=${MOUNT_FLAGS:-""}
2021-02-19T15:00:45 DIR=${DIR:-$MOUNT}
2021-02-19T15:00:45 DIR1=${DIR:-$MOUNT1}
2021-02-19T15:00:45 DIR2=${DIR2:-$MOUNT2}
2021-02-19T15:00:45 DIR3=${DIR3:-$MOUNT3}
2021-02-19T15:00:45 
2021-02-19T15:00:45 if [ $UID -ne 0 ]; then
2021-02-19T15:00:45         log "running as non-root uid $UID"
2021-02-19T15:00:45         RUNAS_ID="$UID"
2021-02-19T15:00:45         RUNAS_GID=`id -g $USER`
2021-02-19T15:00:45         RUNAS=""
2021-02-19T15:00:45 else
2021-02-19T15:00:45         RUNAS_ID=${RUNAS_ID:-500}
2021-02-19T15:00:45         RUNAS_GID=${RUNAS_GID:-$RUNAS_ID}
2021-02-19T15:00:45         RUNAS=${RUNAS:-"runas -u $RUNAS_ID -g $RUNAS_GID"}
2021-02-19T15:00:45 fi
2021-02-19T15:00:45 
2021-02-19T15:00:45 PDSH=${PDSH:-no_dsh}
2021-02-19T15:00:45 FAILURE_MODE=${FAILURE_MODE:-SOFT} # or HARD
2021-02-19T15:00:45 POWER_DOWN=${POWER_DOWN:-"powerman --off"}
2021-02-19T15:00:45 POWER_UP=${POWER_UP:-"powerman --on"}
2021-02-19T15:00:45 SLOW=${SLOW:-no}
2021-02-19T15:00:45 FAIL_ON_ERROR=${FAIL_ON_ERROR:-true}
2021-02-19T15:00:45 
2021-02-19T15:00:45 MPIRUN=${MPIRUN:-$(which mpirun 2>/dev/null || true)}
2021-02-19T15:00:45 MPI_USER=${MPI_USER:-mpiuser}
2021-02-19T15:00:45 SHARED_DIR_LOGS=${SHARED_DIR_LOGS:-""}
2021-02-19T15:00:45 MACHINEFILE_OPTION=${MACHINEFILE_OPTION:-"-machinefile"}
2021-02-19T15:00:45 
2021-02-19T15:00:45 # This is used by a small number of tests to share state between the client
2021-02-19T15:00:45 # running the tests, or in some cases between the servers (e.g. lfsck.sh).
2021-02-19T15:00:45 # It needs to be a non-lustre filesystem that is available on all the nodes.
2021-02-19T15:00:45 SHARED_DIRECTORY=${SHARED_DIRECTORY:-$TMP}	# bug 17839 comment 65
2021-02-19T15:00:45 
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # In order to test multiple remote HSM agents, a new facet type named "AGT" and
2021-02-19T15:00:45 # the following associated variables are added:
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # AGTCOUNT: number of agents
2021-02-19T15:00:45 # AGTDEV{N}: target HSM mount point (root path of the backend)
2021-02-19T15:00:45 # agt{N}_HOST: hostname of the agent agt{N}
2021-02-19T15:00:45 # SINGLEAGT: facet of the single agent
2021-02-19T15:00:45 #
2021-02-19T15:00:45 # Please refer to init_agt_vars() in sanity-hsm.sh for the default values of
2021-02-19T15:00:45 # these variables.
2021-02-19T15:00:45 #
Comment by Sebastien Buisson [ 23/Feb/21 ]

This tip is really helpful, thanks.

I looked for differences between the config files used in the following three test cases:

  • review-dne-ssk on CentOS 8.3: SSK setup fails
  • review-dne-ssk on CentOS 7.9: SSK setup succeeds
  • review-dne-selinux-ssk on CentOS 8.3: SSK setup succeeds

The config files are almost identical, the only relevant difference being the PDSH variable. For review-dne-ssk on both CentOS 7.9 and CentOS 8.3, it is:

PDSH="pdsh -t 120 -S -Rmrsh -w"

For review-dne-selinux-ssk on CentOS 8.3, it is:

PDSH="pdsh -t 120 -S -Rssh -w"

So the combo for SSK failure is CentOS 8.3 + mrsh.

I pushed a patch to have review-dne-ssk run on CentOS 8.3 with env=PDSH="pdsh -t 120 -S -Rssh -w", but as can be seen in the test logs at https://testing.whamcloud.com/test_sessions/e0bc0219-f623-45b4-954e-fe4b03de7b93, this value gets overwritten by the default one, so the run is not conclusive:

2021-02-23T10:53:25 export PDSH="pdsh -t 120 -S -Rssh -w"
2021-02-23T10:53:25 export PDSH="pdsh -t 120 -S -Rmrsh -w"

However, I managed to reproduce the review-dne-ssk failure manually on trevis just by using mrsh instead of ssh: if I set PDSH="pdsh -t 120 -S -Rssh -w" in my cfg file, SSK is set up properly by llmount.sh, but if I set PDSH="pdsh -t 120 -S -Rmrsh -w" in my cfg file, then it fails.
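
A way to check this directly, as a sketch (assuming the relevant difference is the session keyring each rcmd module gives to remote commands; node name from the manual reproducer above):

pdsh -Rssh -w trevis-209vm1 keyctl show @s
pdsh -Rmrsh -w trevis-209vm1 keyctl show @s

If the mrsh side shows no usable session keyring, the request-key path in gss_sec_lookup_ctx_kr() would have no keyring to attach the SSK key to, consistent with the -126 (ENOKEY) errors above.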

I have opened ATM-1962 to request a switch from mrsh to ssh for the pdsh rcmd module, but I do not know whether colmstea or leonel8a have a way to trigger a review-dne-ssk test with this change beforehand, in order to confirm that it fixes the problem.

Comment by Sebastien Buisson [ 25/Feb/21 ]

Now that Charlie has landed the fix for ATM-1962, review-dne-ssk passes successfully. I think this ticket can be closed.

Comment by Andreas Dilger [ 25/Feb/21 ]

Do we need to re-enable the review-dne-ssk series as enforced again?

Comment by Charlie Olmstead [ 25/Feb/21 ]

I re-enabled it as enforced yesterday morning.
