[LU-14153] sanity test_280: "mount client failed" on review-dne-ssk review-dne-selinux-ssk Created: 27/Nov/20  Updated: 25/Jan/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: SSK

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4a0c244b-cc64-455a-9d41-a8cf9790407c

test_280 failed with the following error:

== sanity test 280: Race between MGS umount and client llog processing =====
10.9.6.22@tcp:/lustre /mnt/lustre lustre rw,seclabel,flock,user_xattr,lazystatfs,noencrypt 0 0
CMD: trevis-45vm1.trevis.whamcloud.com grep -c /mnt/lustre' ' /proc/mounts
Stopping client trevis-45vm1.trevis.whamcloud.com /mnt/lustre (opts:)
CMD: trevis-45vm1.trevis.whamcloud.com lsof -t /mnt/lustre
CMD: trevis-45vm1.trevis.whamcloud.com umount  /mnt/lustre 2>&1
CMD: trevis-25vm1.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,skpath=/tmp/test-framework-keys trevis-25vm4@tcp:/lustre /mnt/lustre
:
Starting client: trevis-25vm1.trevis.whamcloud.com:  -o user_xattr,flock,skpath=/tmp/test-framework-keys trevis-25vm4@tcp:/lustre /mnt/lustre
CMD: trevis-25vm1.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-25vm1.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,skpath=/tmp/test-framework-keys trevis-25vm4@tcp:/lustre /mnt/lustre
mount.lustre: according to /etc/mtab trevis-25vm4@tcp:/lustre is already mounted on /mnt/lustre
 sanity test_280: @@@@@@ FAIL: mount client failed 

It looks like this has been failing intermittently since 2020-07-30 (about 35 times over 4 months) in review-dne-ssk and review-dne-selinux-ssk sessions. Note that it does not happen in review-dne-selinux sessions, and the few failures in other test sessions appear to be related to many earlier tests also failing.

It may just be a test script issue (e.g. SSK makes the client mount slower, so the race that the subtest is trying to trigger plays out differently).

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_280 - mount client failed



 Comments   
Comment by Andreas Dilger [ 27/Nov/20 ]

Comparing a passing and a failing test run, the failing run has an additional "Starting client:" step, which leads to the "already mounted" error. In the passing run the client already has a mounted filesystem when mount_client() is called, so it does not try to mount it again:

diff -u /tmp/passed /tmp/failed
--- /tmp/passed	2020-11-26 22:38:25.000000000 -0700
+++ /tmp/failed	2020-11-26 22:38:26.000000000 -0700
@@ -33,6 +33,9 @@
 pdsh@trevis-10vm1: trevis-10vm4: ssh exited with exit code 1
 CMD: trevis-10vm4 e2label /dev/mapper/mds1_flakey 2>/dev/null
 Started lustre-MDT0000
-client@tcp:/lustre /mnt/lustre lustre rw,seclabel,flock,user_xattr,lazystatfs,noencrypt 0 0
-Resetting fail_loc on all nodes...CMD: trevis-10vm1.trevis.whamcloud.com,trevis-10vm2,trevis-10vm3,trevis-10vm4,trevis-10vm5 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
-done.
+Starting client: trevis-10vm1.trevis.whamcloud.com:  -o user_xattr,flock,skpath=/tmp/test-framework-keys trevis-10vm4@tcp:/lustre /mnt/lustre
+CMD: trevis-10vm1.trevis.whamcloud.com mkdir -p /mnt/lustre
+CMD: trevis-10vm1.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,skpath=/tmp/test-framework-keys trevis-10vm4@tcp:/lustre /mnt/lustre
+mount.lustre: according to /etc/mtab trevis-10vm4@tcp:/lustre is already mounted on /mnt/lustre
+ sanity test_280: @@@@@@ FAIL: mount client failed 

This definitely seems like a race in the test: the failed run does not detect the mount in mount_client(), but mount.lustre finds it in /etc/mtab later, when zconf_mount() tries to mount.
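
One possible way to narrow this window would be to poll the mount table briefly before deciding that the client is unmounted. The sketch below is only an illustration under assumptions: is_mounted_in() and wait_mounted_in() are hypothetical helpers, not functions from test-framework.sh, and the mount table path is a parameter (defaulting to /proc/mounts) so the logic can be exercised against a plain file.

```shell
# Hypothetical helper (not part of test-framework.sh): succeed if the
# mount point appears as the second field of a mounts table.
is_mounted_in() {
	local mntpt="$1"
	local table="${2:-/proc/mounts}"

	awk -v m="$mntpt" '$2 == m { found = 1 } END { exit !found }' "$table"
}

# Hypothetical helper: poll once per second until the mount point shows
# up or the timeout (in seconds) expires. Returns 0 if the mount point
# was seen, non-zero otherwise.
wait_mounted_in() {
	local mntpt="$1"
	local table="${2:-/proc/mounts}"
	local timeout="${3:-5}"
	local i=0

	while [ "$i" -lt "$timeout" ]; do
		is_mounted_in "$mntpt" "$table" && return 0
		sleep 1
		i=$((i + 1))
	done
	# One final check after the last sleep.
	is_mounted_in "$mntpt" "$table"
}
```

A caller could run something like "wait_mounted_in /mnt/lustre /proc/mounts 5" and only fall through to a fresh mount if it fails. Whether a delay of this kind actually fixes test_280, or only hides the race, has not been verified here.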

Comment by Artem Blagodarenko (Inactive) [ 11/Dec/20 ]

+1 https://testing.whamcloud.com/test_sets/2f812d63-e312-49fe-a7e7-67df1e801949

Comment by Emoly Liu [ 20/Jan/21 ]

+1 on master: https://testing.whamcloud.com/test_sets/32b28fef-bddf-422b-9c73-e677bce9cc50

Generated at Sat Feb 10 03:07:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.