[LU-12171] sanity test_133g: Timeout occurred after 161 mins Created: 08/Apr/19  Updated: 19/Apr/19  Resolved: 18/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-12175 sanity test 208 fails with 'lease bro... Reopened
Related
is related to LU-12210 Test failures associated with DNE ind... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/43fba6f6-5a10-11e9-92fe-52540065bddc

test_133g failed with the following error:

Timeout occurred after 161 mins, last suite running was sanity, restarting cluster to continue tests

trevis-14vm9, which hosts MDT0000, becomes unreachable, and the situation is never recovered. From the MDS's standpoint the client was evicted, but the client never managed to reconnect.
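For reference, a minimal sketch of how the client-side view of the MDT connection could be checked in a situation like this, using standard lctl parameters (the device name pattern lustre-MDT0000-mdc-* is an assumption about this test configuration, not taken from the logs):

# On the client: show the import state of the MDC devices
# (state is FULL when connected, DISCONN/EVICTED otherwise).
lctl get_param mdc.*.import | grep -E 'target|state'

# Connection state history for the MDT0000 import
# (device name pattern is an assumption for this setup).
lctl get_param mdc.lustre-MDT0000-mdc-*.state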

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_133g - Timeout occurred after 161 mins, last suite running was sanity, restarting cluster to continue tests



Comments
Comment by Gu Zheng (Inactive) [ 15/Apr/19 ]

Another instance:

https://testing.whamcloud.com/test_sessions/45917329-c33d-4929-9393-69b342c5f0eb

Comment by Patrick Farrell (Inactive) [ 17/Apr/19 ]

https://testing.whamcloud.com/test_sessions/d70e90ee-cbfd-426e-b0b7-e82cad8bcfd4

Comment by Patrick Farrell (Inactive) [ 17/Apr/19 ]

So I took a look here...

mds1 is getting failed over in all of the failure cases.

In the cleanup for this test, we check whether mds1 is active on the expected server, and if it is not, we fail it over. That's fine...

In the timeout cases, mds1 is not showing as active, so it gets failed over - but it is failed over to the same VM it started from:

cln..Failing mds1 on onyx-43vm9
CMD: onyx-43vm9 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on onyx-43vm9
CMD: onyx-43vm9 umount -d /mnt/lustre-mds1
CMD: onyx-43vm9 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
CMD: onyx-43vm9 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			grep -q ^lustre-mdt1/ /proc/mounts ||
			zpool export  lustre-mdt1
reboot facets: mds1
Failover mds1 to onyx-43vm9 
13:57:29 (1554731849) waiting for trevis-14vm9 network 900 secs ...
13:57:29 (1554731849) network interface is UP
CMD: trevis-14vm9 hostname
mount facets: mds1

This is quite odd.  I can't see why this would happen...  But it seems reasonable that self-failover like this, which isn't intended, might confuse something.  (Can't figure out what yet.)
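For context, here is a minimal sketch of the kind of cleanup check described above, with helper names (facet_host, facet_active_host, fail) modeled on Lustre's test-framework.sh; it is an illustration of the logic, not the actual cleanup code:

# Sketch: during cleanup, compare the host a facet is configured for
# with the host it is currently active on; if they differ (or the
# facet is not showing as active), fail it over.
# Helper names are assumptions modeled on test-framework.sh.
for facet in mds1; do
    expected=$(facet_host "$facet")        # configured host for the facet
    active=$(facet_active_host "$facet")   # host it is currently active on
    if [ "$expected" != "$active" ]; then
        echo "Failing $facet on $active"
        fail "$facet"                      # umount, reboot facet, remount
    fi
done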

jamesanunez:

This started on April 8th, and is limited to DNE testing.  Given what we're seeing elsewhere, I'd lay money this is also fallout from LU-11636.

Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ]

Dupe of LU-12175
