[LU-12171] sanity test_133g: Timeout occurred after 161 mins Created: 08/Apr/19 Updated: 19/Apr/19 Resolved: 18/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: | duplicates LU-12175 |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/43fba6f6-5a10-11e9-92fe-52540065bddc

test_133g failed with the following error:

Timeout occurred after 161 mins, last suite running was sanity, restarting cluster to continue tests

trevis-14vm9, which hosts MDT0000, becomes unreachable, and this situation cannot be recovered. From the MDS standpoint, the client was evicted, but the client did not manage to reconnect. |
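For context, a minimal sketch of how the eviction/reconnect state described above can be inspected, assuming a stock Lustre client and MDS with lctl available; the parameter paths are standard, but the exact output fields can vary by version:

# On the client: show the connection state of each MDC import.
# A reconnected import reports "state: FULL"; an evicted client that
# never managed to reconnect typically stays in DISCONN/EVICTED.
lctl get_param mdc.*.import | grep -E 'target:|state:'

# On the MDS (trevis-14vm9 in this run): evictions usually show up in
# the kernel log.
dmesg | grep -i evict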
| Comments |
| Comment by Gu Zheng (Inactive) [ 15/Apr/19 ] |
|
another instance: https://testing.whamcloud.com/test_sessions/45917329-c33d-4929-9393-69b342c5f0eb |
| Comment by Patrick Farrell (Inactive) [ 17/Apr/19 ] |
|
https://testing.whamcloud.com/test_sessions/d70e90ee-cbfd-426e-b0b7-e82cad8bcfd4 |
| Comment by Patrick Farrell (Inactive) [ 17/Apr/19 ] |
|
So I took a look here... mds1 is getting failed over in all of the failure cases. In the cleanup for this, we check if mds1 is active on the expected server, and if it's not, we fail it. That's fine... In the timeout cases, mds1 is not showing as active, and so it's getting failed over - but it's getting failed over to the same VM it started from:

cln..Failing mds1 on onyx-43vm9
CMD: onyx-43vm9 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on onyx-43vm9
CMD: onyx-43vm9 umount -d /mnt/lustre-mds1
CMD: onyx-43vm9 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
CMD: onyx-43vm9 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 || grep -q ^lustre-mdt1/ /proc/mounts || zpool export lustre-mdt1
reboot facets: mds1
Failover mds1 to onyx-43vm9
13:57:29 (1554731849) waiting for trevis-14vm9 network 900 secs ...
13:57:29 (1554731849) network interface is UP
CMD: trevis-14vm9 hostname
mount facets: mds1

This is quite odd. I can't see why this would happen... But it seems reasonable that self-failover like this, which isn't intended, might confuse something. (Can't figure out what yet.) This started on April 8th, and is limited to DNE testing. Given what we're seeing elsewhere, I'd lay money this is also fallout from |
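For reference, a minimal sketch of the cleanup check being described, with helper names (facet_host, facet_active_host, fail) modeled on Lustre's test-framework.sh conventions; treat the names and exact behavior as assumptions rather than the actual test-framework code:

#!/bin/bash
# Illustrative sketch only: helper names below are assumptions loosely
# based on Lustre's test-framework.sh, not the real cleanup code.

cleanup_check_mds1() {
    local facet=mds1
    local expected=$(facet_host $facet)        # node the facet is configured to run on
    local active=$(facet_active_host $facet)   # node it currently appears active on

    # If mds1 is not active where we expect it, fail it over.
    # The bug described above: when the failover target resolves to the
    # same VM the facet started on, this becomes an unintended
    # self-failover (stop, "reboot" the facet, remount in place).
    if [[ "$active" != "$expected" ]]; then
        echo "Failing $facet on $active"
        fail $facet
    fi
}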
| Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ] |
|
Dupe of LU-12175 |