[LU-7118] sanity-scrub: No sub tests failed in this test set Created: 08/Sep/15 Updated: 22/Sep/15 Resolved: 22/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

I've seen a lot of sanity-scrub instances fail entirely lately. It looks like some kind of TEI issue to me, as it shows up in test runs on many different and unrelated mods. No logs are collected, and the summary always says:

Failed subtests: No sub tests failed in this test set.
All subtests: This test set does not have any sub tests.

Maybe something really bad landed that blocks any sanity-scrub from running.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/aaf64806-5682-11e5-a9bc-5254006e85c2. |
| Comments |
| Comment by Andreas Dilger [ 09/Sep/15 ] |
|
This can often happen if there was a problem during cleanup/unmount/module unload after all the subtests have finished. |
| Comment by James Nunez (Inactive) [ 09/Sep/15 ] |
|
In the suite_stdout logs from the failure for this ticket:

23:31:38:shadow-40vm8: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown

There are no other logs to see what the actual problem with the OSS might be. It looks like the previous test suite, sanity-quota, completed with no errors. This failure matches the one in the linked ticket. There are many instances of sanity-scrub failing; here are a few on review-dne-part-2: |
| Comment by Bob Glossman (Inactive) [ 09/Sep/15 ] |
|
another on master: |
| Comment by Sarah Liu [ 09/Sep/15 ] |
|
I have seen a similar issue on tip of master with a RHEL 7.1 server. Not sure if they are the same. |
| Comment by Bob Glossman (Inactive) [ 09/Sep/15 ] |
|
Sarah, I think your failure is the same issue. Having the console logs misplaced may be a good thing in this case: it preserved them and lets us see info not captured during failures on el6. |
| Comment by Joseph Gmitter (Inactive) [ 10/Sep/15 ] |
|
Can you please have a look at this issue and dig into it further to get more info on what may be happening? |
| Comment by Andreas Dilger [ 10/Sep/15 ] |
|
There are no console logs available to show what is happening. The MDT and OST were just reformatted and mounted, so it is strange that there is an error. |
| Comment by Jian Yu [ 10/Sep/15 ] |
|
The first occurrence of this failure is in report https://testing.hpdd.intel.com/test_sets/aaf64806-5682-11e5-a9bc-5254006e85c2 of patch http://review.whamcloud.com/16315, which is based on commit 01ca899324738343279c1d63823b7fab937197dc. |
| Comment by Peter Jones [ 10/Sep/15 ] |
|
Yang Sheng, could you please investigate? Thanks, Peter |
| Comment by Jian Yu [ 10/Sep/15 ] |
|
This report https://testing.hpdd.intel.com/test_sets/5bd4b8c2-56cb-11e5-84d0-5254006e85c2 (failed on 2015-09-09) of http://review.whamcloud.com/16129 patch set 4 is based on commit 1f4d68d334c85a8106f5939351991b80449e5713.

Is the issue related to the recent RHEL 6.7 change in the autotest system? |
| Comment by Sarah Liu [ 10/Sep/15 ] |
|
If this is a dup of the RHEL 7.1 issue I mentioned above, we can track them together. |
| Comment by Bob Glossman (Inactive) [ 10/Sep/15 ] |
|
The misplaced console logs in the RHEL 7.1 failure show only:

17:23:00:[ 9529.925038] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: sanity-scrub ============----- Fri Sep 4 17:22:53 UTC 2015
17:23:23:[ 9530.121771] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-scrub ============----- Fri Sep 4 17:22:53 UTC 2015
17:23:23:[ 9530.645738] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-scrub test complete, duration -o sec ======================================================= 17:22:53 \(1441387373\)
17:23:23:[ 9530.809134] Lustre: DEBUG MARKER: == sanity-scrub test complete, duration -o sec ======================================================= 17:22:53 (1441387373)

Nothing useful there that I can see. |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Hmm, I thought the OSS console logs would tell us more. They show:

17:23:31:[ 9568.305881] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
17:23:31:[ 9568.461183] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache
17:23:31:[ 9568.480543] Lustre: Evicted from MGS (at 10.1.4.162@tcp) after server handle changed from 0xd1a6ba6be92fc62 to 0xd1a6ba6be9adf3a
17:23:31:[ 9568.482736] LustreError: 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?
17:23:31:[ 9568.482991] Lustre: MGC10.1.4.162@tcp: Connection restored to MGS (at 10.1.4.162@tcp)
17:23:31:[ 9568.482993] Lustre: Skipped 6 previous similar messages
17:23:31:[ 9568.485396] LustreError: 14130:0:(obd_mount_server.c:1794:server_fill_super()) Unable to start targets: -108
17:23:31:[ 9568.486579] LustreError: 14130:0:(obd_mount_server.c:1509:server_put_super()) no obd lustre-OST0000
17:23:31:[ 9568.487556] LustreError: 14130:0:(obd_mount_server.c:137:server_deregister_mount()) lustre-OST0000 not registered
17:23:31:[ 9568.620431] Lustre: server umount lustre-OST0000 complete

It looks like the MGC somehow survived the OST umount/remount process. Since MDT0 was reformatted, this surviving MGC caused the mount to fail. (Or there might be some race here; sigh, there are no further logs.)

17:23:31:[ 9568.480543] Lustre: Evicted from MGS (at 10.1.4.162@tcp) after server handle changed from 0xd1a6ba6be92fc62 to 0xd1a6ba6be9adf3a

Sigh, nothing on the MDS console log. |
| Comment by Di Wang [ 10/Sep/15 ] |
|
See, the MGC even starts to reconnect before the first OST is mounted. If the timestamps in the console log are correct (are they?), that means the MGC is left over from the last umount, which is a bit strange.

17:23:31:[ 9567.529080] Lustre: 19295:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1441387392/real 1441387392] req@ffff88005deb1e00 x1511395716989360/t0(0) o250->MGC10.1.4.162@tcp@10.1.4.162@tcp:26/25 lens 520/544 e 0 to 1 dl 1441387403 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:23:31:[ 9567.531521] Lustre: 19295:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
17:23:31:[ 9567.570845] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P1
17:23:31:[ 9567.913313] Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
17:23:31:[ 9568.305881] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
17:23:31:[ 9568.461183] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache
17:23:31:[ 9568.480543] Lustre: Evicted from MGS (at 10.1.4.162@tcp) after server handle changed from 0xd1a6ba6be92fc62 to 0xd1a6ba6be9adf3a
17:23:31:[ 9568.482736] LustreError: 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?
17:23:31:[ 9568.482991] Lustre: MGC10.1.4.162@tcp: Connection restored to MGS (at 10.1.4.162@tcp) |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Because this blocks almost all patches on master, I changed it to a blocker. |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Ah, there are 7 OSTs in the test config, but it seems sanity-scrub.sh only stops 4 OSTs, then does the reformat & restart; that is why the MGC is left over and causes all this trouble. From sanity-scrub.sh:

# no need too much OSTs, to reduce the format/start/stop overhead
[ $OSTCOUNT -gt 4 ] && OSTCOUNT=4
MOUNT_2=""
# build up a clean test environment.
formatall
setupall
Probably we only need to fix the test script here.
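To illustrate the theory, here is a minimal sketch of the suspected failure mode, assuming stopall() in test-framework.sh iterates over the $OSTCOUNT environment variable (a simplified sketch, not the verbatim framework code):

# sanity-scrub.sh caps the count before reformatting:
[ $OSTCOUNT -gt 4 ] && OSTCOUNT=4      # but the config has 7 OSTs

# cleanup later iterates the reduced count:
stopall_sketch() {
    local num
    for num in $(seq $OSTCOUNT); do    # only ost1..ost4
        stop ost$num -f                # ost5..ost7 keep running, so
    done                               # their MGC import survives
}

If ost5..ost7 are never stopped, the node keeps its MGC import across the reformat, matching the eviction seen in the console log above. |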
| Comment by Gerrit Updater [ 10/Sep/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16366 |
| Comment by Bob Glossman (Inactive) [ 10/Sep/15 ] |
|
I like your theory, but it looks like that line reducing OSTCOUNT to 4 has been in there for over a year. Why is it only a problem recently? |
| Comment by Andreas Dilger [ 10/Sep/15 ] |
|
I was going to ask the same question... I don't see any changes to sanity-scrub.sh or test-framework.sh recently. |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Well, this only causes a problem when the OST remount (after reformat) happens after the leftover MGC has been evicted; otherwise we cannot see this problem. (Hmm, it seems we need an extra check for the MGC. Or do we allow the OST to use the old MGC in this case? It seems to cause no problem as long as it has not been evicted.) I guess the recent change just prolonged the time cost of reformat & restart? Apart from the Lustre change, did we change the test environment recently? |
| Comment by Yang Sheng [ 11/Sep/15 ] |
|
I think we need someone from the TEI team to investigate why the test logs are absent and why sanity-scrub doesn't run at all but just returns immediately. I have verified the test script runs well on a shadow cluster. Do we have any documents about how the autotest system works? |
| Comment by Minh Diep [ 11/Sep/15 ] |
|
This is what I see:

Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/ost1

The question here is why we can't mount the OST. Why does the OST disconnect due to transport endpoint shutdown? |
| Comment by Gerrit Updater [ 11/Sep/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16366/ |
| Comment by Andreas Dilger [ 11/Sep/15 ] |
|
This definitely seems like a regression that landed recently on master, since sanity-scrub has had the

[ $OSTCOUNT -gt 4 ] && OSTCOUNT=4

line since commit v2_5_58_0-41-g1dbba32, and there are no failures in b2_7 testing. It might be useful to run a series of tests with different commits, going back once per day, running review-ldiskfs multiple times (if this is possible):

Test-Parameters: fortestonly testgroup=review-ldiskfs
Test-Parameters: fortestonly testgroup=review-ldiskfs
Test-Parameters: fortestonly testgroup=review-ldiskfs
Test-Parameters: fortestonly testgroup=review-ldiskfs

Since sanity-scrub only fails about 50% of the time, this would give us a 94% chance of catching the regression patch at each stage. The earliest failure I see outside RHEL 7.1 testing is from 2015-09-08 (http://review.whamcloud.com/16315, based on commit 01ca8993247383).
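For the record, the arithmetic behind the 94% estimate: if each run reproduces the failure with probability 0.5, then four independent runs all miss it with probability 0.5^4 = 6.25%, so the chance of catching it at each stage is 1 - 0.5^4 = 93.75%, roughly 94%. |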
| Comment by Joseph Gmitter (Inactive) [ 15/Sep/15 ] |
|
Landed for 2.8. |
| Comment by Yang Sheng [ 17/Sep/15 ] |
|
https://testing.hpdd.intel.com/test_sets/bdf2ffa6-5cfb-11e5-945a-5254006e85c2

Looks like the last patch still has not resolved this issue, but it now occurs much less frequently. |
| Comment by Di Wang [ 17/Sep/15 ] |
|
Looks like still the same problem: there are 7 OSTs, and only 4 OSTs are stopped.

CMD: shadow-49vm3 grep -c /mnt/mds1' ' /proc/mounts
Stopping /mnt/mds1 (opts:-f) on shadow-49vm3
CMD: shadow-49vm3 umount -d -f /mnt/mds1
CMD: shadow-49vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm7 grep -c /mnt/mds2' ' /proc/mounts
Stopping /mnt/mds2 (opts:-f) on shadow-49vm7
CMD: shadow-49vm7 umount -d -f /mnt/mds2
CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm7 grep -c /mnt/mds3' ' /proc/mounts
Stopping /mnt/mds3 (opts:-f) on shadow-49vm7
CMD: shadow-49vm7 umount -d -f /mnt/mds3
CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm7 grep -c /mnt/mds4' ' /proc/mounts
Stopping /mnt/mds4 (opts:-f) on shadow-49vm7
CMD: shadow-49vm7 umount -d -f /mnt/mds4
CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost1
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost2' ' /proc/mounts
Stopping /mnt/ost2 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost2
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost3' ' /proc/mounts
Stopping /mnt/ost3 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost3
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost4' ' /proc/mounts
Stopping /mnt/ost4 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost4
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '

I guess we need to find out where OSTCOUNT has been changed? Or probably stopall should use "lov.xxxxx.numobd" instead of the env var?
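As a rough illustration of that second suggestion, here is a hedged sketch of deriving the OST count from the live LOV state instead of trusting $OSTCOUNT; the "lov.*.numobd" parameter pattern follows the suggestion above, and the helper names are hypothetical, not the actual fix:

# Sketch: ask the LOV how many OSTs actually exist, rather than
# trusting the (possibly modified) $OSTCOUNT environment variable.
real_ost_count() {
    lctl get_param -n "lov.*.numobd" 2>/dev/null | head -n1
}

stopall_by_numobd() {
    local count
    count=$(real_ost_count)
    [ -n "$count" ] || count=$OSTCOUNT   # fall back to the env var
    local num
    for num in $(seq "$count"); do
        stop ost$num -f
    done
}

This would keep cleanup in sync with whatever was actually mounted, even if a test script shrank OSTCOUNT after setup. |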
| Comment by Gerrit Updater [ 18/Sep/15 ] |
|
Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/16483 |
| Comment by Peter Jones [ 22/Sep/15 ] |
|
The original regularly occurring failure is fixed. The less frequent occasional failure is being dealt with under a separate ticket. |